A performance problem with u64 mod [Solved]


While looking for performance troubles in my code, I think I have found a cause, the % done on a u64 constant. This is a synthetic Rust benchmark:

fn main() {
    const M: u64 = 10_000_000_000_000_000;

    let mut tot = 0;
    for i in 1_000_000_000_000_000_000 .. 1_000_000_000_080_000_000 {
        tot += i % M;
    println!("{}", tot);

Rustc gives me asm like this, it uses divq and unrolls the loop four times:

    movabsq    $1000000000000000000, %r8
    movq    $0, 32(%rsp)
    movabsq    $10000000000000000, %r10
    xorl    %ecx, %ecx
    leaq    80000000(%r8), %r11
    .p2align    4, 0x90
    xorl    %edx, %edx
    movq    %r8, %rax
    divq    %r10
    movq    %rdx, %r9
    addq    %rcx, %r9
    leaq    1(%r8), %rax
    xorl    %edx, %edx
    divq    %r10
    movq    %rdx, %rcx
    addq    %r9, %rcx
    leaq    2(%r8), %rax
    xorl    %edx, %edx
    divq    %r10
    movq    %rdx, %r9
    addq    %rcx, %r9
    leaq    3(%r8), %rax
    xorl    %edx, %edx
    divq    %r10
    movq    %rdx, %rcx
    addq    %r9, %rcx
    addq    $4, %r8
    cmpq    %r11, %r8
    jne    .LBB0_1

This is similar C code:

#include <stdio.h>

typedef unsigned __int64 u64;

int main() {
    const u64 M = 10000000000000000;

    u64 tot = 0;
    for (u64 i = 1000000000000000000; i < 1000000000080000000; i++) {
        tot += i % M;
    printf("%llu\n", tot);
    return 0;

Gcc 6.1.0 gives me asm like this, it doesn’t unroll the loop, and uses a mulq a imulq and some more things:

    xorl    %r8d, %r8d
    movabsq    $1000000000000000000, %rcx
    movabsq    $4153837486827862103, %r11
    movabsq    $10000000000000000, %r10
    movabsq    $1000000000080000000, %r9
    .p2align 4,,10
    movq    %rcx, %rax
    mulq    %r11
    movq    %rcx, %rax
    addq    $1, %rcx
    shrq    $51, %rdx
    imulq    %r10, %rdx
    subq    %rdx, %rax
    addq    %rax, %r8
    cmpq    %r9, %rcx
    jne    .L2

The output is 3199999960000000, the Rust code runs (with -O) to me in about 0.54 seconds, the C code in about 0.07 seconds.

I guess this is a small missing LLVM optimization. In the meantime do you know how I can “decompile” the C-derived asm to write equivalent Rust code?


Here is the Clang-compatible code: https://godbolt.org/g/PpDwE5

You can see that Clang doesn’t implement this optimization for this large of a constant.

However, if you use a smaller constant, Clang will optimize it: https://godbolt.org/g/ZkIfQV.

And so will rustc: https://is.gd/6w1pKf


I suggest reading http://ridiculousfish.com/blog/posts/labor-of-division-episode-i.html. You just need to write the multiplication and shift explicitly in your Rust code, using the same constants as GCC uses.


Thank you for your answers, I’ll try to do that.
But I’ll also probably write an enhancement request for LLVM:


The article says to use the high bits (he uses a 32 bit multiplication, I have to use a 64 bit one, so it’s like a 128 bit multiplication):

but takes the high 32 bits of the 64 bit result, not the low 32 bits
like we are used to. x86-64 does the same thing (notice the shift right
instruction operates on %edx, which contains the high bits, not %eax
which is the low bits).<

Can I work on the high bits of the multiplication result without asm in Rust?

Edit1: this GNU-C code seems to show that the numbers are correct:

__int64 mod(const __int64 i) {
    const __int64 M = 10000000000000000;
    const __uint128_t X = 4153837486827862103;
    const __int64 q = (__int64)((X * i) >> (51 + 64));
    return i - (q * M);

Using some digital magic GCC optimizes it as well as the natively generated code:

    movabsq $4153837486827862103, %rax
    mulq    %rcx
    movq    %rdx, %rax
    movabsq $10000000000000000, %rdx
    shrq    $51, %rax
    imulq   %rdx, %rax
    subq    %rax, %rcx
    movq    %rcx, %rax


GCC is using 64-bit x 64-bit -> 128-bit multiplication, which Rust doesn’t expose yet. There are ways to implement the multiplication using only 64-bit x 64-bit -> 64-bit multiplication that Rust supports.


Probably the best way to solve my problem in Rust today is to use a bit of assembly, but adapting code (from here: http://stackoverflow.com/questions/25095741/how-can-i-multiply-64-bit-operands-and-get-128-bit-result-portably ) I’ve written a pure-Rust function that gives bad-looking asm that is still sufficiently fast for my purposes (if I compile it with -C opt-level=3 -C target-cpu=native ):

fn rem_10e15(i: u64) -> u64 {
    // Returns the most significant part of the multiplication of two u64.
    fn mul64_hi(mut x: u64, mut y: u64) -> u64 {
        const MASK: u64 = 0x_ffff_ffff;
        let u1 = x & MASK;
        let v1 = y & MASK;
        let mut t = u1 * v1;
        let mut k = t >> 32;

        x >>= 32;
        t = (x * v1) + k;
        k = t & MASK;
        let w1 = t >> 32;

        y >>= 32;
        t = (u1 * y) + k;
        k = t >> 32;

        (x * y) + w1 + k

    const M: u64 = 10000000000000000;
    const X: u64 = 4153837486827862103;
    let q = mul64_hi(i, X) >> 51;
    i - (q * M)