A performance problem with u64 mod [Solved]

While looking for performance troubles in my code, I think I have found a cause, the % done on a u64 constant. This is a synthetic Rust benchmark:

fn main() {
    const M: u64 = 10_000_000_000_000_000;

    let mut tot = 0;
    for i in 1_000_000_000_000_000_000 .. 1_000_000_000_080_000_000 {
        tot += i % M;
    }
    println!("{}", tot);
}

Rustc gives me asm like this, it uses divq and unrolls the loop four times:

.Ltmp2:
    .seh_endprologue
    movabsq    $1000000000000000000, %r8
    movq    $0, 32(%rsp)
    movabsq    $10000000000000000, %r10
    xorl    %ecx, %ecx
    leaq    80000000(%r8), %r11
    .p2align    4, 0x90
.LBB0_1:
    xorl    %edx, %edx
    movq    %r8, %rax
    divq    %r10
    movq    %rdx, %r9
    addq    %rcx, %r9
    leaq    1(%r8), %rax
    xorl    %edx, %edx
    divq    %r10
    movq    %rdx, %rcx
    addq    %r9, %rcx
    leaq    2(%r8), %rax
    xorl    %edx, %edx
    divq    %r10
    movq    %rdx, %r9
    addq    %rcx, %r9
    leaq    3(%r8), %rax
    xorl    %edx, %edx
    divq    %r10
    movq    %rdx, %rcx
    addq    %r9, %rcx
    addq    $4, %r8
    cmpq    %r11, %r8
    jne    .LBB0_1

This is similar C code:

#include <stdio.h>

typedef unsigned __int64 u64;

int main() {
    const u64 M = 10000000000000000;

    u64 tot = 0;
    for (u64 i = 1000000000000000000; i < 1000000000080000000; i++) {
        tot += i % M;
    }
    printf("%llu\n", tot);
    return 0;
}

Gcc 6.1.0 gives me asm like this, it doesn't unroll the loop, and uses a mulq a imulq and some more things:

    xorl    %r8d, %r8d
    movabsq    $1000000000000000000, %rcx
    movabsq    $4153837486827862103, %r11
    movabsq    $10000000000000000, %r10
    movabsq    $1000000000080000000, %r9
    .p2align 4,,10
.L2:
    movq    %rcx, %rax
    mulq    %r11
    movq    %rcx, %rax
    addq    $1, %rcx
    shrq    $51, %rdx
    imulq    %r10, %rdx
    subq    %rdx, %rax
    addq    %rax, %r8
    cmpq    %r9, %rcx
    jne    .L2

The output is 3199999960000000, the Rust code runs (with -O) to me in about 0.54 seconds, the C code in about 0.07 seconds.

I guess this is a small missing LLVM optimization. In the meantime do you know how I can "decompile" the C-derived asm to write equivalent Rust code?

Here is the Clang-compatible code: Compiler Explorer

You can see that Clang doesn't implement this optimization for this large of a constant.

However, if you use a smaller constant, Clang will optimize it: Compiler Explorer.

And so will rustc: Rust Playground

1 Like

I suggest reading Labor of Division (Episode I). You just need to write the multiplication and shift explicitly in your Rust code, using the same constants as GCC uses.

1 Like

Thank you for your answers, I'll try to do that.
But I'll also probably write an enhancement request for LLVM:

The article says to use the high bits (he uses a 32 bit multiplication, I have to use a 64 bit one, so it's like a 128 bit multiplication):

but takes the high 32 bits of the 64 bit result, not the low 32 bits
like we are used to. x86-64 does the same thing (notice the shift right
instruction operates on %edx, which contains the high bits, not %eax
which is the low bits).<

Can I work on the high bits of the multiplication result without asm in Rust?

Edit1: this GNU-C code seems to show that the numbers are correct:

__int64 mod(const __int64 i) {
    const __int64 M = 10000000000000000;
    const __uint128_t X = 4153837486827862103;
    const __int64 q = (__int64)((X * i) >> (51 + 64));
    return i - (q * M);
}

Using some digital magic GCC optimizes it as well as the natively generated code:

mod:
    movabsq $4153837486827862103, %rax
    mulq    %rcx
    movq    %rdx, %rax
    movabsq $10000000000000000, %rdx
    shrq    $51, %rax
    imulq   %rdx, %rax
    subq    %rax, %rcx
    movq    %rcx, %rax
    ret

GCC is using 64-bit x 64-bit -> 128-bit multiplication, which Rust doesn't expose yet. There are ways to implement the multiplication using only 64-bit x 64-bit -> 64-bit multiplication that Rust supports.

Probably the best way to solve my problem in Rust today is to use a bit of assembly, but adapting code (from here: c - How can I multiply 64 bit operands and get 128 bit result portably? - Stack Overflow ) I've written a pure-Rust function that gives bad-looking asm that is still sufficiently fast for my purposes (if I compile it with -C opt-level=3 -C target-cpu=native ):

#[inline]
fn rem_10e15(i: u64) -> u64 {
    // Returns the most significant part of the multiplication of two u64.
    #[inline]
    fn mul64_hi(mut x: u64, mut y: u64) -> u64 {
        const MASK: u64 = 0x_ffff_ffff;
        let u1 = x & MASK;
        let v1 = y & MASK;
        let mut t = u1 * v1;
        let mut k = t >> 32;

        x >>= 32;
        t = (x * v1) + k;
        k = t & MASK;
        let w1 = t >> 32;

        y >>= 32;
        t = (u1 * y) + k;
        k = t >> 32;

        (x * y) + w1 + k
    }

    const M: u64 = 10000000000000000;
    const X: u64 = 4153837486827862103;
    let q = mul64_hi(i, X) >> 51;
    i - (q * M)
}