While hunting for performance problems in my code, I think I have found a culprit: the % operator with a u64 constant divisor. Here is a synthetic Rust benchmark:
fn main() {
    const M: u64 = 10_000_000_000_000_000;
    let mut tot = 0;
    for i in 1_000_000_000_000_000_000 .. 1_000_000_000_080_000_000 {
        tot += i % M;
    }
    println!("{}", tot);
}
Rustc gives me asm like this; it unrolls the loop four times and uses a divq for each iteration:
.Ltmp2:
        .seh_endprologue
        movabsq $1000000000000000000, %r8
        movq    $0, 32(%rsp)
        movabsq $10000000000000000, %r10
        xorl    %ecx, %ecx
        leaq    80000000(%r8), %r11
        .p2align 4, 0x90
.LBB0_1:
        xorl    %edx, %edx
        movq    %r8, %rax
        divq    %r10
        movq    %rdx, %r9
        addq    %rcx, %r9
        leaq    1(%r8), %rax
        xorl    %edx, %edx
        divq    %r10
        movq    %rdx, %rcx
        addq    %r9, %rcx
        leaq    2(%r8), %rax
        xorl    %edx, %edx
        divq    %r10
        movq    %rdx, %r9
        addq    %rcx, %r9
        leaq    3(%r8), %rax
        xorl    %edx, %edx
        divq    %r10
        movq    %rdx, %rcx
        addq    %r9, %rcx
        addq    $4, %r8
        cmpq    %r11, %r8
        jne     .LBB0_1
Here is the equivalent C code:
#include <stdio.h>

/* MinGW/MSVC-specific 64-bit type; uint64_t from <stdint.h> is the
   portable equivalent. */
typedef unsigned __int64 u64;

int main() {
    const u64 M = 10000000000000000;
    u64 tot = 0;
    for (u64 i = 1000000000000000000; i < 1000000000080000000; i++) {
        tot += i % M;
    }
    printf("%llu\n", tot);
    return 0;
}
GCC 6.1.0 gives me asm like this; it doesn't unroll the loop, and instead of a divq it uses a mulq by a magic constant, a shift, and an imulq:
        xorl    %r8d, %r8d
        movabsq $1000000000000000000, %rcx
        movabsq $4153837486827862103, %r11
        movabsq $10000000000000000, %r10
        movabsq $1000000000080000000, %r9
        .p2align 4,,10
.L2:
        movq    %rcx, %rax
        mulq    %r11
        movq    %rcx, %rax
        addq    $1, %rcx
        shrq    $51, %rdx
        imulq   %r10, %rdx
        subq    %rdx, %rax
        addq    %rax, %r8
        cmpq    %r9, %rcx
        jne     .L2
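If I read the GCC asm right, this is the classic "division by invariant integers using multiplication" trick (Granlund and Montgomery): mulq computes the high 64 bits of i * 4153837486827862103, shrq $51 finishes a total shift of 64 + 51 = 115 bits to produce the quotient i / M, and the imulq/subq pair reconstructs the remainder. The magic constant appears to be ceil(2^115 / 10^16), which can be checked with u128 arithmetic (this check is my own addition, not part of the benchmark):

```rust
fn main() {
    const M: u128 = 10_000_000_000_000_000; // 10^16
    // GCC's magic constant and shift, read off the asm above.
    const MAGIC: u128 = 4_153_837_486_827_862_103;
    const SHIFT: u32 = 64 + 51; // mulq keeps the high 64 bits, shrq $51 adds the rest

    // The multiplier is the reciprocal 2^115 / M, rounded up.
    assert_eq!(MAGIC, ((1u128 << SHIFT) + M - 1) / M);

    // Exactness condition: the rounding error MAGIC * M - 2^115 must be
    // at most 2^51, which makes the multiply-shift quotient exact for
    // every 64-bit i.
    assert!(MAGIC * M - (1u128 << SHIFT) <= 1u128 << 51);
    println!("ok"); // prints "ok"
}
```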
Both programs print 3199999960000000. On my machine the Rust code (compiled with -O) runs in about 0.54 seconds, the C code in about 0.07 seconds.
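As a sanity check on the benchmark itself (my own aside): over this range the quotient i / M is always 100, so i % M collapses to i - 10^18, and the expected total is just the sum 0 + 1 + ... + 79_999_999:

```rust
fn main() {
    const M: u64 = 10_000_000_000_000_000;
    let lo: u64 = 1_000_000_000_000_000_000;
    let hi: u64 = 1_000_000_000_080_000_000;
    // The quotient is the constant 100 across the whole range,
    // so i % M == i - 100 * M == i - 10^18 for every i in it.
    assert_eq!(lo / M, 100);
    assert_eq!((hi - 1) / M, 100);
    // Sum of 0..n is n * (n - 1) / 2 with n = 80_000_000 iterations.
    let n = hi - lo;
    println!("{}", n * (n - 1) / 2); // prints 3199999960000000
}
```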
I guess this is a small missing LLVM optimization. In the meantime, does anyone know how I can "decompile" the GCC-generated asm to write equivalent Rust code?
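To partially answer my own question: the mulq (a 64x64 -> 128-bit multiply keeping the high half) can be expressed in Rust with a widening u128 multiply, which LLVM lowers to a single mulq on x86-64. A sketch under my reading of the asm (rem_1e16 is a name I made up; the constants are taken from the GCC output, and I've only verified it against % on a handful of values):

```rust
const M: u64 = 10_000_000_000_000_000;
const MAGIC: u64 = 4_153_837_486_827_862_103;

// Hand-written equivalent of what the GCC asm computes for i % M.
fn rem_1e16(i: u64) -> u64 {
    // mulq + shrq $51: high 64 bits of the 128-bit product, shifted
    // 51 more, i.e. (i * MAGIC) >> 115, which is the quotient i / M.
    let q = ((i as u128 * MAGIC as u128) >> (64 + 51)) as u64;
    // imulq + subq: remainder = i - q * M.
    i - q * M
}

fn main() {
    // Spot-check against the built-in operator.
    for &i in &[0u64, 1, M - 1, M, 123_456_789_012_345_678, u64::MAX] {
        assert_eq!(rem_1e16(i), i % M);
    }
    println!("ok"); // prints "ok"
}
```

Dropping rem_1e16(i) in place of i % M in the benchmark loop should produce the same total, since the multiply-shift quotient matches the true quotient for these inputs.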