Hi,
I've been trying to understand why LLVM auto-vectorizes one version of my code but not another, even though the two versions differ only slightly.
I'm compiling this code with rustc 1.55.0 and the following compiler flags: -C target-cpu=skylake -C opt-level=3
pub fn sum_buffer(buf: &[u8]) -> u32 {
    let mut accumulator: u32 = 0;
    let buf = &buf[5..];
    let mut chunks = buf.chunks_exact(2);
    while let Some(&[one, two]) = chunks.next() {
        let word: u16 = u16::from_ne_bytes([one, two]);
        accumulator = accumulator.wrapping_add(u32::from(word));
    }
    return accumulator;
}
and its associated assembly code from godbolt.
.LBB0_5:
        vpmovzxwd 5(%rdi,%rax,2), %ymm4
        vpaddd %ymm0, %ymm4, %ymm0
        vpmovzxwd 21(%rdi,%rax,2), %ymm4
        vpaddd %ymm1, %ymm4, %ymm1
        vpmovzxwd 37(%rdi,%rax,2), %ymm4
        vpaddd %ymm2, %ymm4, %ymm2
        vpmovzxwd 53(%rdi,%rax,2), %ymm4
With the above code, rustc and LLVM recognize that the loop can be auto-vectorized and do so, using vpmovzxwd and vpaddd instructions to calculate the sum. The optimization is applied for any line of the form let buf = &buf[1..] that I add, except for let buf = &buf[2..]. Removing the let buf = &buf[5..] line results in the following code:
pub fn sum_buffer(buf: &[u8]) -> u32 {
    let mut accumulator: u32 = 0;
    let mut chunks = buf.chunks_exact(2);
    while let Some(&[one, two]) = chunks.next() {
        let word: u16 = u16::from_ne_bytes([one, two]);
        accumulator = accumulator.wrapping_add(u32::from(word));
    }
    return accumulator;
}
and its associated assembly code from godbolt.
.LBB0_9:
        movzwl (%rdi), %esi
        addl %eax, %esi
        movzwl 2(%rdi), %eax
        addl %esi, %eax
        movzwl 4(%rdi), %esi
        addl %eax, %esi
        movzwl 6(%rdi), %eax
        addl %esi, %eax
        movzwl 8(%rdi), %esi
        addl %eax, %esi
        movzwl 10(%rdi), %eax
        addl %esi, %eax
        movzwl 12(%rdi), %esi
        addl %eax, %esi
        movzwl 14(%rdi), %eax
        addq $16, %rdi
        addl %esi, %eax
        addq $8, %rdx
        jne .LBB0_9
In this version, auto-vectorization is not applied; the loop is only unrolled and uses scalar movzwl and addl instructions.
Godbolt link for code: Compiler Explorer.
I've looked at the HIR and MIR generated by rustc and only see minor differences. The first major differences appear in the LLVM IR, where the first example contains vector.body sections and the second does not.
Could someone help explain why such similar MIR is converted to such different LLVM IR? Is there any way that I can eliminate the let buf = &buf[5..] line, but continue to have my code auto-vectorized? The vectorized version is substantially faster when I benchmark it with Criterion, and I would like to preserve that speedup if possible.
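For reference, one variant I could try is an iterator-based formulation of the same loop. This is only a sketch: the name sum_buffer_iter is hypothetical, and I have not confirmed that it is auto-vectorized by rustc 1.55.0 with the flags above, so the generated assembly would still need to be checked. It is meant to compute the same wrapping sum of native-endian 16-bit words as the second sum_buffer, which is copied here for comparison.

```rust
/// Hypothetical iterator-based variant (the name is illustrative only).
/// Sums native-endian u16 words with wrapping u32 addition and ignores a
/// trailing odd byte, like the `while let` loop in the post.
/// Whether LLVM vectorizes this is an assumption, not something verified here.
pub fn sum_buffer_iter(buf: &[u8]) -> u32 {
    buf.chunks_exact(2)
        .map(|pair| u32::from(u16::from_ne_bytes([pair[0], pair[1]])))
        .fold(0u32, u32::wrapping_add)
}

/// The loop from the post without the `&buf[5..]` line, kept for comparison.
pub fn sum_buffer(buf: &[u8]) -> u32 {
    let mut accumulator: u32 = 0;
    let mut chunks = buf.chunks_exact(2);
    while let Some(&[one, two]) = chunks.next() {
        let word: u16 = u16::from_ne_bytes([one, two]);
        accumulator = accumulator.wrapping_add(u32::from(word));
    }
    accumulator
}
```

Both functions should return the same value for any input slice, so the only open question is what code LLVM generates for each.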
Thanks!