Rust/LLVM Applies Auto Vectorization Inconsistently

Hi,

I've been trying to understand why LLVM auto-vectorizes one version of my code but not another, even though the two versions differ by only a single line.

I'm compiling this code with rustc 1.55.0 and the compiler flags -C target-cpu=skylake -C opt-level=3:

pub fn sum_buffer(buf: &[u8]) -> u32 {
    let mut accumulator: u32 = 0;
    let buf = &buf[5..];
    let mut chunks = buf.chunks_exact(2);
    while let Some(&[one, two]) = chunks.next() {
        let word: u16 = u16::from_ne_bytes([one, two]);
        accumulator = accumulator.wrapping_add(u32::from(word));
    }
    return accumulator;
}

This is an excerpt of the associated assembly from Godbolt:

.LBB0_5:
  vpmovzxwd 5(%rdi,%rax,2), %ymm4
  vpaddd %ymm0, %ymm4, %ymm0
  vpmovzxwd 21(%rdi,%rax,2), %ymm4
  vpaddd %ymm1, %ymm4, %ymm1
  vpmovzxwd 37(%rdi,%rax,2), %ymm4
  vpaddd %ymm2, %ymm4, %ymm2
  vpmovzxwd 53(%rdi,%rax,2), %ymm4

With the above code, rustc and LLVM recognize that the loop can be auto-vectorized and do so, using vpmovzxwd and vpaddd instructions to calculate the sum. The optimization is applied for any let buf = &buf[1..] line I add, except for let buf = &buf[2..]. If I remove the let buf = &buf[5..] line, I get the following code:

pub fn sum_buffer(buf: &[u8]) -> u32 {
    let mut accumulator: u32 = 0;
    let mut chunks = buf.chunks_exact(2);
    while let Some(&[one, two]) = chunks.next() {
        let word: u16 = u16::from_ne_bytes([one, two]);
        accumulator = accumulator.wrapping_add(u32::from(word));
    }
    return accumulator;
}

This is the associated assembly from Godbolt:

.LBB0_9:
  movzwl (%rdi), %esi
  addl %eax, %esi
  movzwl 2(%rdi), %eax
  addl %esi, %eax
  movzwl 4(%rdi), %esi
  addl %eax, %esi
  movzwl 6(%rdi), %eax
  addl %esi, %eax
  movzwl 8(%rdi), %esi
  addl %eax, %esi
  movzwl 10(%rdi), %eax
  addl %esi, %eax
  movzwl 12(%rdi), %esi
  addl %eax, %esi
  movzwl 14(%rdi), %eax
  addq $16, %rdi
  addl %esi, %eax
  addq $8, %rdx
  jne .LBB0_9

Here auto-vectorization is not applied, and the loop uses scalar movzwl and addl instructions instead.

Godbolt link for code: Compiler Explorer.

I've looked at the HIR and MIR generated by rustc and only see minor differences. The first major differences appear in the LLVM IR, where the first example contains vector.body sections and the second does not.

Could someone help explain why such similar MIR is converted to such different LLVM IR? Is there any way to eliminate the let buf = &buf[5..] line and still have my code auto-vectorized? The vectorized version is substantially faster when I benchmark it with Criterion, and I would like to preserve that if possible.

Thanks!

pub fn sum_buffer(buf: &[u8]) -> u32 {
    buf.chunks_exact(2).fold(0, |mut accum, x| {
        if let &[one, two] = x {
            let word: u16 = u16::from_ne_bytes([one, two]);
            accum = accum.wrapping_add(u32::from(word));
        }
        accum
    })
}

compiler explorer

pub fn sum_buffer(buf: &[u8]) -> u32 {
    let mut accumulator: u32 = 0;
    buf.chunks_exact(2).for_each(|x| {
        if let &[one, two] = x {
            let word: u16 = u16::from_ne_bytes([one, two]);
            accumulator = accumulator.wrapping_add(u32::from(word));
        }
    });
    return accumulator;
}

compiler explorer
I don't have an explanation for why the autovectorization didn't work in your case.
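
If you swap in one of the rewrites above, it may be worth confirming that it still returns the same value as the original loop, including for odd-length buffers where chunks_exact(2) drops the trailing byte. Here is a minimal sketch of such a check. The names sum_while_let, sum_fold and variants_agree are made up for illustration, and this only checks results, not whether LLVM vectorizes either version.

```rust
/// The external-iteration loop from the first post (without the `&buf[5..]` line).
pub fn sum_while_let(buf: &[u8]) -> u32 {
    let mut accumulator: u32 = 0;
    let mut chunks = buf.chunks_exact(2);
    while let Some(&[one, two]) = chunks.next() {
        let word = u16::from_ne_bytes([one, two]);
        accumulator = accumulator.wrapping_add(u32::from(word));
    }
    accumulator
}

/// The `fold` rewrite, written with a `match` instead of `if let`.
pub fn sum_fold(buf: &[u8]) -> u32 {
    buf.chunks_exact(2).fold(0u32, |accum, x| match x {
        &[one, two] => accum.wrapping_add(u32::from(u16::from_ne_bytes([one, two]))),
        // `chunks_exact(2)` only yields slices of length 2, so this arm is never taken.
        _ => accum,
    })
}

/// Hypothetical helper: true when both variants return the same sum for `buf`.
pub fn variants_agree(buf: &[u8]) -> bool {
    sum_while_let(buf) == sum_fold(buf)
}
```

Calling variants_agree on a few empty, even-length and odd-length buffers is a cheap sanity check before benchmarking.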


In theory you can get LLVM to report its vectorization decisions with -C llvm-args='--pass-remarks=vectorize', but that produces no output when I run it, with or without let buf = &buf[5..];. Maybe another of the many options listed by rustc -C llvm-args='--help-list-hidden' would be useful here?


I was able to get some output on my local desktop using the following LLVM args: llvm-args="--pass-remarks=.*vector.* --pass-remarks-analysis=.*vector.*"

For the code with let buf = &buf[5..] I see:

remark: <unknown>:0:0: Disabling scalable vectorization, because target does not support scalable vectors.
remark: <unknown>:0:0: vectorized loop (vectorization width: 8, interleaved count: 4)

For the code without let buf = &buf[5..] I see:

remark: <unknown>:0:0: loop not vectorized: value that could not be identified as reduction is used outside the loop

I'm still trying to figure out what "loop not vectorized: value that could not be identified as reduction is used outside the loop" means, but it looks like LLVM is optimizing the two versions differently for some reason.


Ah, right, you have to give it a regex. I'm guessing "value that could not be identified as reduction" refers to accumulator?

For loop bodies that do only a small amount of work, I think you'll find internal iteration (for_each, fold, sum and friends, rather than a while let or for loop) is far more reliable.

I'll toss out this version, for fun:

#![feature(slice_as_chunks)]
pub fn sum_buffer(buf: &[u8]) -> u32 {
    let (chunks, tail) = buf.as_chunks::<2>();
    assert!(tail.is_empty());

    chunks
        .iter()
        .cloned()
        .map(u16::from_ne_bytes)
        .map(u32::from)
        .sum()
}
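
Note that the nightly version above is not a drop-in replacement: the assert! panics on odd-length input (the original loop silently ignores the trailing byte), and .sum() uses ordinary addition, which panics on overflow when overflow checks are enabled instead of wrapping. Below is a rough sketch of the same internal-iteration shape on stable Rust that keeps the original semantics. The name sum_buffer_stable is made up for illustration, and I have not verified that LLVM vectorizes it on any particular target, so check the assembly or the pass remarks before relying on it.

```rust
/// Sums the buffer as native-endian u16 words with wrapping u32 addition,
/// ignoring a trailing odd byte (same behavior as the original loop).
pub fn sum_buffer_stable(buf: &[u8]) -> u32 {
    buf.chunks_exact(2)
        // `chunks_exact(2)` only yields slices of length 2, so indexing 0 and 1 cannot panic.
        .map(|pair| u16::from_ne_bytes([pair[0], pair[1]]))
        .map(u32::from)
        // Wrapping reduction, matching `accumulator.wrapping_add(..)` in the original.
        .fold(0u32, u32::wrapping_add)
}
```

The fold with u32::wrapping_add is there only to match the wrapping behavior of the original accumulator; if overflow cannot happen for your inputs, .sum() would read more simply.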

When I reran the command with debug symbols enabled, the remark actually pointed at a location in slice/iter.rs:

remark: /rustc/c8dfcfe046a7680554bf4eb612bad840e7631c4b/library/core/src/slice/iter.rs:1706:9: vectorized loop (vectorization width: 8, interleaved count: 4)

I'm guessing it is some combination of loop vectorization optimization plus some other LLVM pass(es) that are generating the optimized instructions for the addition. I guess I'll just need to dig through those to see which one it actually is.

Thanks for all the help!
