Are vectorization failures due to Rust or LLVM?

Here's a simplified piece of extremely performance-sensitive decompression code I would like to use:

pub unsafe fn decompress_offsets(
    base_bit_idx: usize,
    src: &[u8],
    offset_bits_csum_scratch: &[u32],
    offset_bits_scratch: &[u32],
    latents: &mut [u64],
) {
    for (&offset_bits, (&offset_bits_csum, latent)) in offset_bits_scratch
        .iter()
        .zip(offset_bits_csum_scratch.iter().zip(latents.iter_mut()))
    {
        let bit_idx = base_bit_idx as u32 + offset_bits_csum;
        let byte_idx = bit_idx / 8;
        let bits_past_byte = bit_idx % 8;
        *latent = read_u64_at(src, byte_idx as usize, bits_past_byte, offset_bits)
            .wrapping_add(*latent);
    }
}

#[inline]
unsafe fn read_u64_at(
    src: &[u8],
    byte_idx: usize,
    bits_past_byte: u32,
    n: u32,
) -> u64 {
    debug_assert!(n <= 57);
    let raw_bytes = *(src.as_ptr().add(byte_idx) as *const [u8; 8]);
    let value = u64::from_le_bytes(raw_bytes);
    (value >> bits_past_byte) & ((1 << n) - 1)
}

godbolt link

This vectorizes on x64 but fails to do so on aarch64. I can get some very similar loops to vectorize if I

  1. remove the final wrapping add, or
  2. write to another dst: &mut [u64] buffer instead of working in-place.
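For concreteness, here is a sketch of variant (2), decoding into a separate dst buffer. The names (`decompress_offsets_to_dst`, `read_u64_checked`) are illustrative, and the helper uses a bounds-checked read in place of the unsafe raw-pointer read so the sketch is runnable as-is:

```rust
// Bounds-checked stand-in for the unsafe read_u64_at above (illustrative).
fn read_u64_checked(src: &[u8], byte_idx: usize, bits_past_byte: u32, n: u32) -> u64 {
    debug_assert!(n <= 57);
    let mut raw = [0u8; 8];
    raw.copy_from_slice(&src[byte_idx..byte_idx + 8]);
    (u64::from_le_bytes(raw) >> bits_past_byte) & ((1u64 << n) - 1)
}

// Variant (2): same loop shape, but out-of-place via a separate `dst` buffer.
pub fn decompress_offsets_to_dst(
    base_bit_idx: usize,
    src: &[u8],
    offset_bits_csum_scratch: &[u32],
    offset_bits_scratch: &[u32],
    latents: &[u64],
    dst: &mut [u64],
) {
    for (((&offset_bits, &offset_bits_csum), &latent), out) in offset_bits_scratch
        .iter()
        .zip(offset_bits_csum_scratch.iter())
        .zip(latents.iter())
        .zip(dst.iter_mut())
    {
        let bit_idx = base_bit_idx as u32 + offset_bits_csum;
        let byte_idx = (bit_idx / 8) as usize;
        let bits_past_byte = bit_idx % 8;
        *out = read_u64_checked(src, byte_idx, bits_past_byte, offset_bits).wrapping_add(latent);
    }
}
```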

However, I would rather not do those things for performance reasons, and in reality I have several generic versions of this loop, so I can't easily write inline assembly.

Things I've tried:

  • looked at the LLVM IR. The vectorizing versions have a vector.body section, but I'm not sure whether rustc produces that or whether LLVM does and I'm just looking at the IR after all the optimization passes have run.
  • looked at the assembly on both platforms. It looks to me like the output I want is achievable by hand-tweaking the assembly produced for variant (1) above.
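One way to attribute the failure directly: rustc hands LLVM straight-line scalar IR, and LLVM's LoopVectorize pass decides whether to create vector.body, so you can ask for that pass's remarks and compare pre/post-optimization IR. These invocations are a sketch using rustc's documented codegen options (check `rustc -C help` on your toolchain; the file name is a placeholder):

```shell
# Print LLVM optimization remarks for the loop vectorizer; -Cdebuginfo=1
# attaches source locations so remarks point at the Rust lines involved.
rustc -O -Cdebuginfo=1 -Cremark=loop-vectorize --emit=obj decompress.rs

# Compare the IR rustc hands to LLVM with the IR after LLVM's passes:
rustc -Copt-level=0 --emit=llvm-ir -o before.ll decompress.rs
rustc -O --emit=llvm-ir -o after.ll decompress.rs
# If vector.body appears only in after.ll, LLVM's vectorizer created it.
```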

So how can I tell if a vectorization failure is due to Rust or LLVM? If the former, how can we improve the compiler in this case? Are there any good workarounds for the moment?

Rust itself does zero auto-vectorization. If you didn't write explicit SIMD (via core::arch intrinsics or core::simd::Simd) and you're seeing SIMD instructions, it's because LLVM recognized the loop and vectorized it.
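Given that, one stable-Rust workaround you could try while the LLVM side is investigated: keep the update in place, but split each fixed-size chunk into a scalar bit-extraction pass and a separate contiguous add pass, which is a form LLVM vectorizes readily. Everything below (the function names, the chunk size, the bounds-checked read standing in for the unsafe one) is an illustrative sketch, not the real codebase:

```rust
// Bounds-checked stand-in for the unsafe read_u64_at above (illustrative).
fn read_u64_checked(src: &[u8], byte_idx: usize, bits_past_byte: u32, n: u32) -> u64 {
    debug_assert!(n <= 57);
    let mut raw = [0u8; 8];
    raw.copy_from_slice(&src[byte_idx..byte_idx + 8]);
    (u64::from_le_bytes(raw) >> bits_past_byte) & ((1u64 << n) - 1)
}

// Two-pass, in-place variant: extract into a small stack buffer, then do the
// wrapping adds in a plain loop over contiguous slices.
pub fn decompress_offsets_two_pass(
    base_bit_idx: usize,
    src: &[u8],
    offset_bits_csum_scratch: &[u32],
    offset_bits_scratch: &[u32],
    latents: &mut [u64],
) {
    const CHUNK: usize = 32; // arbitrary; stays cache-resident
    let mut buf = [0u64; CHUNK];
    let n = latents.len();
    let mut start = 0;
    while start < n {
        let len = CHUNK.min(n - start);
        // Pass 1: scalar bit extraction into the stack buffer.
        for i in 0..len {
            let bit_idx = base_bit_idx as u32 + offset_bits_csum_scratch[start + i];
            buf[i] = read_u64_checked(
                src,
                (bit_idx / 8) as usize,
                bit_idx % 8,
                offset_bits_scratch[start + i],
            );
        }
        // Pass 2: a trivially vectorizable add loop, still in place.
        for (latent, b) in latents[start..start + len].iter_mut().zip(&buf[..len]) {
            *latent = latent.wrapping_add(*b);
        }
        start += len;
    }
}
```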
