Are vectorization failures due to Rust or LLVM?

Here's a simplified piece of extremely performance-sensitive decompression code I would like to use:

pub unsafe fn decompress_offsets(
    base_bit_idx: usize,
    src: &[u8],
    offset_bits_csum_scratch: &[u32],
    offset_bits_scratch: &[u32],
    latents: &mut [u64],
) {
    for (&offset_bits, (&offset_bits_csum, latent)) in offset_bits_scratch
        .iter()
        .zip(offset_bits_csum_scratch.iter().zip(latents.iter_mut()))
    {
        let bit_idx = base_bit_idx as u32 + offset_bits_csum;
        let byte_idx = bit_idx / 8;
        let bits_past_byte = bit_idx % 8;
        *latent = read_u64_at(src, byte_idx as usize, bits_past_byte, offset_bits)
            .wrapping_add(*latent);
    }
}

#[inline]
unsafe fn read_u64_at(
    src: &[u8],
    byte_idx: usize,
    bits_past_byte: u32,
    n: u32,
) -> u64 {
    debug_assert!(n <= 57);
    let raw_bytes = *(src.as_ptr().add(byte_idx) as *const [u8; 8]);
    let value = u64::from_le_bytes(raw_bytes);
    (value >> bits_past_byte) & ((1 << n) - 1)
}

godbolt link

This vectorizes on x64 but fails to do so on aarch64. I can get some very similar loops to vectorize if I

  1. remove the final wrapping add, or
  2. write to another dst: &mut [u64] buffer instead of working in-place.
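For concreteness, here is a sketch of variant (2), decoding into a separate dst buffer. The names (`decompress_offsets_to_dst`, `read_u64_checked`) are illustrative, and the helper uses a bounds-checked read in place of the unsafe raw-pointer read so the sketch is runnable as-is:

```rust
// Bounds-checked stand-in for the unsafe read_u64_at above (illustrative).
fn read_u64_checked(src: &[u8], byte_idx: usize, bits_past_byte: u32, n: u32) -> u64 {
    debug_assert!(n <= 57);
    let mut raw = [0u8; 8];
    raw.copy_from_slice(&src[byte_idx..byte_idx + 8]);
    (u64::from_le_bytes(raw) >> bits_past_byte) & ((1u64 << n) - 1)
}

// Variant (2): same loop shape, but out-of-place via a separate `dst` buffer.
pub fn decompress_offsets_to_dst(
    base_bit_idx: usize,
    src: &[u8],
    offset_bits_csum_scratch: &[u32],
    offset_bits_scratch: &[u32],
    latents: &[u64],
    dst: &mut [u64],
) {
    for (((&offset_bits, &offset_bits_csum), &latent), out) in offset_bits_scratch
        .iter()
        .zip(offset_bits_csum_scratch.iter())
        .zip(latents.iter())
        .zip(dst.iter_mut())
    {
        let bit_idx = base_bit_idx as u32 + offset_bits_csum;
        let byte_idx = (bit_idx / 8) as usize;
        let bits_past_byte = bit_idx % 8;
        *out = read_u64_checked(src, byte_idx, bits_past_byte, offset_bits).wrapping_add(latent);
    }
}
```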

However, I would rather not do those things for performance reasons, and in reality I have several generic versions of this loop, so I can't easily write inline assembly.

Things I've tried:

  • looked at the LLVM IR. The vectorizing versions have a vector.body section, but I'm not sure whether rustc produces that or whether LLVM does and I'm just looking at the IR after all the optimization passes have run.
  • looked at the assembly on both platforms. It looks to me like the output I want is achievable by hand-tweaking the assembly produced for variant (1) above.
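One way to attribute the failure directly: rustc hands LLVM straight-line scalar IR, and LLVM's LoopVectorize pass decides whether to create vector.body, so you can ask for that pass's remarks and compare pre/post-optimization IR. These invocations are a sketch using rustc's documented codegen options (check `rustc -C help` on your toolchain; the file name is a placeholder):

```shell
# Print LLVM optimization remarks for the loop vectorizer; -Cdebuginfo=1
# attaches source locations so remarks point at the Rust lines involved.
rustc -O -Cdebuginfo=1 -Cremark=loop-vectorize --emit=obj decompress.rs

# Compare the IR rustc hands to LLVM with the IR after LLVM's passes:
rustc -Copt-level=0 --emit=llvm-ir -o before.ll decompress.rs
rustc -O --emit=llvm-ir -o after.ll decompress.rs
# If vector.body appears only in after.ll, LLVM's vectorizer created it.
```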

So how can I tell if a vectorization failure is due to Rust or LLVM? If the former, how can we improve the compiler in this case? Are there any good workarounds for the moment?

Rust itself does zero auto-vectorization. If you didn't write explicit SIMD (via core::arch intrinsics or core::simd::Simd) and you're seeing SIMD instructions, it's because LLVM recognized the loop and vectorized it.
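Given that, one stable-Rust workaround you could try while the LLVM side is investigated: keep the update in place, but split each fixed-size chunk into a scalar bit-extraction pass and a separate contiguous add pass, which is a form LLVM vectorizes readily. Everything below (the function names, the chunk size, the bounds-checked read standing in for the unsafe one) is an illustrative sketch, not the real codebase:

```rust
// Bounds-checked stand-in for the unsafe read_u64_at above (illustrative).
fn read_u64_checked(src: &[u8], byte_idx: usize, bits_past_byte: u32, n: u32) -> u64 {
    debug_assert!(n <= 57);
    let mut raw = [0u8; 8];
    raw.copy_from_slice(&src[byte_idx..byte_idx + 8]);
    (u64::from_le_bytes(raw) >> bits_past_byte) & ((1u64 << n) - 1)
}

// Two-pass, in-place variant: extract into a small stack buffer, then do the
// wrapping adds in a plain loop over contiguous slices.
pub fn decompress_offsets_two_pass(
    base_bit_idx: usize,
    src: &[u8],
    offset_bits_csum_scratch: &[u32],
    offset_bits_scratch: &[u32],
    latents: &mut [u64],
) {
    const CHUNK: usize = 32; // arbitrary; stays cache-resident
    let mut buf = [0u64; CHUNK];
    let n = latents.len();
    let mut start = 0;
    while start < n {
        let len = CHUNK.min(n - start);
        // Pass 1: scalar bit extraction into the stack buffer.
        for i in 0..len {
            let bit_idx = base_bit_idx as u32 + offset_bits_csum_scratch[start + i];
            buf[i] = read_u64_checked(
                src,
                (bit_idx / 8) as usize,
                bit_idx % 8,
                offset_bits_scratch[start + i],
            );
        }
        // Pass 2: a trivially vectorizable add loop, still in place.
        for (latent, b) in latents[start..start + len].iter_mut().zip(&buf[..len]) {
            *latent = latent.wrapping_add(*b);
        }
        start += len;
    }
}
```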
