How do I convince the compiler to autovectorize my code?

I have written several implementations for calculating the mean:

#![feature(portable_simd)]
use std::simd::num::SimdFloat;
use std::simd::f64x4;

pub fn mean_naive(a: &[f64]) -> f64 {
    a.iter().sum::<f64>() / a.len() as f64
}

pub fn mean_fast(a: &[f64]) -> f64 {
    let chunk_size = 4;
    assert!(a.len() >= chunk_size);
    let mut i: usize = 0;
    let r: f64 = a
        .chunks_exact(chunk_size)
        .map(|x| {
            i += chunk_size;
            x.iter().sum::<f64>()
        })
        .sum();
    (r + a[i..].iter().sum::<f64>()) / a.len() as f64
}

pub fn mean_simd(x: &[f64]) -> f64 {
    let chunk_size = 4;
    assert!(x.len() >= chunk_size);

    let mut i: usize = 0;
    let sum = x
        .chunks_exact(chunk_size)
        .fold(f64x4::splat(0.), |sum, chunk| {
            i += chunk_size;
            sum + f64x4::from_slice(chunk)
        });
    let r = sum.reduce_sum();
    (r + x[i..].iter().sum::<f64>()) / x.len() as f64
}

https://rust.godbolt.org/z/1cT6Phvzv

Performance measurements are something like this:

mean_naive_bench          ... bench:       2,102 ns/iter (+/- 143)
mean_fast_bench           ... bench:         539 ns/iter (+/- 30)
mean_simd_bench           ... bench:         522 ns/iter (+/- 32)
  1. Why does the compiler prefer to do loop unrolling rather than vectorization? Is this because the compiler is not sure of the data alignment needed for SIMD? How can it be pushed to vectorize using the example code above?
  2. Vectorization is applied in mean_simd. Does this mean that the compiler has made an assumption that the input data will be aligned? And do I somehow have to guarantee that alignment?
  3. The performance of mean_naive and mean_fast differs by a factor of 4. The difference is that mean_fast unrolls the loop more. Is choosing the optimal degree of loop unrolling a difficult problem for the compiler, so that unrolling must be done manually/experimentally?

Because float addition isn't associative, the compiler isn't allowed to reorder the additions into SIMD lanes.

Write it for i32 addition and it'll vectorize too.
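To make that claim concrete, here's a minimal sketch (stable Rust, hypothetical function names): with integer elements the reduction is associative, so the compiler is free to reorder it and emit packed adds for the same naive-looking loop.

```rust
// Same shape as mean_naive, but over i32: integer addition is associative,
// so LLVM may split the reduction across SIMD lanes without changing the result.
pub fn sum_i32(a: &[i32]) -> i64 {
    // Widen to i64 in the accumulator to avoid overflow on long inputs.
    a.iter().map(|&x| x as i64).sum()
}

pub fn mean_i32(a: &[i32]) -> f64 {
    sum_i32(a) as f64 / a.len() as f64
}
```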


As an aside, there's a utility method on slices, as_simd, which can be used to make mean_simd a little more generic:

use std::simd::{num::SimdFloat, Simd};

pub fn mean_simd_slice(x: &[f64]) -> f64 {
    const LANES: usize = 4;
    assert!(x.len() >= LANES);
    let (prefix, main, suffix) = x.as_simd::<LANES>();

    let prefix_sum = prefix.iter().sum::<f64>();
    let suffix_sum = suffix.iter().sum::<f64>();
    let main_sum = main.iter().sum::<Simd<f64, LANES>>().reduce_sum();

    (prefix_sum + suffix_sum + main_sum) / x.len() as f64
}

You could set LANES to 64 and be guaranteed use of SIMD registers up to 4096 bits wide (64 elements) if the CPU you're running on supports that. The compiler also knows enough to split the operation across smaller registers on CPUs with narrower SIMD units, exploiting the fact that you've opted out of associativity.


This autovectorizes, is safe, works on stable and doesn't need explicit simd:

pub fn mean_fast(a: &[f64]) -> f64 {
    const CHUNK_SIZE: usize = 4;
    let mut chunks = a.chunks_exact(CHUNK_SIZE);
    let mut acc = [0.0; CHUNK_SIZE];

    for chunk in &mut chunks {
        for (acc, elem) in acc.iter_mut().zip(chunk) {
            *acc += elem;
        }
    }

    acc.iter().chain(chunks.remainder()).sum::<f64>() / a.len() as f64
}

The difference from your mean_fast is this: if you view the array as a grid where each row has CHUNK_SIZE elements, your version sums row by row and then combines the row sums, while autovectorization requires summing column by column and only combining the per-column accumulators at the end.
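A minimal sketch of the two orders side by side (hypothetical function names; both ignore any remainder for brevity, and on exactly-representable inputs they agree):

```rust
// Row by row (the shape of the original mean_fast): each chunk is reduced
// to a scalar before the next one starts, serializing the float adds.
pub fn sum_row_wise(a: &[f64]) -> f64 {
    a.chunks_exact(4).map(|row| row.iter().sum::<f64>()).sum()
}

// Column by column (the autovectorizable shape): four independent
// accumulators, one per lane, combined only at the very end.
pub fn sum_col_wise(a: &[f64]) -> f64 {
    let mut acc = [0.0f64; 4];
    for row in a.chunks_exact(4) {
        for (acc, x) in acc.iter_mut().zip(row) {
            *acc += x;
        }
    }
    acc.iter().sum()
}
```

The column-wise version is what the compiler can map directly onto a vector register, because each accumulator's additions are independent of the others.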


The unstable fadd_fast intrinsic (which tells LLVM it may reassociate float additions) ought to help with that.


Intriguingly, if I increase CHUNK_SIZE in your version to between 8 and 32, it's willing to use 8 element vector registers; if I go to a CHUNK_SIZE of 64 (to match the largest value of LANES that works in mean_simd_slice), the autovectorizer drops back to 4 element registers (256 bit), while mean_simd_slice using portable SIMD still uses 512 bit registers.

See the two examples in one place for comparison - this is set to 32 element chunks, where it all works fine for both, but if you increase to 64, the autovectorizer falls back to 256 bit registers, while portable SIMD uses 512 bit registers still.

Well, the 512-bit situation is complicated by the penalties associated with the power-license shift-up that some Intel CPUs need before the 512-bit units run at full speed. If you specify a concrete CPU (e.g. znver4) instead of the generic -v4, it'll produce different results.

And I guess those code paths (autovectorization vs. explicit vector types) just hit slightly different heuristics.

1 Like

I'm just surprised that the heuristics have this drop in register size as the chunk size goes up; having portable SIMD pick 512 bit registers when autovectorization picks 256 bit (or vice-versa) would not be surprising at all for all the reasons you give. But having a chunk size of 32 elements pick 512 bit registers, then a chunk size of 64 elements pick 256 bit registers surprises me.

And if portable SIMD had also dropped down to 256 bit registers when it hit a chunk size of 64 elements, I'd be looking at optimization manuals to see if there's a good reason for this that I'd not picked up. It's the combination of "smaller registers for more elements" and "alternative route still picks big registers" that's intriguing.

Also, I'd not class this as a bug at this stage; for it to count as a bug, I'd have to be able to identify which heuristics are misfiring here.