Why does the compiler prefer to do loop unrolling rather than vectorization? Is this because the compiler is not sure of the data alignment needed for SIMD? How can it be pushed to vectorize using the example code above?
Vectorization is applied in mean_simd. Does this mean that the compiler has assumed that the input data will be aligned, and that I somehow have to guarantee this to it?
The performance of mean_naive and mean_fast differs by a factor of 4; the difference is that mean_fast unrolls the loop more. Is choosing the optimal degree of loop unrolling a hard problem for the compiler, so that unrolling has to be done manually/experimentally?
As an aside, there's a utility method on slices, as_simd, which can be used to make mean_simd a little more generic:
// Requires nightly with #![feature(portable_simd)].
use std::simd::prelude::*;

pub fn mean_simd_slice(x: &[f64]) -> f64 {
    const LANES: usize = 4;
    assert!(x.len() >= LANES);
    // Split into an unaligned prefix, an aligned run of Simd vectors, and a suffix.
    let (prefix, main, suffix) = x.as_simd::<LANES>();
    let prefix_sum = prefix.iter().sum::<f64>();
    let suffix_sum = suffix.iter().sum::<f64>();
    // Lane-wise ("vertical") sums, then a single horizontal reduction at the end.
    let main_sum = main.iter().sum::<Simd<f64, LANES>>().reduce_sum();
    (prefix_sum + suffix_sum + main_sum) / x.len() as f64
}
You could set LANES to 64 and be guaranteed use of SIMD registers up to 4096 bits wide (64 elements) if the CPU you're running on supports them; the compiler also knows enough to split the work across smaller registers on CPUs with narrower SIMD units, exploiting the fact that you've already opted out of a strict left-to-right summation order.
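To make that concrete, here's a minimal sketch (assuming nightly with portable SIMD; the name mean_simd_lanes and the tiny main harness are mine, purely for illustration) that turns LANES into a const generic, so the caller can ask for 64 lanes and let the backend map that onto whatever register width the target really has:

// Hypothetical sketch: LANES chosen by the caller via a const generic.
// Requires nightly with #![feature(portable_simd)].
use std::simd::{prelude::*, LaneCount, SupportedLaneCount};

pub fn mean_simd_lanes<const LANES: usize>(x: &[f64]) -> f64
where
    LaneCount<LANES>: SupportedLaneCount,
{
    // Unaligned prefix/suffix are summed as scalars, the aligned middle as Simd vectors.
    let (prefix, main, suffix) = x.as_simd::<LANES>();
    let scalar_sum: f64 = prefix.iter().chain(suffix).sum();
    let vector_sum = main.iter().sum::<Simd<f64, LANES>>().reduce_sum();
    (scalar_sum + vector_sum) / x.len() as f64
}

fn main() {
    let data: Vec<f64> = (1..=1000).map(f64::from).collect();
    // 64 lanes of f64 = 4096 bits; the backend splits this across the
    // hardware's real register width.
    assert_eq!(mean_simd_lanes::<64>(&data), 500.5);
}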
This autovectorizes, is safe, works on stable, and doesn't need explicit SIMD:
pub fn mean_fast(a: &[f64]) -> f64 {
    const CHUNK_SIZE: usize = 4;
    let mut chunks = a.chunks_exact(CHUNK_SIZE);
    // One accumulator per lane ("column"), so the lane sums are independent
    // and the loop can be vectorized.
    let mut acc = [0.0; CHUNK_SIZE];
    for chunk in &mut chunks {
        for (acc, elem) in acc.iter_mut().zip(chunk) {
            *acc += elem;
        }
    }
    // Fold in the lane accumulators and the leftover tail elements.
    acc.iter().chain(chunks.remainder()).sum::<f64>() / a.len() as f64
}
The difference from your mean_fast is that, if you view the array as a grid where each row has CHUNK_SIZE elements, your version sums row by row and then combines the row sums, whereas autovectorization needs you to sum column by column and only combine the columns at the end.
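To make the grid picture concrete, a "row by row" version would look roughly like this (a hypothetical sketch for contrast only, not your actual mean_fast):

// Hypothetical "row by row" variant, shown only for contrast with mean_fast above.
pub fn mean_rowwise(a: &[f64]) -> f64 {
    const CHUNK_SIZE: usize = 4;
    let mut chunks = a.chunks_exact(CHUNK_SIZE);
    let mut total = 0.0;
    for chunk in &mut chunks {
        // Each "row" (chunk) is collapsed to one scalar immediately, so every
        // iteration depends on the previous value of `total`.
        total += chunk.iter().sum::<f64>();
    }
    (total + chunks.remainder().iter().sum::<f64>()) / a.len() as f64
}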
Intriguingly, if I increase CHUNK_SIZE in your version to between 8 and 32, it's willing to use 8-element (512-bit) vector registers; if I go to a CHUNK_SIZE of 64 (to match the largest value of LANES that works in mean_simd_slice), the autovectorizer drops back to 4-element (256-bit) registers, while mean_simd_slice using portable SIMD still uses 512-bit registers.
See the two examples in one place for comparison - this is set to 32-element chunks, where it all works fine for both, but if you increase to 64, the autovectorizer falls back to 256-bit registers, while portable SIMD still uses 512-bit registers.
Well, the 512-bit situation is complicated by the penalties associated with the power/frequency shift-up that some Intel CPUs need to get the 512-bit units running at full speed. If you specify a concrete CPU (e.g. znver4) instead of the generic -v4 (x86-64-v4) target, it'll produce different results.
And I guess those code paths (autovectorization vs. explicit vector types) just hit slightly different heuristics.
I'm just surprised that the heuristics have this drop in register size as the chunk size goes up; having portable SIMD pick 512-bit registers when autovectorization picks 256-bit (or vice versa) would not be surprising at all, for all the reasons you give. But having a chunk size of 32 elements pick 512-bit registers, then a chunk size of 64 elements pick 256-bit registers, surprises me.
And if portable SIMD had also dropped down to 256-bit registers when it hit a chunk size of 64 elements, I'd be looking at optimization manuals to see if there's a good reason for this that I'd not picked up. It's the combination of "smaller registers for more elements" and "alternative route still picks big registers" that's intriguing.
Also, I'd not class this as a bug at this stage; for it to count as a bug, I'd have to be able to identify which heuristics are misfiring here.
For some reason, the second version of the scalar product runs 10x slower: godbolt
// Requires nightly with #![feature(portable_simd)].
use std::simd::{prelude::*, StdFloat};

pub fn scalar_product_1(a: &[f64], b: &[f64]) -> f64 {
    const LANES: usize = 8;
    assert!(a.len() == b.len() && a.len() >= LANES);
    let dd = a.len() % LANES;
    // `scalar_product_one` (defined elsewhere) handles the short prefix without SIMD.
    let prefix_sum = scalar_product_one(&a[..dd], &b[..dd]);
    let acc = a[dd..]
        .chunks_exact(LANES)
        .zip(b[dd..].chunks_exact(LANES))
        .map(|(aa, bb)| {
            let x = Simd::<f64, LANES>::from_slice(aa);
            let y = Simd::<f64, LANES>::from_slice(bb);
            // Separate multiply; the additions happen in `sum`.
            x * y
        })
        .sum::<Simd<f64, LANES>>()
        .reduce_sum()
        + prefix_sum;
    acc
}

pub fn scalar_product_2(a: &[f64], b: &[f64]) -> f64 {
    const LANES: usize = 8;
    assert!(a.len() == b.len() && a.len() >= LANES);
    let dd = a.len() % LANES;
    let prefix_sum = scalar_product_one(&a[..dd], &b[..dd]);
    let acc = a[dd..]
        .chunks_exact(LANES)
        .zip(b[dd..].chunks_exact(LANES))
        .fold(Simd::<f64, LANES>::splat(0.0), |acc, (aa, bb)| {
            let x = Simd::<f64, LANES>::from_slice(aa);
            let y = Simd::<f64, LANES>::from_slice(bb);
            // Fused multiply-add: x * y + acc with a single rounding.
            x.mul_add(y, acc)
        })
        .reduce_sum()
        + prefix_sum;
    acc
}
Judging by the disassembly, the second version uses the VFMADD132PD instruction, which corresponds to the AVX512F processor flag. My processor supports this flag.
Why can there be such a dramatic difference? Is it due to data alignment?
The second version uses a fused multiply-add operation, while the first version does two separate operations (a multiply, then an add). The two are different, since a fused multiply-add rounds only once and so has more precision, and, if supported in the hardware, it can also be faster.
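To see the precision point in isolation, here's a tiny scalar sketch (the input values are chosen purely to expose the rounding difference): a * b + c rounds twice, while f64::mul_add rounds once.

fn main() {
    // Chosen so that a * b is not exactly representable: a * b = 1 - EPSILON^2.
    let (a, b, c) = (1.0 + f64::EPSILON, 1.0 - f64::EPSILON, -1.0);
    let separate = a * b + c;    // a * b rounds to 1.0, so the result is 0.0
    let fused = a.mul_add(b, c); // one rounding at the end: -EPSILON^2, not 0.0
    println!("separate = {separate:e}, fused = {fused:e}");
}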