I wrote the test code below for an empirical statistics experiment (it approximates the probability that you can form a triangle by breaking a line at two random locations).
I cannot get the main loop formed by the folded iterator to produce SIMD instructions. Is there something inherent to the fold closure that would prevent that?
use rand::random_iter;

pub fn main() {
    let iterations: usize = 1_000_000_000;
    let triangles = random_iter::<u32>()
        .zip(random_iter::<u32>())
        .take(iterations)
        .fold(0, |acc, (br1, br2)| acc + ((br1.abs_diff(br2) < 0x8000_0000) as u32));
    println!("{:#10}", triangles);
    println!("----------");
    println!("{:?}", iterations);
}
Without taking a look in Godbolt, the immediate main suspect is random_iter, which is inherently not free of data dependencies. That is, every random number emitted depends on the previous one (or, more precisely, on the previous PRNG state), and that's poison to the autovectorizer. It's not clear how this would be usefully vectorized by hand either, because the cost of the serial component (generating the random numbers) almost certainly dominates the parallelizable component (the fold). Using a 128-bit RNG would work because it could emit four u32s' worth of bits at a time.
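One way to try splitting things along those lines, as a sketch: generate a batch of pairs serially, then count over the batch with a dependency-free slice loop that the autovectorizer can handle. A tiny xorshift32 stands in for rand's generator here (an assumption, purely so the sketch is self-contained); the batching structure, not the generator, is the point.

```rust
// Stand-in PRNG (assumption): xorshift32, used instead of rand's
// generator so this sketch compiles on its own. Still serial: each
// output depends on the previous state.
fn xorshift32(state: &mut u32) -> u32 {
    *state ^= *state << 13;
    *state ^= *state >> 17;
    *state ^= *state << 5;
    *state
}

/// Count pairs whose abs_diff is below 2^31, generating in batches so
/// the counting loop runs over a plain slice with no loop-carried
/// dependency between iterations.
fn count_batched(samples: usize) -> u64 {
    let (mut s1, mut s2) = (0x1234_5678u32, 0x9abc_def1u32);
    let mut buf = [(0u32, 0u32); 4096];
    let mut triangles = 0u64;
    for _ in 0..samples / buf.len() {
        // Serial part: advance both PRNG streams to fill the batch.
        for slot in buf.iter_mut() {
            *slot = (xorshift32(&mut s1), xorshift32(&mut s2));
        }
        // Parallelizable part: a straight reduction over a slice.
        triangles += buf
            .iter()
            .map(|&(a, b)| (a.abs_diff(b) < 0x8000_0000) as u64)
            .sum::<u64>();
    }
    triangles
}

fn main() {
    println!("{}", count_batched(1_000_000));
}
```

Whether the counting loop actually vectorizes still depends on the optimizer, but at least the data dependency is confined to the fill loop.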
This would be very tricky to auto-vectorize, but technically the default ChaCha generator's state doesn't depend on the random outputs (just an increasing counter), so it's possible to generate outputs in parallel. I believe it already generates 16 u32s at a time using SIMD.
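To illustrate the counter-based idea in a self-contained way (assumption: splitmix64's mixing function standing in for ChaCha's block function), each output below is a pure function of the loop index, so iterations carry no data dependency and could in principle run in any order or in SIMD lanes:

```rust
// Counter-based generation: the i-th output depends only on i,
// not on any previous output. splitmix64's mixer is used as the
// per-counter hash (a stand-in for ChaCha's block function).
fn splitmix64(i: u64) -> u64 {
    let mut z = i.wrapping_mul(0x9E37_79B9_7F4A_7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

fn count_counter_based(samples: u64) -> u64 {
    (0..samples)
        .map(|i| {
            let r = splitmix64(i);
            // Split one 64-bit output into the two 32-bit break points.
            let (br1, br2) = ((r >> 32) as u32, r as u32);
            (br1.abs_diff(br2) < 0x8000_0000) as u64
        })
        .sum()
}

fn main() {
    println!("{}", count_counter_based(1_000_000));
}
```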
> the default ChaCha generator's state doesn't depend on the random outputs (just an increasing counter), so it's possible to generate outputs in parallel.
Doesn't an increasing counter also disable auto-vectorisation? I was also thinking the accumulator in the fold might make auto-vectorisation difficult.
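For what it's worth, an integer sum like the fold's accumulator is an associative reduction, and the usual vectorizer transformation is to split it into several independent partial sums. A hand-rolled sketch of that transformation (the function name and lane count are illustrative, not from the original code):

```rust
// Sketch of a multi-accumulator reduction: four independent partial
// sums instead of one loop-carried accumulator, mirroring what a
// vectorizer does with an associative integer reduction.
fn count_lt_half(pairs: &[(u32, u32)]) -> u32 {
    let mut acc = [0u32; 4];
    for chunk in pairs.chunks_exact(4) {
        for (lane, &(a, b)) in chunk.iter().enumerate() {
            // Each lane only updates its own partial sum.
            acc[lane] += (a.abs_diff(b) < 0x8000_0000) as u32;
        }
    }
    // Combine lanes, then handle the leftover tail scalar-style.
    let mut total: u32 = acc.iter().sum();
    for &(a, b) in pairs.chunks_exact(4).remainder() {
        total += (a.abs_diff(b) < 0x8000_0000) as u32;
    }
    total
}

fn main() {
    let pairs = [(0, 1), (0, u32::MAX), (5, 5), (0, 0x8000_0000), (0, 0x7FFF_FFFF)];
    println!("{}", count_lt_half(&pairs));
}
```

Since integer addition is associative, this reordering is exact, which is why compilers are allowed to vectorize such reductions without any fast-math-style flags.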