I like how Rust nicely exposes these parallelism options, and I'm trying to learn more about them. The playground link shows three versions; I expected the one combining SIMD with parallelism over Rayon chunks to be the fastest. From system monitoring and the generated asm, it seems this version keeps the SIMD optimization while using all of my laptop's CPUs. However, a simple `for_each` gives the same timings. It may be relevant that the Rayon version with chunks doesn't use the CPUs at 100%, unlike the pure SIMD version with a single auto-vectorized iterator. I'd appreciate your feedback on this, as I'm really not an expert: I'm moving from Python to Rust and lower-level programming. Here are some questions that I hope may be of general interest:

- Is it because the CPU isn't the actual bottleneck?
- How would it change if, instead of chunks over a single vector, there were separate vectors?
- In general, do we expect different behavior with slices, because of possibly faster access?
Note that the allocation accounts for about 3/4 of the total execution time (see the playground).
Thank you for the numerous threads and the support here. If you find I missed something, please simply reference it in a reply.