Lately, I've been contributing to @vorner's excellent slipstream crate, and in this context I've been wanting to contribute better examples that truly reach peak hardware throughput.
After all, SIMD is all about performance, so if the crate's examples do not push the hardware to its limit, it may give newcomers familiar with the right orders of magnitude the wrong impression that the abstraction is not optimal and they would be better off using lower-level mechanisms like hardware intrinsics instead.
But one key to reaching peak hardware throughput, besides SIMD, is superscalar execution, which requires instruction-level parallelism. And unfortunately, neither simple loops nor iterator reduction methods like
sum() exhibit instruction-level parallelism when floating point data is involved, you need to apply extra loop unrolling for that.
Since adding loop unrolling greatly increases the cognitive complexity of simple code, @vorner was understandably concerned about the complexity of the new examples, so I tried to fix that.
My proposed fix is the iterator_ilp crate. It provides the
IteratorILP iterator extension, which lets you apply loop unrolling just by taking a standard iterator reduction, adding
_ilp to its name, and setting an unrolling factor, i.e. you go from...
let sum = floats.iter().sum::<f32>();
use iterator_ilp::IteratorILP; let sum = floats.iter().sum_ilp::<16, f32>();
...and voila, your floating-point reduction now has 16-ways instruction level parallelism for the compiler and hardware to take advantage of!
Now of course, there are a couple of caveats:
The iterator must have a known length that unsafe code can rely on (which is pretty much the definition of the unstable=> Turns out that was too strong a guarantee, and I only need a reliable lower bound on iterator length, which enables implementing IteratorILP for all iterators whose implementation of size_hint is presumed to be correct.
TrustedLentrait, but since it's unstable I needed to roll my own copy of it, which is not so nice).
- The reduction you are performing should not be sensitive to the number and order of iterator items that it observes.
- Sum and Product are a mess and I needed to use completely different (but hopefully practically equivalent) trait bounds to achieve the same result.
- And loop unrolling is a tricky optimization that should be applied sparingly and judiciously.
But all in all, that still sounds like a generally useful pattern worth sharing with the world, so there you have it