IteratorILP: Speeding up iterator reductions with instruction-level parallelism

Hi everyone!

Lately, I've been contributing to @vorner's excellent slipstream crate, and as part of that I've wanted to contribute better examples that truly reach peak hardware throughput.

After all, SIMD is all about performance, so if the crate's examples do not push the hardware to its limits, newcomers who know the right orders of magnitude may get the wrong impression that the abstraction is suboptimal and that they would be better off using lower-level mechanisms like hardware intrinsics instead.

But one key to reaching peak hardware throughput, besides SIMD, is superscalar execution, which requires instruction-level parallelism. And unfortunately, neither simple loops nor iterator reduction methods like sum() exhibit instruction-level parallelism when floating-point data is involved; you need to apply extra loop unrolling for that.

Since adding loop unrolling greatly increases the cognitive complexity of simple code, @vorner was understandably concerned about the complexity of the new examples, so I tried to fix that.

My proposed fix is the iterator_ilp crate. It provides the IteratorILP iterator extension, which lets you apply loop unrolling just by taking a standard iterator reduction, adding _ilp to its name, and setting an unrolling factor, i.e. you go from...

```rust
let sum = floats.iter().sum::<f32>();
```

...to...

```rust
use iterator_ilp::IteratorILP;
let sum = floats.iter().sum_ilp::<16, f32>();
```

...and voila, your floating-point reduction now has 16-way instruction-level parallelism for the compiler and hardware to take advantage of!
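Under the hood, the trick is simply to accumulate into several independent variables, so that the CPU can overlap their dependency chains instead of waiting for one long serial chain of additions. Here's a minimal hand-rolled sketch of what a 4-way unrolled sum effectively does (the `sum_ilp4` name is hypothetical, for illustration only, not the crate's API):

```rust
// Hand-rolled sketch of a 4-way unrolled sum with independent accumulators.
fn sum_ilp4(floats: &[f32]) -> f32 {
    let mut accs = [0.0f32; 4];
    let mut chunks = floats.chunks_exact(4);
    for chunk in &mut chunks {
        // The four accumulators have no data dependency on each other,
        // so a superscalar CPU can run these additions in parallel.
        for (acc, &x) in accs.iter_mut().zip(chunk) {
            *acc += x;
        }
    }
    // Handle leftover elements, then merge the accumulators.
    let mut sum: f32 = chunks.remainder().iter().sum();
    for acc in accs {
        sum += acc;
    }
    sum
}

fn main() {
    let floats: Vec<f32> = (1..=10).map(|i| i as f32).collect();
    let sum = sum_ilp4(&floats);
    assert_eq!(sum, 55.0); // small integer-valued sum, exact in f32
    println!("sum = {sum}");
}
```

Note that this computes a different grouping of additions than a left-to-right scalar sum, which is exactly why the caveats below exist.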

Now of course, there are a couple of caveats:

  • The iterator must have a known length that unsafe code can rely on (which is pretty much the definition of the unstable TrustedLen trait, but since it's unstable I had to roll my own copy of it, which is not so nice). => It turns out that was too strong a guarantee: I only need a reliable lower bound on the iterator's length, which makes it possible to implement IteratorILP for all iterators whose size_hint implementation is presumed to be correct.
  • The reduction you are performing should not be sensitive to the number and order of iterator items that it observes.
  • Sum and Product are a mess and I needed to use completely different (but hopefully practically equivalent) trait bounds to achieve the same result.
  • And loop unrolling is a tricky optimization that should be applied sparingly and judiciously.

But all in all, that still sounds like a generally useful pattern worth sharing with the world, so there you have it :slight_smile:

Bear in mind that the reason why the compiler often doesn't automatically vectorize floating-point operations is that they aren't associative, so such optimizations are invalid. Not something the average user has to worry about too much, but it's worth keeping in mind.

Well, they are invalid in the context of a compiler that tries to preserve the semantics of a scalar loop. But often, users don't actually want a scalar loop: in the common scenario of summing FP numbers of identical sign and comparable magnitude, a naive scalar sum is actually the worst solution from a round-off error perspective.
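To make that concrete, here's a sketch (with hypothetical helper names, not the crate's API) where a strict left-to-right f32 sum of same-sign values silently loses millions of increments, while a 4-accumulator sum stays exact because each accumulator remains below the 2^24 integer-precision limit of f32:

```rust
// Strict left-to-right sum: once s reaches 2^24 = 16_777_216,
// adding 1.0 no longer changes it (the increment rounds away).
fn scalar_sum(n: u32) -> f32 {
    let mut s = 0.0f32;
    for _ in 0..n {
        s += 1.0;
    }
    s
}

// Round-robin over 4 accumulators: each one stays small enough
// that every increment is represented exactly.
fn unrolled_sum(n: u32) -> f32 {
    let mut accs = [0.0f32; 4];
    for i in 0..n {
        accs[(i % 4) as usize] += 1.0;
    }
    accs.iter().sum()
}

fn main() {
    let n = 20_000_000;
    let s = scalar_sum(n);
    let u = unrolled_sum(n);
    assert_eq!(s, 16_777_216.0); // lost ~3.2 million increments
    assert_eq!(u, 20_000_000.0); // exact
    println!("scalar = {s}, unrolled = {u}");
}
```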

I tried to explain the underlying situation in the crate documentation, please feel free to read through it and tell me if you think I should have spelled out something better.