Is My Criterion Benchmarking Code Actually Answering My Question?

The problem is that it's extremely difficult to extrapolate the results of nano-benchmarks to real code -- especially on modern superscalar desktop chips. And that's before considering what the optimizer will do with the code: if you measure general division, for example, the results are irrelevant for x / 10, because the compiler doesn't emit a division instruction for that. It typically replaces division by a constant with a multiply and a shift.
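To make the division point concrete, here is a minimal std-only sketch (not Criterion, and not anything from the original benchmark). The names `sum_div_const`, `sum_div_runtime`, and `time_it` are made up for illustration. Both functions compute the same thing when the divisor is 10, but only the second one forces the compiler to handle an arbitrary divisor, so timing one tells you little about the other.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Divisor is a compile-time constant, so the optimizer is free to
/// replace the division with cheaper arithmetic.
pub fn sum_div_const(values: &[u32]) -> u32 {
    values.iter().fold(0u32, |acc, &v| acc.wrapping_add(v / 10))
}

/// Divisor is only known at run time. Panics if `divisor` is 0.
pub fn sum_div_runtime(values: &[u32], divisor: u32) -> u32 {
    values
        .iter()
        .fold(0u32, |acc, &v| acc.wrapping_add(v / divisor))
}

/// Hypothetical helper: calls `f` `iters` times and returns the total
/// wall-clock time. `black_box` keeps the result from being discarded.
pub fn time_it<F: FnMut() -> u32>(iters: u32, mut f: F) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        black_box(f());
    }
    start.elapsed()
}
```

To compare the two, pass the divisor through `black_box` in the runtime case, e.g. `time_it(1_000, || sum_div_runtime(&data, black_box(10)))`, so the optimizer can't see that it is really a constant. Even then, treat the numbers as a hint about these two loops, not about division in general.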

CAD97 put together some great benchmarks in "Converting a BGRA &[u8] to RGB [u8;N] (for images)?" (post #13) that show just how hard it is to predict how something will perform in aggregate. A bunch of operations show up as essentially free because of instruction-level parallelism (ILP), speculation, and the like -- in fact, one of the versions with the most instructions ends up being one of the fastest, and the one with the fewest instructions is one of the slowest.

So it's critical to find a bigger chunk to measure. Ideally that's something with a meaningful loop that can run on both smaller and larger instances of the problem -- how the unrolling and vectorization end up can often matter more than how a single run of the loop body performs in isolation.
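As a rough sketch of what "measure a bigger chunk at several sizes" could look like, the following uses only the standard library and a simple BGRA-to-RGB loop as a stand-in kernel. It is not CAD97's code, and `bgra_to_rgb` and `ns_per_pixel` are hypothetical names. The idea is to time the whole loop and report cost per pixel, so differences in unrolling and vectorization show up as the size changes.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Stand-in kernel: drops the alpha byte and swaps B and R for each pixel.
/// Trailing bytes that don't form a full 4-byte pixel are ignored.
pub fn bgra_to_rgb(src: &[u8], dst: &mut Vec<u8>) {
    dst.clear();
    for px in src.chunks_exact(4) {
        dst.extend_from_slice(&[px[2], px[1], px[0]]);
    }
}

/// Times `reps` full conversions of a `pixels`-sized image and returns
/// the average nanoseconds per pixel. Both arguments should be non-zero.
pub fn ns_per_pixel(pixels: usize, reps: u32) -> f64 {
    // Arbitrary but deterministic input data.
    let src: Vec<u8> = (0..pixels * 4).map(|i| i as u8).collect();
    let mut dst = Vec::with_capacity(pixels * 3);

    let start = Instant::now();
    for _ in 0..reps {
        // black_box hides the input from the optimizer and keeps the
        // output alive, so the loop isn't folded away.
        bgra_to_rgb(black_box(src.as_slice()), &mut dst);
        black_box(&dst);
    }
    start.elapsed().as_nanos() as f64 / (reps as f64 * pixels as f64)
}
```

Calling `ns_per_pixel` for a range of sizes (say 16, 1_024, and 1_048_576 pixels) and comparing the per-pixel figures gives a much better picture than timing one pixel's worth of work. A hand-rolled timer like this lacks the warm-up and statistics Criterion provides, so the same per-size structure is worth carrying over to a real Criterion benchmark.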
