Hmm ... if p-values are not useful or accurate enough to be meaningful in this context, then why does criterion, a tool I (and, presumably, many others) implicitly trust by virtue of its being the de-facto standard for this sort of thing in Rust, prominently display p-values by default?
But even if we ignore the p-values, there's something deeply troubling about the results. Criterion takes tens of millions of iterations per implementation to gather these timings, and produces graphs like this:
This graph strongly implies that the time taken to execute the sample code is consistently very close to the gradient of the line shown ...
... and in gathering the points on this graph (almost 60 million iterations), a wide variety of ways of executing the code (in terms of Jim Keller's re-ordering of operations at the processor level) should have contributed to this sample.
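To make the "gradient of the line" reading concrete, here is a minimal sketch of the idea (not criterion's actual estimator, and with invented sample data): fit a straight line through the origin to (iteration count, total elapsed time) points by least squares, and read the per-iteration time off as the slope.

```rust
/// Fit a line through the origin to (iterations, total_time) points by least
/// squares; the slope is the estimated time per iteration. This only sketches
/// the idea behind criterion's linear-sampling plot, not its real analysis.
fn slope_through_origin(points: &[(f64, f64)]) -> f64 {
    let sum_xy: f64 = points.iter().map(|(x, y)| x * y).sum();
    let sum_xx: f64 = points.iter().map(|(x, y)| x * x).sum();
    sum_xy / sum_xx
}

fn main() {
    // Hypothetical measurements: (number of iterations, total nanoseconds).
    let samples = [
        (1_000.0, 2_010_000.0),
        (2_000.0, 3_980_000.0),
        (3_000.0, 6_050_000.0),
        (4_000.0, 7_990_000.0),
    ];
    let ns_per_iter = slope_through_origin(&samples);
    println!("estimated time per iteration: {ns_per_iter:.1} ns");
}
```

If the points hug the line as tightly as they do in these plots, the slope (and hence the reported per-iteration time) is pinned down very precisely by a single run.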
Then I run criterion again, and it produces a graph which is essentially identical (points very tightly hugging a straight line) except that the gradient of the line has changed by 20%.
This suggests that at around 12:34 pm on Tuesday, executing this code consistently takes time T, but at 12:57 pm on the same day, executing the same code consistently takes 1.2T (or 0.8T).
Put another way, the PDFs for the timings of the same code in subsequent criterion runs do not overlap! In fact, criterion even produces a plot to this effect:
I find such consistent inconsistency very puzzling. Something is clearly wrong.
If criterion (a tool which supposedly takes statistical fluctuations into account) regularly reports that code which has not changed has improved or regressed in performance by 20% (p-value = 0.00 when the null hypothesis is known to be true), then these reports are worse than useless: they are misleading.
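To see why non-overlapping runs are bound to produce p ≈ 0, here is a toy sketch (invented numbers, and a plain permutation test rather than whatever analysis criterion actually performs): two runs of the same code whose within-run scatter is tiny but whose means differ by 20% will be flagged as a "significant" change by essentially any two-sample test.

```rust
// Toy demonstration: tiny within-run scatter plus a 20% shift between runs
// yields p ~= 0, even though the code under test is identical.

// Minimal deterministic LCG so the example needs no external crates.
struct Lcg(u64);

impl Lcg {
    fn next_f64(&mut self) -> f64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (self.0 >> 11) as f64 / (1u64 << 53) as f64
    }
}

fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

/// Permutation test: how often does a random relabelling of the pooled
/// timings produce a mean difference at least as large as the observed one?
fn permutation_p_value(a: &[f64], b: &[f64], rounds: usize, rng: &mut Lcg) -> f64 {
    let observed = (mean(a) - mean(b)).abs();
    let mut pooled: Vec<f64> = a.iter().chain(b.iter()).copied().collect();
    let mut hits = 0;
    for _ in 0..rounds {
        // Fisher-Yates shuffle of the pooled sample.
        for i in (1..pooled.len()).rev() {
            let j = (rng.next_f64() * (i + 1) as f64) as usize;
            pooled.swap(i, j);
        }
        let (pa, pb) = pooled.split_at(a.len());
        if (mean(pa) - mean(pb)).abs() >= observed {
            hits += 1;
        }
    }
    hits as f64 / rounds as f64
}

fn main() {
    let mut rng = Lcg(42);
    // Run 1: timings tightly clustered around 100 ns per iteration.
    let run1: Vec<f64> = (0..100).map(|_| 100.0 + rng.next_f64()).collect();
    // Run 2: the same code, minutes later, now clustered around 120 ns.
    let run2: Vec<f64> = (0..100).map(|_| 120.0 + rng.next_f64()).collect();

    let p = permutation_p_value(&run1, &run2, 10_000, &mut rng);
    // Prints ~0.0000: a "significant" regression, yet nothing changed.
    println!("p-value: {p:.4}");
}
```

The p-value here honestly answers the question it was asked: are these two samples drawn from the same distribution? They are not. The problem is that the question is not the one I care about, which is whether the code got slower.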
Now, I doubt that criterion is useless: I suspect there's some effect which I'm missing.