Hmm ... if p-values are not useful or accurate enough to be meaningful in this context, then why does criterion, a tool I (and, presumably, many others) implicitly trust by virtue of its being the de-facto standard for this sort of thing in Rust, prominently display p-values by default?
But even if we ignore the p-values, there's something deeply troubling about the results. Criterion takes tens of millions of iterations per implementation to gather these timings, and produces graphs like this:
This graph strongly implies that the time taken to execute the sample code is consistently very close to the gradient of the line shown ...
... and in gathering the points on this graph (almost 60 million iterations), a wide variety of ways of executing the code (in terms of Jim Keller's re-ordering of operations at the processor level) should have contributed to this sample.
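To make the "gradient of the line" reading concrete, here is a minimal sketch of the idea (not criterion's actual estimator, and with invented sample data): fit a straight line through the origin to (iteration count, total elapsed time) points by least squares, and read the per-iteration time off as the slope.

```rust
/// Fit a line through the origin to (iterations, total_time) points by least
/// squares; the slope is the estimated time per iteration. This only sketches
/// the idea behind criterion's linear-sampling plot, not its real analysis.
fn slope_through_origin(points: &[(f64, f64)]) -> f64 {
    let sum_xy: f64 = points.iter().map(|(x, y)| x * y).sum();
    let sum_xx: f64 = points.iter().map(|(x, y)| x * x).sum();
    sum_xy / sum_xx
}

fn main() {
    // Hypothetical measurements: (number of iterations, total nanoseconds).
    let samples = [
        (1_000.0, 2_010_000.0),
        (2_000.0, 3_980_000.0),
        (3_000.0, 6_050_000.0),
        (4_000.0, 7_990_000.0),
    ];
    let ns_per_iter = slope_through_origin(&samples);
    println!("estimated time per iteration: {ns_per_iter:.1} ns");
}
```

If the points hug the line as tightly as they do in these plots, the slope (and hence the reported per-iteration time) is pinned down very precisely by a single run.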
Then I run criterion again, and it produces a graph which is essentially identical (points very tightly hugging a straight line) except that the gradient of the line has changed by 20%.
This suggests that at around 12:34 pm on Tuesday, executing this code consistently takes time T, but at 12:57 pm on the same day, executing the same code consistently takes 1.2T (or 0.8T).
Put another way, the PDFs for the timings of the same code in subsequent criterion runs do not overlap! In fact, criterion even produces a plot to this effect:
I find such consistent inconsistency very puzzling. Something is clearly wrong.
If criterion (a tool which supposedly takes statistical fluctuations into account) regularly reports that code which has not changed has improved or regressed in performance by 20% (p-value = 0.00 when the null hypothesis is known to be true), then these reports are worse than useless: they are misleading.
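To see why non-overlapping runs are bound to produce p ≈ 0, here is a toy sketch (invented numbers, and a plain permutation test rather than whatever analysis criterion actually performs): two runs of the same code whose within-run scatter is tiny but whose means differ by 20% will be flagged as a "significant" change by essentially any two-sample test.

```rust
// Toy demonstration: tiny within-run scatter plus a 20% shift between runs
// yields p ~= 0, even though the code under test is identical.

// Minimal deterministic LCG so the example needs no external crates.
struct Lcg(u64);

impl Lcg {
    fn next_f64(&mut self) -> f64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (self.0 >> 11) as f64 / (1u64 << 53) as f64
    }
}

fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

/// Permutation test: how often does a random relabelling of the pooled
/// timings produce a mean difference at least as large as the observed one?
fn permutation_p_value(a: &[f64], b: &[f64], rounds: usize, rng: &mut Lcg) -> f64 {
    let observed = (mean(a) - mean(b)).abs();
    let mut pooled: Vec<f64> = a.iter().chain(b.iter()).copied().collect();
    let mut hits = 0;
    for _ in 0..rounds {
        // Fisher-Yates shuffle of the pooled sample.
        for i in (1..pooled.len()).rev() {
            let j = (rng.next_f64() * (i + 1) as f64) as usize;
            pooled.swap(i, j);
        }
        let (pa, pb) = pooled.split_at(a.len());
        if (mean(pa) - mean(pb)).abs() >= observed {
            hits += 1;
        }
    }
    hits as f64 / rounds as f64
}

fn main() {
    let mut rng = Lcg(42);
    // Run 1: timings tightly clustered around 100 ns per iteration.
    let run1: Vec<f64> = (0..100).map(|_| 100.0 + rng.next_f64()).collect();
    // Run 2: the same code, minutes later, now clustered around 120 ns.
    let run2: Vec<f64> = (0..100).map(|_| 120.0 + rng.next_f64()).collect();

    let p = permutation_p_value(&run1, &run2, 10_000, &mut rng);
    // Prints ~0.0000: a "significant" regression, yet nothing changed.
    println!("p-value: {p:.4}");
}
```

The p-value here honestly answers the question it was asked: are these two samples drawn from the same distribution? They are not. The problem is that the question is not the one I care about, which is whether the code got slower.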
Now, I doubt that criterion is useless: I suspect there's some effect which I'm missing.