I’m trying to analyze the runtime of a function f() and understand which of its components contribute most to its total execution time.
Using both Criterion and manual timing with Instant::now() (i.e. by doing let start = Instant::now(); at the beginning and then printing start.elapsed() at the end), I measured f() to take roughly 40 ms.
When I use the latter technique on its components, one function g() inside f() seems to take about 17ms, and another, h(), about 21-22ms; together these add up to the ~40 ms total.
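To be concrete, the manual timing pattern looks roughly like this (g() and h() here are just dummy stand-ins doing arbitrary work, not my real code):

```rust
use std::time::Instant;

// Placeholder components standing in for the real g() and h().
fn g() -> Vec<u64> {
    (0..1_000_000).collect()
}

fn h(x: &[u64]) -> u64 {
    x.iter().sum()
}

fn f() -> u64 {
    let start = Instant::now();

    let t = Instant::now();
    let x = g();
    println!("g() took {:?}", t.elapsed());

    let t = Instant::now();
    let y = h(&x);
    println!("h() took {:?}", t.elapsed());

    println!("f() took {:?}", start.elapsed());
    y
}

fn main() {
    f();
}
```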
When I use Criterion, however, the latter function h() only takes about 11-12ms. I've tried two variants, but both give the same result:
fn bench_h(c: &mut Criterion) {
    let x = g();
    c.bench_function("…", |b| {
        b.iter(|| {
            let y = h(x);
            black_box(y);
        });
    });
}
fn bench_h(c: &mut Criterion) {
    c.bench_function("…", |b| {
        b.iter_with_setup(
            || g(),
            |x| {
                let y = h(x);
                black_box(y);
            },
        );
    });
}
This discrepancy is confusing. I'd like to understand:
Why does Criterion report such a different number?
Is it due to optimization, warm-up, CPU frequency scaling, or something else?
What’s the best way to get realistic per-component timings within a larger function?
Manually printing timing info gets messy fast, so I’m hoping there’s a better, more reliable way.
This is incomplete black-boxing; you need to put black_box() around the value x to ensure that h was not optimized with knowledge of x or g. Also, you can return the y value from the closure, which will automatically black-box it and avoid measuring the time it takes to drop y.
b.iter(|| h(black_box(x)));
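If you want to sanity-check this pattern outside Criterion, the same idea works with std::hint::black_box (stable since Rust 1.66); g() and h() below are dummy stand-ins, not your code:

```rust
use std::hint::black_box;
use std::time::Instant;

// Dummy stand-ins for the real g() and h().
fn g() -> Vec<u64> {
    (0..1_000_000).collect()
}

fn h(x: &[u64]) -> u64 {
    x.iter().map(|v| v.wrapping_mul(31)).sum()
}

fn main() {
    let x = g();
    let start = Instant::now();
    let mut last = 0u64;
    for _ in 0..100 {
        // black_box the input so h is not specialized on a known value,
        // and black_box the output so the call is not optimized away.
        last = black_box(h(black_box(x.as_slice())));
    }
    println!("100 iterations of h(): {:?}", start.elapsed());
    println!("{last}");
}
```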
It almost certainly is due to “optimization” — but not necessarily purposeful actions by the compiler. Any difference in the code at all can cause effectively random changes that lead to the compiler making different decisions — as minor as what order or alignment the resulting machine code is stored in — that happen to affect the execution time. Essentially: your measurement of performance is noisy, and not all of the noise is in the actual execution and time measurement (the part that using a benchmark tool is supposed to help with); some of the noise is in the compiler itself, in the behavior of this enormously complex function from “source code you wrote” to “machine code your computer ran”.
Optimization with inlining means that, in a sense, the answer to this question is “there is no such thing”. When you split an algorithm into separate steps that you can measure, you change how it is executed.
I'm surprised the effects are large enough to notice in a function that takes many ms to run, however, so there could quite possibly be something else going on in the particular case — but you haven’t shared any of the code your benchmark is actually running, so I don’t have much information to speculate on.
Ah I just had another look and this is actually how the benchmark is currently set up. Good to know regardless that this is important, I was not aware of that.
But the rest of f() strikes me as very “sequential” (g() is slow because it’s doing a lot of hashing), and hence hard to optimise relative to its individual components. (ChatGPT is blaming “cache pollution”, but the inputs of h() are just two large vectors which I’m setting up with Vec::with_capacity(), so that seems hard to “pollute”, I would say?)
Edit: I’ll try to write a small prototype tomorrow and see what happens.
Meanwhile, I’m wondering: if benchmarking g() and h() yields 17ms and 12ms, wouldn’t you expect f() to take at most 17+12ms? And possibly less with optimizations? Under what circumstances would you expect the opposite?
One possible way is that optimizing the two functions together can make them slower. Of course, we don’t want that to happen, and the optimizer isn’t trying to, but remember that the optimizer doesn’t actually know what the important parts of your program are, or exactly what they will be doing — it constantly has to guess “should I assume this branch is likely or that one? should I inline this function? should I unroll this loop?” — and it's possible for it to guess wrong in a way that results in the program being slower. This is what I mean when I say the compiler is a noisy function — we have generally achieved the goal that nearly all programs are faster with optimization than without, but that's not the same as saying that all programs are the most faster they could be.
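One knob you do have is pinning the inlining decision yourself, so that the split-out measurement at least corresponds to the same machine code that runs inside f(). A sketch, again with dummy functions standing in for yours:

```rust
use std::hint::black_box;

// The attributes force the compiler to keep g and h as standalone
// functions instead of merging them into the caller.
#[inline(never)]
fn g() -> Vec<u64> {
    (0..1000).collect()
}

#[inline(never)]
fn h(x: &[u64]) -> u64 {
    x.iter().sum()
}

fn f() -> u64 {
    // With inlining disabled, f() is genuinely "g then h", so the
    // per-component timings have a chance of adding up to the whole.
    let x = g();
    h(black_box(x.as_slice()))
}

fn main() {
    println!("{}", f());
}
```

Note that this changes what you are measuring: you are deliberately giving up cross-function optimization in exchange for measurements that compose.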
Profile-guided optimization (PGO) can improve this situation by giving the optimizer feedback about the run-time behavior of the program, so it can make fewer incorrect guesses. I don't know much about it, though.
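For the record, the cargo-level flow is roughly the following (binary name and input are placeholders; the rustc book's PGO chapter has the authoritative steps):

```shell
# 1. Build with instrumentation that records runtime profiles.
RUSTFLAGS="-Cprofile-generate=/tmp/pgo-data" cargo build --release

# 2. Run the binary on representative workloads; raw profiles land in /tmp/pgo-data.
./target/release/myapp typical-input

# 3. Merge the raw profiles (llvm-profdata ships with the llvm-tools component).
llvm-profdata merge -o /tmp/pgo-data/merged.profdata /tmp/pgo-data

# 4. Rebuild, letting the optimizer use the recorded behavior.
RUSTFLAGS="-Cprofile-use=/tmp/pgo-data/merged.profdata" cargo build --release
```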
All that said, the performance difference you are measuring could also be entirely due to some measurement error in how the benchmark is set up, or yet another effect. Take all the above not as complete and authoritative advice, but as an example of why I “expect the opposite” — or rather, I expect “static” random noise in my measurements and don't trust any one change in the program structure that makes it faster to be the correct thing to do in general — the difference could disappear or flip the other way when I make yet another change. When we run benchmarks we get single-point samples of the enormously complex space of "all programs that produce the same results as this program". We don't have a picture of the whole thing, and since the problem is discrete we can't even ask what the local gradient is.
Also note that there can be performance differences in the hardware for seemingly no apparent reasons. Things like which addresses code is at can affect aliasing in caches, or the branch predictor behaviour etc. Usually these effects are small, but there are pathological cases, especially on older microarchitectures.
For example, I remember there was some older Intel generation (Haswell perhaps? That sort of era anyway) where code alignment could cause surprising performance differences due to only being able to store prediction for one branch per 16 byte(?) region. So depending on how loops were aligned, you could get very different behaviour. (It has been many years, and the details are murky in my mind, but the general principle holds.)
I have also heard of cases where the micro-op cache had surprising behaviour.
Is this the case for you? No way to tell given what you have posted. Probably not. More likely it is things the compiler is doing. But my point here is that measuring performance is tricky, and microbenchmarks lie. It is better to measure in the context of the full application.
For example, rust std does not use the fastest way to format floating point numbers. Why? Because the fastest way relies on lookup tables, which are likely to lead to cache contention, making the rest of the program slower. Most programs do not spend the majority of their time formatting floats, so it is better to be slightly slower at it, and avoid poisoning the L1 d-cache.
Embarrassing update: the next day f() dropped to around 30 ms. The only reason I can think of is that I had hundreds of browser tabs open, which might've been interfering with memory somehow. Did a lot of rewriting and will give the code another shot soon...