Why is Sum::sum so much slower than a manual loop?


#1

So I noticed this while profiling my path tracer. My dot product was using sum internally, and I was getting ~14 FPS; with a normal loop, I was getting ~23 FPS.

This playground link shows the same issue (for me anyway): http://play.rust-lang.org/?gist=564c477ac1347b452ebd2ae784094875&version=stable&mode=release&edition=2015

This is in release mode too. I’d expect it to optimize down to roughly the same thing. Is there anything I can do to encourage that optimization? Or should I just… write my own loop?
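
Roughly, the two shapes look something like this (a simplified sketch rather than my exact code; assume f32 slices):

    // Sketch: iterator-based dot product that goes through Sum::sum...
    fn dot_sum(a: &[f32], b: &[f32]) -> f32 {
        a.iter().zip(b.iter()).map(|(x, y)| x * y).sum()
    }

    // ...versus the same dot product written as an explicit loop.
    fn dot_loop(a: &[f32], b: &[f32]) -> f32 {
        let mut total = 0.0;
        for i in 0..a.len().min(b.len()) {
            total += a[i] * b[i];
        }
        total
    }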


#2

Performance on the playground doesn’t seem to be very stable. For example, just now I got:

Sum was 499998500001, calculation took: Duration { secs: 0, nanos: 422543 }
Loop sum was 499998500001, calculation took: Duration { secs: 0, nanos: 779491 }

On my own system the results were still quite noisy, but on average sum was about 20% faster than the for loop.


#3

Interesting… I suppose I’m glad it’s not a general problem with sum then. Though I’m not sure how to diagnose exactly why it’s so much slower in my case.


#4

If you swap the order around, the first one is still slower than the second one, so my bet is on cache shenanigans.

Edit: By “first one is still slower” I meant that the sum becomes the faster one and the loop becomes the slower one.


#5

Let’s ask Godbolt!

A quick diff shows that the generated assembly is identical for both summation methods, label names aside.
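
For anyone who wants to reproduce this, a pair of standalone functions along these lines can be pasted into Compiler Explorer with optimizations enabled (a sketch, not the exact playground code):

    // Sketch of the two summation methods as standalone functions for
    // Compiler Explorer; the u64 range stands in for the playground code.
    pub fn sum_iter(n: u64) -> u64 {
        (0..n).sum()
    }

    pub fn sum_loop(n: u64) -> u64 {
        let mut total = 0;
        for i in 0..n {
            total += i;
        }
        total
    }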


#6

With more complicated iterators, it might matter that Sum uses a fold internally instead of an open for loop. This ought to benefit performance in most cases, or at least be neutral, but perhaps you’ve found an exception?
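
Concretely, for the simple integer case Sum::sum boils down to a fold, so the two calls below should compute the same thing; this is a sketch of the equivalence rather than the exact standard-library source:

    // Sketch: the explicit Sum::sum call next to the fold it roughly
    // desugars to; not the exact std implementation.
    fn main() {
        let via_sum: u64 = (0..1_000_000u64).sum();
        let via_fold: u64 = (0..1_000_000u64).fold(0, |acc, x| acc + x);
        assert_eq!(via_sum, via_fold);
    }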


#7

As a more general piece of advice, I would suggest going for longer-lived microbenchmarks in the future. I personally always aim for something that runs for at least a couple of seconds. The reason is that the shorter your benchmark is, the more sensitive it gets to OS and hardware details:

  • A misplaced context switch in the OS can dramatically affect millisecond-scale timings.
  • Modern CPUs have an optimization for burst workloads where they temporarily operate above their nominal frequency (Intel calls this Turbo Boost).
  • Conversely, some CPU features are so power-intensive that they cause CPUs to drop below their nominal frequency after operating for a while (most AVX implementations are guilty of this).
  • As @quadrupleslap pointed out, it takes a while for caches (including branch predictors and the like) to warm up.

Microbenchmarking is generally a dark art, but the shorter your timings are, the more difficult it gets.
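
As a rough illustration of that “couple of seconds” idea (the time budget, workload size, and warm-up pass below are arbitrary placeholders, not recommendations):

    // Hand-rolled sketch: repeat the work until a couple of seconds have
    // elapsed, then report the average per-iteration time.
    use std::time::{Duration, Instant};

    fn main() {
        let data: Vec<u64> = (0..1_000_000).collect();

        // Warm-up pass so caches and branch predictors settle before timing.
        let _ = data.iter().sum::<u64>();

        let start = Instant::now();
        let mut iterations = 0u32;
        let mut checksum = 0u64;
        while start.elapsed() < Duration::from_secs(2) {
            checksum = checksum.wrapping_add(data.iter().sum::<u64>());
            iterations += 1;
        }
        // Printing the checksum keeps the compiler from optimizing the sums away.
        println!(
            "{} iterations, ~{:?} per iteration (checksum {})",
            iterations,
            start.elapsed() / iterations.max(1),
            checksum
        );
    }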


#8

Python has timeit for stuff like this, but I don’t think there’s a (popular) equivalent for Rust.


#9

Isn’t that what criterion and the unstable cargo bench are supposed to do?
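
For example, with criterion the comparison could look roughly like the sketch below (it also needs criterion in [dev-dependencies] and a [[bench]] entry with harness = false in Cargo.toml; the data size is an arbitrary placeholder):

    // benches/sum.rs: a sketch of a criterion benchmark comparing the two versions.
    use criterion::{black_box, criterion_group, criterion_main, Criterion};

    fn sum_benchmarks(c: &mut Criterion) {
        let data: Vec<u64> = (0..1_000_000).collect();

        c.bench_function("iter_sum", |b| {
            b.iter(|| black_box(&data).iter().sum::<u64>())
        });

        c.bench_function("manual_loop", |b| {
            b.iter(|| {
                let mut total = 0u64;
                for &x in black_box(&data).iter() {
                    total += x;
                }
                total
            })
        });
    }

    criterion_group!(benches, sum_benchmarks);
    criterion_main!(benches);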


#10

Nevermind, I guess there is a popular equivalent! Thanks.

Edit: And it has charts too?!


#11

Do not write benchmark code yourself; use Rust’s built-in bencher instead. It will try to run enough iterations for the test to be meaningful.
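
Roughly, a sketch with the built-in bencher looks like this (nightly-only; the data size is an arbitrary placeholder):

    // benches/sum.rs, run with `cargo bench` on a nightly toolchain.
    #![feature(test)]
    extern crate test;

    use test::{black_box, Bencher};

    #[bench]
    fn bench_iter_sum(b: &mut Bencher) {
        let data: Vec<u64> = (0..1_000_000).collect();
        // The bencher picks how many iterations to run for a meaningful result.
        b.iter(|| black_box(&data).iter().sum::<u64>());
    }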