Why is Sum::sum so much slower than a manual loop?

So I noticed this while profiling my path tracer. My dot-product was using sum internally and I was getting ~14 FPS - and with a normal loop, I was getting ~23 FPS.

This playground link shows the same issue (for me anyway): http://play.rust-lang.org/?gist=564c477ac1347b452ebd2ae784094875&version=stable&mode=release&edition=2015

This is in release mode too. I’d expect it to optimize down to roughly the same thing. Is there anything I can do to encourage that optimization? Or should I just… write my own loop?

1 Like

Performance on the playground doesn’t seem to be very stable. For example, just now I got:

Sum was 499998500001, calculation took: Duration { secs: 0, nanos: 422543 }
Loop sum was 499998500001, calculation took: Duration { secs: 0, nanos: 779491 }

On my own system, the results were still quite noisy but the average for sum was about 20% faster than the for loop.

Interesting… I suppose I’m glad it’s not a general problem with sum then. Though I’m not sure how to diagnose exactly why it’s so much slower in my case.

If you swap the order around the first one is still slower than the second one, so my bet is on cache shenanigans.

Edit: By “first one is still slower” I meant that the sum becomes the faster one and the loop becomes the slower one.

2 Likes

Let’s ask Godbolt!

A quick diff shows that the generated assembly is identical for both summation methods, label names aside.

14 Likes

In more complicated iterators, it might matter that Sum uses a fold instead of an open for-loop. This ought to benefit performance in most cases, or at least be neutral, but perhaps you’ve found an exception?

As a more general advice, I would suggest going for longer-lived microbenchmarks in the future. I personally always aim for something which runs for at least a couple of seconds. The reason is that the shorter your benchmark is, the more sensitive it gets to OS and hardware details:

  • A misplaced context switch in the OS can dramatically effect milisecond-scale timings.
  • Modern CPUs have an optimization for burst workloads where they temporarily operate over their nominal frequency (Intel calls this Turbo).
  • Conversely, some CPU features are so power-intensive that they cause CPUs to drop below their nominal frequency after operating for a while (most AVX impls are guilty of this).
  • As @quadrupleslap pointed out, it takes a while for caches (including branch predictors and the like) to warm up.

Microbenchmarking is generally a dark art, but the shorter your timings are, the more difficult it gets.

5 Likes

Python has timeit for stuff like this, but I don’t think there’s a (popular) equivalent for Rust.

1 Like

Isn’t that what criterion and the unstable cargo bench are supposed to do?

2 Likes

Nevermind, I guess there is a popular equivalent! Thanks.

Edit: And it has charts too?!

1 Like

Do not write benchmark code yourself — use Rust’s built-in brencher instead. It will try to run enough iterations for the test to be meaningful.

4 Likes