Why is Sum::sum so much slower than a manual loop?


#1

So I noticed this while profiling my path tracer. My dot product was using sum internally, and I was getting ~14 FPS; with a normal loop, I was getting ~23 FPS.

This playground link shows the same issue (for me anyway): http://play.rust-lang.org/?gist=564c477ac1347b452ebd2ae784094875&version=stable&mode=release&edition=2015

This is in release mode too. I’d expect it to optimize down to roughly the same thing. Is there anything I can do to encourage that optimization? Or should I just… write my own loop?
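
Roughly, the two shapes look something like this (a simplified sketch rather than my exact code; assume f32 slices):

    // Sketch: iterator-based dot product that goes through Sum::sum...
    fn dot_sum(a: &[f32], b: &[f32]) -> f32 {
        a.iter().zip(b.iter()).map(|(x, y)| x * y).sum()
    }

    // ...versus the same dot product written as an explicit loop.
    fn dot_loop(a: &[f32], b: &[f32]) -> f32 {
        let mut total = 0.0;
        for i in 0..a.len().min(b.len()) {
            total += a[i] * b[i];
        }
        total
    }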


#2

Performance on the playground doesn’t seem to be very stable. For example, just now I got:

Sum was 499998500001, calculation took: Duration { secs: 0, nanos: 422543 }
Loop sum was 499998500001, calculation took: Duration { secs: 0, nanos: 779491 }

On my own system the results were still quite noisy, but on average sum was about 20% faster than the for loop.


#3

Interesting… I suppose I’m glad it’s not a general problem with sum then. Though I’m not sure how to diagnose exactly why it’s so much slower in my case.


#4

If you swap the order around, the first one is still slower than the second one, so my bet is on cache shenanigans.

Edit: By “first one is still slower” I meant that the sum becomes the faster one and the loop becomes the slower one.


#5

Let’s ask Godbolt!

A quick diff shows that the generated assembly is identical for both summation methods, label names aside.
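
For anyone who wants to reproduce this, a pair of standalone functions along these lines can be pasted into Compiler Explorer with optimizations enabled (a sketch, not the exact playground code):

    // Sketch of the two summation methods as standalone functions for
    // Compiler Explorer; the u64 range stands in for the playground code.
    pub fn sum_iter(n: u64) -> u64 {
        (0..n).sum()
    }

    pub fn sum_loop(n: u64) -> u64 {
        let mut total = 0;
        for i in 0..n {
            total += i;
        }
        total
    }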


#6

With more complicated iterators, it might matter that Sum uses a fold internally instead of an open for loop. This ought to benefit performance in most cases, or at least be neutral, but perhaps you’ve found an exception?
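
Concretely, for the simple integer case Sum::sum boils down to a fold, so the two calls below should compute the same thing; this is a sketch of the equivalence rather than the exact standard-library source:

    // Sketch: the explicit Sum::sum call next to the fold it roughly
    // desugars to; not the exact std implementation.
    fn main() {
        let via_sum: u64 = (0..1_000_000u64).sum();
        let via_fold: u64 = (0..1_000_000u64).fold(0, |acc, x| acc + x);
        assert_eq!(via_sum, via_fold);
    }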


#7

As a more general piece of advice, I would suggest going for longer-lived microbenchmarks in the future. I personally always aim for something that runs for at least a couple of seconds. The reason is that the shorter your benchmark is, the more sensitive it gets to OS and hardware details:

  • A misplaced context switch in the OS can dramatically affect millisecond-scale timings.
  • Modern CPUs have an optimization for burst workloads where they temporarily operate above their nominal frequency (Intel calls this Turbo Boost).
  • Conversely, some CPU features are so power-intensive that they cause CPUs to drop below their nominal frequency after operating for a while (most AVX implementations are guilty of this).
  • As @quadrupleslap pointed out, it takes a while for caches (including branch predictors and the like) to warm up.

Microbenchmarking is generally a dark art, but the shorter your timings are, the more difficult it gets.
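
As a rough illustration of that “couple of seconds” idea (the time budget, workload size, and warm-up pass below are arbitrary placeholders, not recommendations):

    // Hand-rolled sketch: repeat the work until a couple of seconds have
    // elapsed, then report the average per-iteration time.
    use std::time::{Duration, Instant};

    fn main() {
        let data: Vec<u64> = (0..1_000_000).collect();

        // Warm-up pass so caches and branch predictors settle before timing.
        let _ = data.iter().sum::<u64>();

        let start = Instant::now();
        let mut iterations = 0u32;
        let mut checksum = 0u64;
        while start.elapsed() < Duration::from_secs(2) {
            checksum = checksum.wrapping_add(data.iter().sum::<u64>());
            iterations += 1;
        }
        // Printing the checksum keeps the compiler from optimizing the sums away.
        println!(
            "{} iterations, ~{:?} per iteration (checksum {})",
            iterations,
            start.elapsed() / iterations.max(1),
            checksum
        );
    }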


#8

Python has timeit for stuff like this, but I don’t think there’s a (popular) equivalent for Rust.


#9

Isn’t that what criterion and the unstable cargo bench are supposed to do?
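
For example, with criterion the comparison could look roughly like the sketch below (it also needs criterion in [dev-dependencies] and a [[bench]] entry with harness = false in Cargo.toml; the data size is an arbitrary placeholder):

    // benches/sum.rs: a sketch of a criterion benchmark comparing the two versions.
    use criterion::{black_box, criterion_group, criterion_main, Criterion};

    fn sum_benchmarks(c: &mut Criterion) {
        let data: Vec<u64> = (0..1_000_000).collect();

        c.bench_function("iter_sum", |b| {
            b.iter(|| black_box(&data).iter().sum::<u64>())
        });

        c.bench_function("manual_loop", |b| {
            b.iter(|| {
                let mut total = 0u64;
                for &x in black_box(&data).iter() {
                    total += x;
                }
                total
            })
        });
    }

    criterion_group!(benches, sum_benchmarks);
    criterion_main!(benches);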


#10

Nevermind, I guess there is a popular equivalent! Thanks.

Edit: And it has charts too?!


#11

Do not write benchmark code yourself; use Rust’s built-in bencher instead. It will try to run enough iterations for the test to be meaningful.
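
Roughly, a sketch with the built-in bencher looks like this (nightly-only; the data size is an arbitrary placeholder):

    // benches/sum.rs, run with `cargo bench` on a nightly toolchain.
    #![feature(test)]
    extern crate test;

    use test::{black_box, Bencher};

    #[bench]
    fn bench_iter_sum(b: &mut Bencher) {
        let data: Vec<u64> = (0..1_000_000).collect();
        // The bencher picks how many iterations to run for a meaningful result.
        b.iter(|| black_box(&data).iter().sum::<u64>());
    }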