So I noticed this while profiling my path tracer. My dot product was using sum internally and I was getting ~14 FPS; with a plain for-loop, I was getting ~23 FPS.
This playground link shows the same issue (for me anyway): Rust Playground
This is in release mode too. I'd expect it to optimize down to roughly the same thing. Is there anything I can do to encourage that optimization? Or should I just... write my own loop?
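For reference, the two variants being compared look roughly like this (a simplified sketch, not my actual code; the function names are made up for illustration):

```rust
// Hypothetical sketch of the two dot-product variants:
// one summing via the iterator adapter, one via an open loop.
// `a` and `b` are assumed to be equal-length slices.

fn dot_sum(a: &[f32], b: &[f32]) -> f32 {
    // Multiply pairwise, then sum through Iterator::sum.
    a.iter().zip(b.iter()).map(|(x, y)| x * y).sum()
}

fn dot_loop(a: &[f32], b: &[f32]) -> f32 {
    // Same computation written as an explicit accumulation loop.
    let mut acc = 0.0;
    for i in 0..a.len() {
        acc += a[i] * b[i];
    }
    acc
}
```

In principle both should compile to near-identical machine code in release mode, which is why the FPS gap surprised me.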
Interesting... I suppose I'm glad it's not a general problem with sum then. Though I'm not sure how to diagnose exactly why it's so much slower in my case.
In more complicated iterators, it might matter that Sum uses a fold instead of an open for-loop. This ought to benefit performance in most cases, or at least be neutral, but perhaps you've found an exception?
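To make the distinction concrete, summing through a fold versus an open loop looks roughly like this (a sketch of the shape of the code, not the exact standard-library source):

```rust
// Sketch: roughly what Iterator::sum does for floats (a fold),
// next to the equivalent open for-loop over the same iterator.

fn sum_via_fold<I: Iterator<Item = f32>>(iter: I) -> f32 {
    // The accumulator is threaded through a closure on each step.
    iter.fold(0.0, |acc, x| acc + x)
}

fn sum_via_loop<I: Iterator<Item = f32>>(iter: I) -> f32 {
    // The accumulator is a plain mutable local updated in a loop.
    let mut acc = 0.0;
    for x in iter {
        acc += x;
    }
    acc
}
```

For simple iterators the optimizer usually collapses both into the same loop, but a sufficiently complicated iterator chain could plausibly defeat that in one form and not the other.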
As more general advice, I would suggest going for longer-lived microbenchmarks in the future. I personally always aim for something that runs for at least a couple of seconds. The reason is that the shorter your benchmark is, the more sensitive it becomes to OS and hardware details:
A misplaced context switch in the OS can dramatically affect millisecond-scale timings.
Modern CPUs have an optimization for burst workloads where they temporarily operate above their nominal frequency (Intel calls this Turbo Boost).
Conversely, some CPU features are so power-intensive that they cause CPUs to drop below their nominal frequency after operating for a while (many AVX implementations are guilty of this).
As @quadrupleslap pointed out, it takes a while for caches (including branch predictors and the like) to warm up.
Microbenchmarking is generally a dark art, but the shorter your timings are, the more difficult it gets.
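One simple way to get a longer-lived measurement without pulling in a benchmarking crate is to repeat the workload until some minimum wall-clock time has elapsed and report the per-iteration average. A minimal sketch (the bench helper and its parameters are my own invention, not from any library):

```rust
use std::hint::black_box;
use std::time::Instant;

// Sketch of a longer-lived microbenchmark: keep re-running the workload
// until at least `min_secs` of wall-clock time has passed, then return
// the mean seconds per iteration.
fn bench<F: FnMut() -> f32>(mut f: F, min_secs: f64) -> f64 {
    let mut iters: u64 = 0;
    let start = Instant::now();
    while start.elapsed().as_secs_f64() < min_secs {
        // black_box keeps the optimizer from deleting the workload.
        black_box(f());
        iters += 1;
    }
    start.elapsed().as_secs_f64() / iters as f64
}
```

Running for a couple of seconds this way amortizes context switches and warm-up effects over many iterations, though for serious work a crate like criterion, which also does statistical analysis, is the usual recommendation.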