To me, it looks like the only relevant difference is that the version with `if` uses a branchless implementation for conditionally incrementing `i`: the comparison in line 473 sets the carry flag, which is then (ab-)used by the following add-with-carry to maybe increment `r14` (`i`). Why would this make a 3x performance difference? I have a theory, based on this bit of information:
I don't know what percentage "usually" and "sometimes" stand for, but it suggests the number of iterations in the inner loop is very predictable (there may also be patterns over multiple outer loop iterations that branch prediction can pick up). So why is this relevant? Because when you look at the work done in the `while n < 50_000_000` loop (I'll call that the outer loop from now on), you'll find that each iteration depends on the previous one (a rough code sketch of this chain follows the list):
- At the end of one iteration, `n += w[i];` depends on the current value of `i`.
- In the next iteration, this updated `n` is required to compute `min`.
- The decision whether to increment `i` again depends on `min`.
- The updated `i` is then needed for `n += w[i];`, which the following iteration depends on.
- (The computations related to `next_rn` and `next_rd` also depend on `n` and `i`, but they are not on the critical path because they're only used after the outer loop has finished.)
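To make that chain concrete, here's a rough Rust-flavored sketch of the loop's shape. This is a hypothetical reconstruction, not the original program: the arithmetic behind `min` (and the `next_rn`/`next_rd` updates) is made up, and I'm showing the increment as a single `if`; only the dependency edges matter here.

```rust
// Hypothetical sketch of the dependency structure, not the original code.
// Assumes `w` is long enough that `i` never runs off the end.
fn sketch(w: &[u64]) -> (u64, usize) {
    let (mut n, mut i) = (0u64, 0usize);
    while n < 50_000_000 {
        // needs the `n` written by the previous iteration; stand-in for the
        // real computation, which includes an integer division
        let min = n / 3 + 1;
        // whether `i` gets bumped depends on `min`, and the load `w[i]`
        // depends on the current `i` (the other variant repeats this check
        // in a small inner loop)
        if w[i] < min {
            i += 1;
        }
        // the next iteration's `n` depends on the possibly-updated `i`
        n += w[i];
        // `next_rn` / `next_rd` would also be updated from `n` and `i` here,
        // but the loop never reads them back, so they branch off the chain
        // instead of extending it
    }
    (n, i)
}
```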
In the original code, if the branch predictor managed to predict whether `i` would be incremented in any given outer loop iteration, the CPU could start a lot of work already knowing the value of `i`, without having to wait for most of the previous loop iteration. Sure, it has to compute the `n += w[i]` and everything that follows eventually, but it can overlap and interleave most of the work for multiple subsequent iterations as long as it keeps correctly predicting the inner loop. So when the outer loop runs for many iterations, there's more independent work for the CPU's many functional units.
In contrast, the new version compiles to a branch-free `i += 1` that uses the `w[i] < min` comparison as input data. The upside and downside of branch-free code in general is that it turns control flow dependencies into data dependencies. This can be preferable for unpredictable branches, but it can also be disastrous when branch prediction worked very well and the data dependency structure is unfavorable to modern CPUs. This is the reason why compilers are often so reluctant to emit branch-free code!
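In source terms, the two flavours of the increment look roughly like this (a sketch, not the original code):

```rust
// Branchy form: the comparison feeds a conditional jump. If the jump is
// predictable, the CPU can speculate past it without waiting for `min`.
pub fn bump_branchy(w: &[u64], min: u64, i: usize) -> usize {
    if w[i] < min { i + 1 } else { i }
}

// Branch-free form (what `cmp` + `adc` implements): the comparison result is
// data, so the new `i` can't exist before `min` and `w[i]` are both known.
pub fn bump_branchless(w: &[u64], min: u64, i: usize) -> usize {
    i + (w[i] < min) as usize
}
```

Of course, nothing stops the compiler from turning the first form into the second, which is exactly what happened here.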
In this case, the `i += 1` (cmp + adc) in the next iteration can't start before the new `min` is computed, which involves a bunch of integer arithmetic including a division, and that in turn can't start before `n += w[i]` has finished, which depends on the previous `i += 1`, ad infinitum. So here you have a long critical path through both integer division (generally 10+ cycles of latency, much worse on older CPUs) and memory accesses (likely L1 hits, but that's still 4-6 cycles of latency). The result is that the CPU is mostly idle, in most clock cycles only getting a drip feed of new work because it often has to wait for a single instruction to finish before any more can be started.
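To get a feel for how punishing such a chain is on its own, here's a toy experiment of mine (unrelated to the original program): the same number of integer divisions, once as a single dependent chain and once split into four independent chains. On a recent x86 core the independent version should be several times faster, because it's limited by divider throughput rather than by latency.

```rust
use std::hint::black_box;
use std::time::Instant;

// One long chain: every division needs the previous quotient (latency-bound).
fn dependent(iters: u32) -> u32 {
    let d = black_box(3u32); // runtime divisor so a real `div` is emitted
    let mut x = black_box(u32::MAX);
    for _ in 0..iters {
        x = x / d + 12_345; // the add keeps the value from collapsing to zero
    }
    x
}

// Same number of divisions, split into four chains the CPU can overlap
// (throughput-bound instead of latency-bound).
fn independent(iters: u32) -> u32 {
    let d = black_box(3u32);
    let mut a = black_box(u32::MAX);
    let mut b = black_box(u32::MAX - 1);
    let mut c = black_box(u32::MAX - 2);
    let mut e = black_box(u32::MAX - 3);
    for _ in 0..iters / 4 {
        a = a / d + 12_345;
        b = b / d + 12_345;
        c = c / d + 12_345;
        e = e / d + 12_345;
    }
    a ^ b ^ c ^ e
}

fn main() {
    let iters = 400_000_000;
    let t = Instant::now();
    black_box(dependent(black_box(iters)));
    println!("dependent:   {:?}", t.elapsed());
    let t = Instant::now();
    black_box(independent(black_box(iters)));
    println!("independent: {:?}", t.elapsed());
}
```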
On my machine, the performance difference is 1.87x instead of 3x, but `perf stat` confirms that the difference is due to lower instructions per cycle, even though the faster version retires more instructions and more branches. Here are the relevant metrics (for the versions on godbolt):
```
$ perf stat ./nim-if

    69,839,079,505      cycles           #    3.946 GHz
    47,419,353,487      instructions     #    0.68  insn per cycle
         4,853,966      branch-misses    #    0.09% of all branches

      17.699183047 seconds time elapsed

$ perf stat ./nim-while

    37,422,607,513      cycles           #    3.933 GHz
    57,932,310,046      instructions     #    1.55  insn per cycle
        55,566,801      branch-misses    #    0.66% of all branches

       9.515092061 seconds time elapsed
```
(I'll note that, on my machine, the IPC for the faster version is still disappointing. This is a Skylake CPU; it could reach up to 4 IPC under ideal conditions, and 2-3 is totally feasible for many compute-heavy workloads.)
For a more thorough introduction to this topic, see A whirlwind introduction to dataflow graphs. That blog's later series on "Entropy Coding in Oodle" is also a great worked example of how massaging dependency structure and instruction mix can make a huge difference on modern CPUs.
I hope this answers why the change makes the code so much slower. I predict that the source code change wouldn't have any performance effect if you could convince the compiler to keep it a branch instead of emitting something like `cmp` + `adc`. But this is easier said than done. It's more common that people have to fight the compiler to get branch-free codegen, and that, too, is rarely successful. You could try profile-guided optimization, since LLVM sometimes uses branch weight information for these decisions. There are also the likely/unlikely intrinsics on nightly (to influence the same branch weights without PGO), though I've heard that they're not super reliable.
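For completeness, the nightly route looks roughly like this. Caveats: `core_intrinsics` is a perma-unstable, internal feature, the hint is only a suggestion to LLVM, and which branch direction is actually the common case in your data is a guess on my part.

```rust
// Nightly-only sketch: nudge LLVM's branch weights so it keeps a real branch.
#![feature(core_intrinsics)]
#![allow(internal_features)] // core_intrinsics is an internal feature

use std::intrinsics::likely;

pub fn step(w: &[u64], min: u64, i: usize) -> usize {
    // Guessing that "don't increment" is the common case; hint the other arm
    // with `unlikely` if it's the opposite in your data. This feeds the same
    // branch weight metadata PGO would produce, but LLVM is free to ignore it.
    if likely(w[i] >= min) {
        i
    } else {
        i + 1
    }
}
```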