Performance degradation when changing while loop to if statement

I'm trying to optimize a program for a numerical simulation related to some one-pile Nim variant. The details aren't really important. Here's a stripped-down version of the code:

fn main() {
    let (mut rn, mut rd) = (1, 1);
    for _ in 1..500_000 {
        (rn, rd) = next(rn, rd);
    }
    println!("{rn}/{rd} {}", rn as f64 / rd as f64);
}

fn next(rn: i64, rd: i64) -> (i64, i64) {
    let mut n = rn / rd + 1;
    let mut w: Vec<i64> = (1..=n).collect();
    let mut i = 1;
    let (mut next_rn, mut next_rd) = (rn + rd, rd);
    while n < 50_000_000 {
        let min = (n * rd + rn - 1) / rn;
        while w[i] < min {
            i += 1;
        }
        if n * next_rd < next_rn * w[i - 1] {
            (next_rn, next_rd) = (n, w[i - 1]);
        }
        n += w[i];
        w.push(n);
    }
    let d = num::integer::gcd(next_rn, next_rd);
    (next_rn / d, next_rd / d)
}

The code above runs in about 4 seconds on my laptop (measured with cargo build --release; time target/release/<bin_name>).

Due to the mathematical properties of the Nim game, it's possible to prove that the inner while loop always runs either zero times or once (and empirically that's true as well; it's usually once and sometimes zero times). This means the inner while loop could be written as a simple if, which I'd prefer, since I'd like to add an else branch for further optimizations. Surprisingly, changing while w[i] < min to if w[i] < min increases the runtime of the whole program by a factor of 3: it then takes 12 seconds instead of 4. (I verified with manual instrumentation that every line in the code is still executed exactly the same number of times, and the result is still the same.)
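
For concreteness, this is the entire change (the rest of next() stays exactly as above):

        // while version (runs in ~4 s):
        while w[i] < min {
            i += 1;
        }

        // if version (runs in ~12 s):
        if w[i] < min {
            i += 1;
        }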

I first suspected that the slow-down may be caused by some degradation in branch prediction. However, using profile-guided optimization doesn't help at all, so I don't think it's branch prediction. I also tried understanding the difference by looking at the assembly code, but the assembly was too complex (at least for me).

Why does this change make the code so much slower? Is there any way I can use an if there without making the code slow down so much?

(Target tuple x86_64-unknown-linux-gnu, cargo 1.81.0-nightly (3ed207e41 2024-06-18), behaviour on stable is the same, no unusual settings in Cargo.toml, no additional rustc flags)

Why does this change make the code so much slower?

The only way to really know is to study the assembly.

One thing that comes to mind, though, is that while w[i] < min guarantees that after it ends, w[i] >= min, but if w[i] < min doesn't. But that's not it, because I tried this:

        if w[i] < min {
            i += 1;
        }
        if w[i] < min {
            unsafe {
                std::hint::unreachable_unchecked();
            }
        }

which would probably have recovered the original performance if that were the exact reason, but it didn't. Still, that's the kind of thing that can matter.

1 Like

Being able to assume w[i] >= min after the check doesn't seem useful for any optimization of the remaining code, so this didn't look like a plausible cause to me (and I had tried that already anyway). Here are a few other things I tried:

  • Disabling bounds checking for vector indexing by using get_unchecked().

  • Using unsigned integers so division can't overflow.

  • Adding hints to allow the compiler to assume we never divide by zero.

Nothing so far has had any significant effect.
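
For illustration, here's roughly what the unchecked-indexing variant of the inner check looks like (a sketch of the get_unchecked() experiment; the safety argument is the empirically verified invariant that i stays in bounds, not something the compiler checks):

        // SAFETY: relies on the invariant (verified by the instrumentation above)
        // that i never reaches w.len() while the outer loop is running.
        if unsafe { *w.get_unchecked(i) } < min {
            i += 1;
        }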

Here's the minimal assembly diff I was able to get, using all of the things mentioned above to remove all the panicking code: Compiler Explorer

Most of the differences seem to be renamed labels and different register allocations. The code of the inner loop is quite different, though. I can neither tell what's causing the compiler to emit the different code nor why one version is so much slower than the other.

1 Like

It turns out that while the MIR essentially has the differences you'd expect for the two versions, the LLVM IR is already quite different. I assumed that the change somehow trips up LLVM's optimization pipeline, but it looks like the relevant difference is actually happening in rustc.

I would suggest trying to play with the likely and unlikely hints.

If the compiler treats that if as an "exceptional case", then it's plausible that the if version would be slower, but a 3x slowdown is still extreme.

It would be a good idea to file a bug against rustc; optimizations are not guaranteed, but such a huge difference is worth investigating.
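
For example, something like this (a nightly-only sketch; std::intrinsics::likely requires the core_intrinsics feature, and whether it actually influences this particular codegen decision isn't guaranteed):

#![feature(core_intrinsics)] // nightly only, at the crate root

// Hint that the increment is the common case.
if std::intrinsics::likely(w[i] < min) {
    i += 1;
}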

To me, it looks like the only relevant difference is that the version with if uses a branchless implementation for conditionally incrementing i: the comparison in line 473 sets the carry flag, which is then (ab-)used by the following add-with-carry to maybe increment r14 (i). Why would this make a 3x performance difference? I have a theory, based on your remark that the inner loop usually runs once and sometimes zero times.

I don't know what percentages "usually" and "sometimes" stand for exactly, but it suggests the number of iterations in the inner loop is very predictable (there may also be patterns over multiple outer loop iterations that branch prediction can pick up). So why is this relevant? Because when you look at the work done in the while n < 50_000_000 loop (I'll call that the outer loop from now on), you'll find that each iteration depends on the previous one (sketched in code after this list):

  • At the end of one iteration, n += w[i]; depends on the current value of i.
  • In the next iteration, this updated n is required to compute min.
  • The decision whether to increment i again depends on min.
  • The updated i is then needed for n += w[i];, which the following iteration depends on.
  • (The computations related to next_rn and next_rd also depend on n and i, but they're not on the critical path because they're only used after the outer loop has finished.)
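
In code terms (the if form, with the next_rn/next_rd bookkeeping omitted since it's off the critical path):

    while n < 50_000_000 {
        let min = (n * rd + rn - 1) / rn; // needs n from the previous iteration, and includes a division
        if w[i] < min {                   // needs min, plus a load of w[i] using the previous i
            i += 1;                       // the updated i feeds the two lines below...
        }
        n += w[i];                        // ...and this new n feeds the next iteration's min
        w.push(n);
    }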

In the original code, if the branch predictor managed to predict whether i would be incremented in any given outer loop iteration, the CPU could start a lot of work already knowing the value of i without having to wait for most of the previous loop iteration. Sure, it has to compute the n += w[i] and everything that follows eventually, but it can overlap and interleave most of the work for multiple subsequent iterations as long as it keeps correctly predicting the inner loop. So when the outer loop runs for many iterations, there's more independent work for the CPU's many functional units.

In contrast, the new version compiles to a branch-free i += 1 that uses the w[i] < min comparison as input data. The upside and downside of branch-free code in general is that it turns control flow dependencies into data dependencies. This can be preferable for unpredictable branches, but it can also be disastrous when branch prediction worked very well and the data dependency structure is unfavorable to modern CPUs. This is the reason why compilers are often so reluctant to emit branch-free code!
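
In source terms, the branch-free form the compiler picked behaves roughly like this (my paraphrase of the cmp + adc sequence, not actual compiler output):

        // The comparison result is consumed as data (0 or 1), so there's no branch to predict,
        // but the new i now depends on the load of w[i] and on min.
        i += (w[i] < min) as usize;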

In this case, the i += 1 (cmp + adc) in the next iteration can't start before the new min is computed, which involves a bunch of integer arithmetic including a division, and can't start before n += w[i] has finished, which in turn depends on the previous i += 1, ad infinitum. So here you have a long critical path through both integer division (generally 10+ cycles of latency, much worse on older CPUs) and memory accesses (likely L1 hit, but that's still 4-6 cycles of latency). The result is that the CPU is mostly idle in most clock cycles, only getting a drip feed of new work because it often has to wait for a single instruction to finish before any more can be started.

On my machine, the performance difference is 1.87x instead of 3x, but perf stat confirms that the difference is due to lower instructions per cycle, despite the faster version retiring more instructions and more branches. Here are the relevant metrics (for the versions on godbolt):

$ perf stat ./nim-if
    69,839,079,505      cycles                    #    3.946 GHz
    47,419,353,487      instructions              #    0.68  insn per cycle
         4,853,966      branch-misses             #    0.09% of all branches

      17.699183047 seconds time elapsed
$ perf stat ./nim-while
    37,422,607,513      cycles                    #    3.933 GHz
    57,932,310,046      instructions              #    1.55  insn per cycle
        55,566,801      branch-misses             #    0.66% of all branches

       9.515092061 seconds time elapsed

(I'll note that, on my machine, the IPC for the faster version is still disappointing. This is a Skylake CPU; it could reach up to 4 IPC under ideal conditions, and 2-3 is entirely feasible for many compute-heavy workloads.)

For a more thorough introduction to this topic, see A whirlwind introduction to dataflow graphs. That blog's later series on "Entropy Coding in Oodle" is also a great worked example of how massaging dependency structure and instruction mix can make a huge difference on modern CPUs.

I hope this answers why the change makes the code so much slower. I predict that the source code change wouldn't have any performance effect if you could convince the compiler to keep it a branch instead of emitting something like cmp + adc. But this is easier said than done. It's more common that people have to fight the compiler to get branch-free codegen, and that's also rarely successful. You could try profile-guided optimization, since LLVM sometimes uses branch weight information for these decisions. There's also the likely/unlikely intrinsics on nightly (to influence the same branch weights without PGO), though I've heard that they're not super reliable.

4 Likes

It's hard to do in a sane, production-ready fashion, but for experiments you can simply use asm!("inc {i}", i = inout(reg) i); instead of i += 1;.

This wouldn't produce optimal code, but it would most definitely produce a branch, and the result shouldn't be too far from optimal.

Should be enough to verify your hypothesis.
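
In context, the experiment might look like this (a sketch; asm! needs use std::arch::asm, and because the inline assembly is opaque to LLVM, the surrounding if has to stay a real branch):

use std::arch::asm;

// ... inside the loop, replacing `i += 1;`:
if w[i] < min {
    unsafe {
        // Increments the register holding i; LLVM can't fold this into a cmov/adc.
        asm!("inc {i}", i = inout(reg) i);
    }
}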

1 Like

@hanna-kruppe Wow, thanks a lot for your thorough answer and the pointers to more information! I was completely unaware that branchless code could be that much slower.

Profile-guided optimization actually was the first thing I tried (as mentioned in my original post), but it made zero difference for either of the two cases. From this I concluded that the cause probably isn't branch prediction, but I simply don't know much about this.

I'll do experiments informed by your answer after work.

The asm! trick actually mostly works. Using it in the while version does not impact performance at all, and using it in the if version makes that version almost as fast as the while version (still about 10 percent slower).

Sorry, I had forgotten about that by the time I got to the end of my reply. There are several places in LLVM that have a say in whether code fragments like this end up as a branch or not (including one pass late in the x86 backend), but last time I looked into it, only some of them were considering branch weight metadata. It's also possible that the weights get dropped by another optimization pass before the point where they could make a difference (e.g., if the branch gets converted to a select early on). Alternatively, some heuristics may see the branch weights but not consider them conclusive enough to prefer a branch (the branch may be much more predictable to the CPU than the taken/not-taken counts suggest).
