Small code change greatly affects running time

I ran into a very confusing issue where a small change in the main() function significantly changes the performance of another function.

use solver::knapsack::knapsack;

fn f() {
  let n = 20000u64;
  println!(
    "{:?}",
    knapsack(&vec![1; n as usize], &vec![1; n as usize], n / 2, n)
  );
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
  f();
  Ok(())
}

gives

time binaries/slow 
Some(10000)

real	0m0.374s
user	0m0.369s
sys	0m0.004s

whereas

use solver::knapsack::knapsack;

fn f() {
  let n = 20000u64;
  println!(
    "{:?}",
    knapsack(&vec![1; n as usize], &vec![1; n as usize], n / 2, n)
  );
}

fn main() {
  f();
}

gives

time binaries/fast
Some(10000)

real	0m0.238s
user	0m0.230s
sys	0m0.007s

I uploaded the full example repo here: code.zip - Google Drive. In both binaries, the expensive function knapsack() has the same assembly code, just at different addresses. perf reports a mov as the most expensive instruction in the slow binary, taking 24% of the time, whereas the same instruction takes only 5% of the time in the fast binary.
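
(For anyone who wants to reproduce the per-instruction numbers: something like the following perf record / perf annotate workflow shows them; the output file names are arbitrary.)

# Sample cycles in each binary, then inspect the annotated disassembly of
# knapsack() in perf's TUI (perf usually demangles the Rust symbol names).
perf record -e cycles -o slow.data binaries/slow
perf annotate -i slow.data
perf record -e cycles -o fast.data binaries/fast
perf annotate -i fast.data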

I also tried running perf stat. It consistently reports about 3 instructions per cycle for the slow binary and about 5 instructions per cycle for the fast binary.

perf stat -B -e cache-references,cache-misses,cycles,instructions,branches,faults,migrations,alignment-faults binaries/slow
Some(10000)

 Performance counter stats for 'binaries/slow':

            309874      cache-references                                                      
             84688      cache-misses                     #   27.33% of all cache refs         
        1210412268      cycles                                                                
        3906891507      instructions                     #    3.23  insn per cycle            
        1201479244      branches                                                              
               178      faults                                                                
                 0      migrations                                                            
                 0      alignment-faults                                                      

       0.371487174 seconds time elapsed

       0.367595000 seconds user
       0.002973000 seconds sys
perf stat -B -e cache-references,cache-misses,cycles,instructions,branches,faults,migrations,alignment-faults binaries/fast
Some(10000)

 Performance counter stats for 'binaries/fast':

            319269      cache-references                                                      
             66133      cache-misses                     #   20.71% of all cache refs         
         758630423      cycles                                                                
        3905954725      instructions                     #    5.15  insn per cycle            
        1201304136      branches                                                              
               176      faults                                                                
                 0      migrations                                                            
                 0      alignment-faults                                                      

       0.229591012 seconds time elapsed

       0.226624000 seconds user
       0.001962000 seconds sys

I suspect it has something to do with memory layout, but my knowledge of how CPUs work is limited. Any advice or suggestions on how to debug this would be really appreciated!

rustc --version
rustc 1.85.0 (4d91de4e4 2025-02-17) (Arch Linux rust 1:1.85.0-1)
uname -a
Linux machine 6.13.6-hardened1-1-hardened #1 SMP PREEMPT_DYNAMIC Sat, 08 Mar 2025 22:45:14 +0000 x86_64 GNU/Linux

You might be interested in the famous Stabilizer paper. Here's a talk: https://www.youtube.com/watch?v=8Ph1l-_0PeI (I don't know if that's the best one, but it's the first one that came up)

Basically, performance measurement is a crapshoot, and gets worse the shorter the thing you try to measure.
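
One cheap way to at least make the noise visible: perf stat can repeat the run and report the spread across runs (the -r flag is a stock perf option, nothing specific to this project):

# Run each binary 10 times; perf prints the mean and the +- run-to-run spread.
perf stat -r 10 binaries/slow
perf stat -r 10 binaries/fast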


Thanks! That makes sense, but is there a way to find out exactly what causes the performance change? And is there a way to force the compiler to always produce the best-performing code?

As the talk points out, it can be exactly the same binary and yet, depending on how the OS loads it, end up being up to 2× slower. The compiler can't fix that.

Here's a webpage for the research (including slides and the paper).
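
If you want a rough sense of how much of that is load-address luck on your own machine, one Linux-only experiment is to run with address-space layout randomization disabled for individual runs and compare timings; setarch --addr-no-randomize is a standard util-linux option:

# Run each binary with ASLR disabled for just that process.
setarch "$(uname -m)" --addr-no-randomize binaries/slow
setarch "$(uname -m)" --addr-no-randomize binaries/fast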


Not that this in any way undermines the point you were making, but I've always been pretty skeptical of some of the claims in the Stabilizer paper. It's been a while since I took a statistics class, but the way the authors apply statistics to measuring software performance seems off to me:

  1. Computers are highly deterministic. If you see little variance in the performance of a particular binary on a particular machine, then that's because that is the performance of that particular binary on that particular machine. The authors point out that program layout has a surprisingly large impact on performance, but that doesn't mean that randomizing the layout of a program when measuring will somehow make the measurement more representative, only more robust to future layout changes.
  2. Stabilizer works by inserting padding into stack frames and copying the code of functions around to random places in memory while the program is running. This strikes me as dramatically different to how programs run in actual practice, but the authors don't give a satisfying reason why this wouldn't bias the data in some way. They acknowledge that Stabilizer has an impact on performance, but assume that its impact is independent of the impact of the optimizations they're measuring, which strikes me as unlikely.
  3. This is the one that raises the most alarm bells for me: The paper concludes that the impact of -O3 vs -O2 on performance is not statistically significant, but statistical significance is a function of the number of samples, and you can crank it arbitrarily high simply by taking more samples. The paper doesn't state how many samples the authors took, nor how they chose how many samples to take.
  4. It's a small thing, but the paper also seems weirdly obsessed with normal distributions. As I understand it, the distribution of the mean of a set of samples from any distribution always trends towards normal as the number of samples increases, so it doesn't seem critical to the analysis that the source distribution be normal as well.

I could be way off base, though; I'm definitely no expert.

Even if the chip is perfectly deterministic -- which is less and less true with things like Turbo Boost and Precision Boost Overdrive that change performance depending on the current cooling situation -- we then run the programs in operating systems that intentionally add non-determinism, like address space layout randomization (ASLR).

And even if your machine is perfectly consistent about it, you probably want to know that you're not just optimizing for, say, the current length of your environment variables. If it's faster, but only for people whose home directory path has length 7 mod 16, is that really a speedup you actually care about?
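
(If you want to see the environment-size effect for yourself, a crude experiment is to time the same binary with an empty environment and with a few KiB of padding in it; PAD is just a throwaway variable name:)

# Same binary, same arguments; only the size of the environment block differs.
time env -i binaries/fast
time env -i PAD=$(head -c 4096 /dev/zero | tr '\0' x) binaries/fast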

I think it was this forum where someone traced a difference down to the exact same sequence of assembly instructions running at different speeds depending on whether the loop header happened to land on a cache line boundary, for example.
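
(A cheap way to spot that kind of layout shift without perf: dump the address of the hot symbol in both binaries and look at it modulo the 64-byte cache line size. It won't show you the loop header itself, but it tells you whether the code moved relative to cache lines; knapsack is just the function from the post above.)

# Compare where the hot function starts in each binary (addresses are hex);
# address % 64 tells you where it sits within a cache line.
nm -C binaries/slow | grep knapsack
nm -C binaries/fast | grep knapsack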


True, which is why I hedged with the "if you see little variance" bit :P

Also true! Just because something is deterministic doesn't mean you can't accidentally "over-fit" an optimization to your current setup. That doesn't mean that randomizing on the very specific axis that Stabilizer randomizes on is going to be representative of the "randomness" introduced by hardware and OS configuration. Stabilizer's not exactly launching the program with random environment variables or home directory path names.