I ran into a very confusing issue where a small change in the main()
function significantly changes the performance in another function.
use solver::knapsack::knapsack;
fn f() {
let n = 20000u64;
println!(
"{:?}",
knapsack(&vec![1; n as usize], &vec![1; n as usize], n / 2, n)
);
}
fn main() -> Result<(), Box<dyn std::error::Error>> {
f();
Ok(())
}
gives
time binaries/slow
Some(10000)
real 0m0.374s
user 0m0.369s
sys 0m0.004s
whereas
use solver::knapsack::knapsack;
fn f() {
let n = 20000u64;
println!(
"{:?}",
knapsack(&vec![1; n as usize], &vec![1; n as usize], n / 2, n)
);
}
fn main() {
f();
}
gives
time binaries/fast
Some(10000)
real 0m0.238s
user 0m0.230s
sys 0m0.007s
I uploaded the full example repo here: code.zip - Google Drive . In both binaries, the expensive function knapsack()
has the same assembly code except for different addresses. perf
reports a mov
as the most expensive instruction in the slow binary taking 24% of the time whereas the same instruction takes only 5% of time in the fast binary.
I also tried running perf stat
. It consistently reports 3 instructions per second for the slow binary and 5 instructions per second for the fast binary.
perf stat -B -e cache-references,cache-misses,cycles,instructions,branches,faults,migrations,alignment-faults binaries/slow
Some(10000)
Performance counter stats for 'binaries/slow':
309874 cache-references
84688 cache-misses # 27.33% of all cache refs
1210412268 cycles
3906891507 instructions # 3.23 insn per cycle
1201479244 branches
178 faults
0 migrations
0 alignment-faults
0.371487174 seconds time elapsed
0.367595000 seconds user
0.002973000 seconds sys
perf stat -B -e cache-references,cache-misses,cycles,instructions,branches,faults,migrations,alignment-faults binaries/fast
Some(10000)
Performance counter stats for 'binaries/fast':
319269 cache-references
66133 cache-misses # 20.71% of all cache refs
758630423 cycles
3905954725 instructions # 5.15 insn per cycle
1201304136 branches
176 faults
0 migrations
0 alignment-faults
0.229591012 seconds time elapsed
0.226624000 seconds user
0.001962000 seconds sys
I suspect it has something to do with memory layout, but my knowledge about how CPUs work is limited. Any advice or suggestions how to debug this are really appreciated!
rustc --version
rustc 1.85.0 (4d91de4e4 2025-02-17) (Arch Linux rust 1:1.85.0-1)
uname -a
Linux machine 6.13.6-hardened1-1-hardened #1 SMP PREEMPT_DYNAMIC Sat, 08 Mar 2025 22:45:14 +0000 x86_64 GNU/Linux