Why is my program twice as fast when run as "cargo run --release" than as "./target/release/myProg"?

I have a simple little FFT program in Rust: https://github.com/ZiCog/fftbench

Today I find that when I compile and run it on a Raspberry Pi 3B+ the execution time depends very much on whether I run it as "cargo run --release" or "./target/release/myProg".

For example:

$ cargo run --release
    Finished release [optimized] target(s) in 0.09s
     Running `target/release/fftbench-rust`
fft_bench v1.2
Freq.    Magnitude
0x000000 0x0001fe
0x0000c0 0x0001ff
0x000140 0x0001ff
0x000200 0x0001ff
1024 point bit-reversal and butterfly run time = 190us


$ ./target/release/fftbench-rust
fft_bench v1.2
Freq.    Magnitude
0x000000 0x0001fe
0x0000c0 0x0001ff
0x000140 0x0001ff
0x000200 0x0001ff
1024 point bit-reversal and butterfly run time = 423us

What is going on here? It seems "cargo run --release" is not running the same binary as "./target/release/fftbench-rust".

Did you make sure to compile before running?

Beyond that though, cargo run will check to see if things are compiled, and then compile them if they aren't, and so it's inherently gonna be slower than running the binary directly; it does some work first.

This seems to be irrelevant here, since cargo run is in fact not slower than running the binary directly, but faster.

The variance here is only about 230 microseconds, and the Raspberry Pi is pretty underpowered. This may not be a statistically meaningful difference.

Here are a few things to consider:

  • Do you consistently get that same variance across runs?
  • Is your program deterministic or can the amount of work it does vary across runs?
  • Does your program do any kind of operation that could be affected by OS level caching or other such factors?
    • For example: opening a file or doing IPC
  • What other programs are running in the background?


I would guess CPU frequency scaling, where it ramps up during the 0.09s time taken by cargo, and then when your program starts it's already at top speed.
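One way to check this guess on Linux is to read the current core clock just before the timed section. The sketch below assumes the kernel exposes the cpufreq sysfs file `scaling_cur_freq` (reported in kHz), which is usual on Raspberry Pi OS but not guaranteed on every system; the function names `parse_khz` and `current_cpu_khz` are made up for illustration.

```rust
use std::fs;

/// Parses the contents of a cpufreq sysfs file (a decimal number of kHz,
/// normally followed by a newline).
pub fn parse_khz(text: &str) -> Option<u64> {
    text.trim().parse().ok()
}

/// Returns the current frequency of the given CPU core in kHz, or `None`
/// if the sysfs file is missing or unreadable (for example on a non-Linux
/// system or a kernel without cpufreq support).
pub fn current_cpu_khz(cpu: usize) -> Option<u64> {
    // Assumption: the standard Linux cpufreq sysfs layout.
    let path = format!(
        "/sys/devices/system/cpu/cpu{}/cpufreq/scaling_cur_freq",
        cpu
    );
    let text = fs::read_to_string(path).ok()?;
    parse_khz(&text)
}
```

Printing `current_cpu_khz(0)` before and after the timed section in both launch modes should show whether the clock is still ramping up when the binary is started directly.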



BINGO! Yes, I forgot about the frequency scaling issue. I changed the timing loop to run indefinitely. Sure enough it settles down to 190us, give or take a lot of noise, in both cases.

I forgot to mention my original attempt actually ran the thing 10 times. Always the first run or two are much slower, presumably as caches and such warm up. It times itself using the "time" crate so the run time of cargo and such is not involved. The test data is always the same.
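A minimal sketch of the warm-up idea, for anyone hitting the same thing: run the workload a few times untimed, then time the following runs and look at the median rather than the first sample. This is not the code from fftbench; it uses `std::time::Instant` instead of the "time" crate, and `time_after_warmup`, `median` and `dummy_work` are hypothetical names.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Runs `work` `warmup` times without timing it, then times `samples`
/// further runs and returns each individual duration.
pub fn time_after_warmup<F: FnMut()>(
    warmup: usize,
    samples: usize,
    mut work: F,
) -> Vec<Duration> {
    // Untimed runs give caches and the CPU frequency governor time to settle.
    for _ in 0..warmup {
        work();
    }
    let mut times = Vec::with_capacity(samples);
    for _ in 0..samples {
        let start = Instant::now();
        work();
        times.push(start.elapsed());
    }
    times
}

/// Returns the median sample (the upper median for an even count),
/// or `None` if there are no samples.
pub fn median(mut times: Vec<Duration>) -> Option<Duration> {
    if times.is_empty() {
        return None;
    }
    times.sort();
    Some(times[times.len() / 2])
}

/// Hypothetical stand-in for the real workload (the FFT in this thread).
pub fn dummy_work() {
    let mut acc = 0u64;
    for i in 0..10_000u64 {
        // black_box keeps the optimizer from removing the loop.
        acc = acc.wrapping_add(black_box(i).wrapping_mul(i));
    }
    black_box(acc);
}
```

Something like `median(time_after_warmup(20, 100, dummy_work))` then gives a figure that should be much less sensitive to how the binary was launched. The warm-up and sample counts here are arbitrary.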

Thanks all, mystery solved.


You might find this video interesting.


I would also recommend checking out the criterion crate if you want more reliable benchmarks. It's explicitly designed to deal with this sort of thing, plus it gives you pretty graphs and such for free.


Thanks for the suggestions. I have been down the rabbit hole of benchmarking/optimizing many times over the years and soon found those resources.

Typically though it ends in frustration. After much inspection, profiling, tweaking, trying different algorithms, and random tweaking, one ends up in a situation where the biggest gains have already been achieved and one is scratching around for single percentage improvements. Then one finds that seemingly benign changes make a difference for no fathomable reason (as the video above explains well). Then, for extra frustration, one finds that what improves performance on one platform makes it worse on another, as I found recently when moving code between Raspberry Pi and PC.

