Today I found that when I compile and run it on a Raspberry Pi 3B+, the execution time depends very much on whether I run it with "cargo run --release" or "./target/release/myProg".
For example:
$ cargo run --release
Finished release [optimized] target(s) in 0.09s
Running `target/release/fftbench-rust`
fft_bench v1.2
Freq. Magnitude
0x000000 0x0001fe
0x0000c0 0x0001ff
0x000140 0x0001ff
0x000200 0x0001ff
1024 point bit-reversal and butterfly run time = 190us
vs:
$ ./target/release/fftbench-rust
fft_bench v1.2
Freq. Magnitude
0x000000 0x0001fe
0x0000c0 0x0001ff
0x000140 0x0001ff
0x000200 0x0001ff
1024 point bit-reversal and butterfly run time = 423us
What is going on here? It seems "cargo run --release" is not running the same binary as "./target/release/fftbench-rust".
Beyond that though, cargo run will check whether things are compiled, and compile them if they aren't, so it's inherently going to be slower than running the binary directly; it does some work first.
I would guess CPU frequency scaling: the clock ramps up during the 0.09s that cargo takes, so by the time your program starts it's already at top speed.
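One way to test that theory is to spin the workload for a moment so the governor has already ramped the clock before the timed section starts. This is just a sketch; run_fft here is a hypothetical stand-in for your actual bit-reversal/butterfly code.

```rust
use std::time::Instant;

// Hypothetical placeholder for the FFT under test.
fn run_fft() {
    // bit-reversal and butterfly passes would go here
}

fn main() {
    // Warm-up: run the workload for ~200ms so the frequency governor
    // ramps the CPU up and the caches are populated before timing.
    let warmup = Instant::now();
    while warmup.elapsed().as_millis() < 200 {
        run_fft();
    }

    // Timed run, now (hopefully) at full clock speed.
    let start = Instant::now();
    run_fft();
    println!("run time = {}us", start.elapsed().as_micros());
}
```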
BINGO! Yes, I forgot about the frequency scaling issue. I changed the timing loop to run indefinitely, and sure enough it settles down to 190us, give or take a lot of noise, in both cases.
I forgot to mention that my original attempt actually ran the thing 10 times. The first run or two are always much slower, presumably as caches and such warm up. It times itself using the "time" crate, so the run time of cargo and such is not involved. The test data is always the same.
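For reference, the timing loop is essentially this shape (a sketch using std::time::Instant rather than the time crate; run_fft again stands in for the real 1024-point FFT):

```rust
use std::time::Instant;

// Placeholder for the real 1024-point FFT on fixed test data.
fn run_fft() {
    // bit-reversal and butterfly passes would go here
}

fn main() {
    // Time the same workload repeatedly; the first iteration or two are
    // typically slower while the caches warm up and the clock ramps.
    for i in 0..10 {
        let start = Instant::now();
        run_fft();
        println!("run {:2}: {}us", i, start.elapsed().as_micros());
    }
}
```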
I would also recommend checking out the criterion crate if you want more reliable benchmarks. It's explicitly designed to deal with this sort of thing, plus it gives you pretty graphs and such for free.
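A rough sketch of what a criterion benchmark looks like (this assumes a hypothetical fft_1024 function; the file would live under benches/ with a [[bench]] entry using harness = false in Cargo.toml):

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Hypothetical entry point for the code under test; swap in the real FFT.
fn fft_1024(data: &mut [f32]) {
    let _ = data;
}

fn bench_fft(c: &mut Criterion) {
    // Fixed test data, same idea as the original benchmark.
    let mut data = vec![0.0f32; 2048];
    c.bench_function("fft 1024", |b| {
        // criterion handles warm-up, repeated sampling, and outlier
        // detection, so frequency scaling and cold caches matter much less.
        b.iter(|| fft_1024(black_box(data.as_mut_slice())))
    });
}

criterion_group!(benches, bench_fft);
criterion_main!(benches);
```

Then "cargo bench" runs it and reports statistics across many iterations instead of a single number.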
Thanks for the suggestions. I have been down the rabbit hole of benchmarking/optimizing many times over the years and soon found those resources.
Typically though it ends in frustration. After much inspection, profiling, tweaking, and trying different algorithms, one reaches the point where most of the gains to be had have been achieved and one is scratching around for single-percentage improvements. Then one finds that seemingly benign changes make a difference for no fathomable reason (as the video above explains well). Then, for extra frustration, one finds that what improves performance on one platform makes it worse on another, as I found recently when moving code between a Raspberry Pi and a PC.