For a long time I have noticed large variability in run times on my Windows laptop: for no apparent reason, a program will take significantly longer to run on one occasion than another. Today I started using criterion, and it shows the same thing. Here is a sample run:
C:\Users\ano31\Rust\btree>cargo bench Clone --benches
Finished `bench` profile [optimized] target(s) in 0.12s
Running unittests src\lib.rs (target\release\deps\btree_experiment-1936bc41af1eb523.exe)
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 92 filtered out; finished in 0.00s
Running benches\crit_bench.rs (target\release\deps\crit_bench-9ddef60c69b7e48c.exe)
Gnuplot not found, using plotters backend
Clone/Exp/1000 time: [535.10 ns 538.56 ns 542.22 ns]
change: [-4.4435% -2.7929% -1.3413%] (p = 0.00 < 0.05)
Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
3 (3.00%) high mild
4 (4.00%) high severe
Clone/Std/1000 time: [6.2010 µs 6.2067 µs 6.2126 µs]
change: [-0.6990% -0.0244% +0.6773%] (p = 0.95 > 0.05)
No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
5 (5.00%) high severe
Clone/Exp/2000 time: [1.4004 µs 1.4570 µs 1.5190 µs]
change: [+8.1871% +14.667% +21.200%] (p = 0.00 < 0.05)
Performance has regressed.
Clone/Std/2000 time: [12.598 µs 12.623 µs 12.651 µs]
change: [-1.3994% -0.5114% +0.2814%] (p = 0.25 > 0.05)
No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
3 (3.00%) high mild
2 (2.00%) high severe
C:\Users\ano31\Rust\btree>
So one of the tests is showing a 14% difference compared to the previous run (obviously I didn't change anything; I just repeated the cargo bench command). I have no idea what causes these large fluctuations; it seems quite bizarre. Have you seen the same thing? Any ideas what can cause this?
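For context, the benchmark is essentially cloning maps of a given size in a loop. A rough sketch of the shape (not my exact code; the experimental map type is elided):

use std::collections::BTreeMap;
use std::hint::black_box;
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};

fn clone_bench(c: &mut Criterion) {
    let mut group = c.benchmark_group("Clone");
    for n in [1000u64, 2000] {
        let std_map: BTreeMap<u64, u64> = (0..n).map(|i| (i, i)).collect();
        group.bench_with_input(BenchmarkId::new("Std", n), &std_map, |b, m| {
            // Each iteration clones and then drops the map, so both
            // allocation and deallocation sit inside the timed region.
            b.iter(|| black_box(m.clone()))
        });
        // The "Exp" benchmarks do the same with the experimental map.
    }
    group.finish();
}

criterion_group!(benches, clone_bench);
criterion_main!(benches);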
thermal throttling, small and big cores, preferred cores, turbo boost
I assume your benchmark involves allocations. The amount of work the allocator and the underlying OS calls perform can vary between runs depending on memory pressure and such.
Making sure the system is as idle as possible, pinning the tasks to one kind of core, and using instruction-count profiling can help with some of those things.
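For example, on Windows you can pin the whole benchmark process to chosen logical CPUs from cmd with an affinity mask; something like this (the mask is hex, and which logical CPU IDs belong to which physical core is machine-specific, so treat the 3 as an assumption):

start /affinity 3 /wait target\release\deps\crit_bench-9ddef60c69b7e48c.exe --bench

(criterion bench executables need the --bench flag when launched directly rather than through cargo.)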
The 100% CPU is, I think, the "warmup". In between those regions CPU should go to 25%, but in the abnormal run something else happens in between. Quite what, I don't know yet.
Why would warmup repeatedly take a different amount of CPU time from the main runs? I can see the first one taking more time when it needs to build or initialize things, but subsequent runs should all take the same time.
But if there's a spike it might be a different process competing for CPU. I see you have only 2C/4T and a lot of applications open. It'll be hard to get a system like that quiet.
In that case, getting instruction-count summary statistics is probably a better approach, but I don't know how to do this on Windows. Another option might be getting the sum of CPU time spent at the end of the process, which on a noisy system should also be more reliable than wall-clock time.
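A minimal sketch of the CPU-time idea, assuming the cpu-time crate (which on Windows wraps GetProcessTimes):

use cpu_time::ProcessTime;

fn main() {
    let start = ProcessTime::now();
    run_benchmarks(); // hypothetical stand-in for the real work

    // User + kernel CPU time charged to this process only: other
    // processes stealing the CPU inflate wall-clock time, not this.
    println!("process CPU time: {:?}", start.elapsed());
}

fn run_benchmarks() { /* ... */ }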
I tried shutting down as many applications as I could (Chrome, various messaging apps) and disabling real-time virus protection. It may have helped a bit, but it is hard to be sure.
What I have noticed is that my version of BTreeMap clone shows more variability than the std implementation. I think this is because it has less overhead (it is 10 times faster) and so is more sensitive to memory allocation performance. So that seems to be the likely source of the variable performance, although why, I do not know.
Latest run results (note the "+25.743%" for Exp/2000):
Running benches\crit_bench.rs (target\release\deps\crit_bench-9ddef60c69b7e48c.exe)
Gnuplot not found, using plotters backend
Clone/Exp/1000 time: [611.15 ns 623.99 ns 636.81 ns]
change: [+7.8637% +11.381% +14.855%] (p = 0.00 < 0.05)
Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
Clone/Std/1000 time: [6.1683 µs 6.1739 µs 6.1807 µs]
change: [-0.7580% +0.2692% +1.6495%] (p = 0.74 > 0.05)
No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
2 (2.00%) high mild
6 (6.00%) high severe
Clone/Exp/2000 time: [1.5336 µs 1.6041 µs 1.6881 µs]
change: [+19.176% +25.743% +33.762%] (p = 0.00 < 0.05)
Performance has regressed.
Found 16 outliers among 100 measurements (16.00%)
2 (2.00%) high mild
14 (14.00%) high severe
Clone/Std/2000 time: [12.863 µs 13.047 µs 13.262 µs]
change: [+0.7239% +1.8251% +3.0070%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 12 outliers among 100 measurements (12.00%)
7 (7.00%) high mild
5 (5.00%) high severe
C:\Users\ano31\Rust\btree>
Edit: I tried benchmarking a different BTreeMap method (get) which doesn't allocate, and for that the results are much more stable (generally within 2%), which seems to confirm it is related to allocation.
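The get benchmark was along these lines (a sketch; the key and map size here are arbitrary):

use std::collections::BTreeMap;
use std::hint::black_box;
use criterion::{criterion_group, criterion_main, Criterion};

fn get_bench(c: &mut Criterion) {
    let map: BTreeMap<u64, u64> = (0..1000).map(|i| (i, i)).collect();
    // No allocation or deallocation inside the timed closure.
    c.bench_function("Get/Std/1000", |b| {
        b.iter(|| black_box(map.get(black_box(&500))))
    });
}

criterion_group!(benches, get_bench);
criterion_main!(benches);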
Oh, I just noticed it is the main timing run that uses 100% CPU, not the warmup. CPU is 100% just before the test finishes, and then goes to zero. So apparently criterion uses multiple threads for the main timing run. Or... maybe it is the analysis that uses 100% CPU. Hmm, based on the messages it outputs, maybe that is what it is.
While trying to improve my own benchmarks’ stability, I have observed that if you increase nresamples, the analysis can take many seconds. I'm not familiar with the particular method being used here, but I'd say it perhaps isn't “basic” statistics.
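For reference, nresamples is set on the Criterion configuration; a sketch (10_000 is an arbitrary choice here, the default is 100_000):

use criterion::{criterion_group, criterion_main, Criterion};

fn bench(c: &mut Criterion) { /* ... benchmarks ... */ }

criterion_group! {
    name = benches;
    // Fewer resamples shortens the post-run bootstrap analysis at the
    // cost of less precise confidence intervals.
    config = Criterion::default().nresamples(10_000);
    targets = bench
}
criterion_main!(benches);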
At the point where a benchmark starts to be dominated by allocation, it can be quite informative to try them out with a different allocator, especially on Windows where HeapAlloc isn't super great and requires taking a process-global lock. Switching to using jemalloc as your #[global_allocator] can often have an outsized impact for common allocation patterns.
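The swap itself is small; a sketch using the tikv-jemallocator crate as one possible binding (the crate choice is an assumption, and jemalloc bindings may not build on every Windows toolchain):

// Route all of the program's heap allocations through jemalloc.
use tikv_jemallocator::Jemalloc;

#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;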
I am generally using mimalloc (which tends to be significantly faster: 3 to 6 times faster than the default allocator for my clone benchmark) for testing and benchmarks.
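The allocator is wired up so it can be toggled for A/B runs; roughly like this (the cargo feature name here is hypothetical):

// When the (hypothetical) "mimalloc" feature is off, the default
// system allocator is used instead.
#[cfg(feature = "mimalloc")]
#[global_allocator]
static GLOBAL: mimalloc::MiMalloc = mimalloc::MiMalloc;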
Here are the results after temporarily disabling mimalloc and then re-enabling it:
C:\Users\ano31\Rust\btree>cargo bench Clone
Compiling btree_experiment v0.1.96 (C:\Users\ano31\Rust\btree)
Finished `bench` profile [optimized] target(s) in 6.19s
Running unittests src\lib.rs (target\release\deps\btree_experiment-c743cbd3baad9b93.exe)
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 92 filtered out; finished in 0.00s
Running benches\crit_bench.rs (target\release\deps\crit_bench-a9622620b77da297.exe)
Gnuplot not found, using plotters backend
Clone/Exp/1000 time: [749.59 ns 754.38 ns 759.75 ns]
change: [-87.667% -87.535% -87.401%] (p = 0.00 < 0.05)
Performance has improved.
Found 17 outliers among 100 measurements (17.00%)
5 (5.00%) high mild
12 (12.00%) high severe
Clone/Std/1000 time: [6.3861 µs 6.4250 µs 6.4696 µs]
change: [-78.285% -77.879% -77.522%] (p = 0.00 < 0.05)
Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
7 (7.00%) high mild
7 (7.00%) high severe
Clone/Exp/10000 time: [9.9145 µs 9.9766 µs 10.046 µs]
change: [-76.750% -76.419% -75.948%] (p = 0.00 < 0.05)
Performance has improved.
Found 16 outliers among 100 measurements (16.00%)
7 (7.00%) high mild
9 (9.00%) high severe
Clone/Std/10000 time: [69.813 µs 71.244 µs 73.330 µs]
change: [-66.723% -66.052% -65.409%] (p = 0.00 < 0.05)
Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
3 (3.00%) high severe