Unstable benchmark times

For a long time I have noticed a large variability in run times when running on my Windows laptop. For apparently no good reason a program will take significantly longer to run on one occasion than another. Today I started using criterion, and it shows that same thing. Here is a sample run:

C:\Users\ano31\Rust\btree>cargo bench Clone --benches
    Finished `bench` profile [optimized] target(s) in 0.12s
     Running unittests src\lib.rs (target\release\deps\btree_experiment-1936bc41af1eb523.exe)

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 92 filtered out; finished in 0.00s

     Running benches\crit_bench.rs (target\release\deps\crit_bench-9ddef60c69b7e48c.exe)
Gnuplot not found, using plotters backend
Clone/Exp/1000          time:   [535.10 ns 538.56 ns 542.22 ns]
                        change: [-4.4435% -2.7929% -1.3413%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe
Clone/Std/1000          time:   [6.2010 µs 6.2067 µs 6.2126 µs]
                        change: [-0.6990% -0.0244% +0.6773%] (p = 0.95 > 0.05)
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high severe
Clone/Exp/2000          time:   [1.4004 µs 1.4570 µs 1.5190 µs]
                        change: [+8.1871% +14.667% +21.200%] (p = 0.00 < 0.05)
                        Performance has regressed.
Clone/Std/2000          time:   [12.598 µs 12.623 µs 12.651 µs]
                        change: [-1.3994% -0.5114% +0.2814%] (p = 0.25 > 0.05)
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe


C:\Users\ano31\Rust\btree>

So one of the tests is showing a 14% difference for one of the tests compared to the previous run (obviously I didn't change anything, I just repeated the cargo bench command ). I have no idea what causes these large fluctuations, it seems quite bizarre. Have you seen the same thing? Any ideas what can cause this?

laptop

thermal throttling, small and big cores, preferred cores, turbo boost

Clone

I assume your benchmark involves allocations. The amount of work an allocator and the underlying OS calls performs can vary between runs depending on memory pressure and such.

Making sure the system is as idle as possible, pinning the tasks to one kind of core and using instruction-count profiling can help with some of those things.

2 Likes

I have just been watching with task monitor, and eventually I caught "something" going on.

This was an abnormal run with results like this:

     Running benches\crit_bench.rs (target\release\deps\crit_bench-9ddef60c69b7e48c.exe)
Gnuplot not found, using plotters backend
Benchmarking Clone/Exp/1000: Collecting 100 samples in estimated 5.0004 s (8.8M iteClone/Exp/1000          time:   [563.39 ns 591.03 ns 626.77 ns]
                        change: [-1.1031% +2.2794% +6.0054%] (p = 0.21 > 0.05)
                        No change in performance detected.
Found 15 outliers among 100 measurements (15.00%)
  4 (4.00%) high mild
  11 (11.00%) high severe
Benchmarking Clone/Std/1000: Collecting 100 samples in estimated 5.0071 s (656k iteClone/Std/1000          time:   [7.0484 µs 7.2135 µs 7.4043 µs]
                        change: [+17.383% +21.846% +27.459%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe
Benchmarking Clone/Exp/2000: Collecting 100 samples in estimated 5.0037 s (3.6M iteClone/Exp/2000          time:   [1.3634 µs 1.4009 µs 1.4446 µs]
                        change: [+1.5993% +4.1263% +6.6068%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe
Benchmarking Clone/Std/2000: Collecting 100 samples in estimated 5.0349 s (328k iteClone/Std/2000          time:   [14.220 µs 14.448 µs 14.725 µs]
                        change: [+20.592% +25.155% +29.938%] (p = 0.00 < 0.05)
                        Performance has regressed.


C:\Users\ano31\Rust\btree>

Here is what a "normal" run looks like:

The 100% CPU is I think the "Warmup". In-between those regions, CPU should go to 25% but in the abnormal run something else happens in-between. Quite what I don't know yet.

Why would warmup repeatedly take a different amount of CPU time than the main runs? I can see the first one taking more time when it needs to build or initialize things but subsequent runs should all take the same time.

But if there's a spike it might be a different process competing for CPU. I see you have only 2C/4T and a lot of applications open. It'll be hard to get a system like that quiet.

In that case getting instruction count summary statistics is probably a better approach. But I don't know how to do this on windows. Another option might getting the sum of CPU time spent at the end of the process, which on a noisy system should also be more reliable than wall-clock time

I tried shutting down as many applications as I could ( chrome, various messaging apps ) and disabling real-time virus protection. It may have helped a bit, it is hard to be sure.

What I have noticed is my version of BTreeMap clone shows more variability than the std implementation. I think this would be because it has less overhead (it is 10 times faster) and so is more sensitive to memory allocation performance. So it seems that is the likely source of the variable performance, although why I do not know.

Latest run results ( note: "+25.743%" for Exp/2000 ):

     Running benches\crit_bench.rs (target\release\deps\crit_bench-9ddef60c69b7e48c.exe)
Gnuplot not found, using plotters backend
Benchmarking Clone/Exp/1000: Collecting 100 samples in estimateClone/Exp/1000          time:   [611.15 ns 623.99 ns 636.81 ns]
                        change: [+7.8637% +11.381% +14.855%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
Benchmarking Clone/Std/1000: Collecting 100 samples in estimateClone/Std/1000          time:   [6.1683 µs 6.1739 µs 6.1807 µs]
                        change: [-0.7580% +0.2692% +1.6495%] (p = 0.74 > 0.05)
                        No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe
Benchmarking Clone/Exp/2000: Collecting 100 samples in estimateClone/Exp/2000          time:   [1.5336 µs 1.6041 µs 1.6881 µs]
                        change: [+19.176% +25.743% +33.762%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 16 outliers among 100 measurements (16.00%)
  2 (2.00%) high mild
  14 (14.00%) high severe
Benchmarking Clone/Std/2000: Collecting 100 samples in estimateClone/Std/2000          time:   [12.863 µs 13.047 µs 13.262 µs]
                        change: [+0.7239% +1.8251% +3.0070%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 12 outliers among 100 measurements (12.00%)
  7 (7.00%) high mild
  5 (5.00%) high severe


C:\Users\ano31\Rust\btree>

Edit: I tried bench-marking a different BTreeMap method ( get ) which doesn't allocate and for that the results are much more stable (generally within 2%), which seems to confirm it is related to allocation.

Oh, I just noticed it is the main timing run that uses 100% CPU, not the warmup. CPU is 100% just before the test finishes, and then goes to zero. So apparently criterion uses multiple threads for the main timing run. Or... maybe it is the analysis that uses 100% CPU. Hmm, based on the messages it outputs, maybe that is what it is.

Not to my knowledge.

It shouldn't. Doing basic statistics over a 100 samples should not take much time

1 Like

While trying to improve my own benchmarks’ stability, I have observed that if you increase nresamples, the analysis can take many seconds. I'm not familiar with the particular method being used here, but I'd say it perhaps isn't “basic” statistics.

3 Likes

At the point where a benchmark starts to be dominated by allocation, it can be quite informative to try them out with a different allocator, especially on Windows where HeapAlloc isn't super great and requires taking a process-global lock. Switching to using jemalloc as your #[global_allocator] can often have an outsized impact for common allocation patterns.

1 Like

I am generally using mimalloc ( which tends to be significantly faster ( 3 to 6 times faster for my clone benchmark than the default allocator ) for testing and benchmarks.

Here are results after temporarily disabling mimalloc then re-enabling it:

C:\Users\ano31\Rust\btree>cargo bench Clone
   Compiling btree_experiment v0.1.96 (C:\Users\ano31\Rust\btree)
    Finished `bench` profile [optimized] target(s) in 6.19s
     Running unittests src\lib.rs (target\release\deps\btree_experiment-c743cbd3baad9b93.exe)

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 92 filtered out; finished in 0.00s

     Running benches\crit_bench.rs (target\release\deps\crit_bench-a9622620b77da297.exe)
Gnuplot not found, using plotters backend
Clone/Exp/1000          time:   [749.59 ns 754.38 ns 759.75 ns]
                        change: [-87.667% -87.535% -87.401%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 17 outliers among 100 measurements (17.00%)
  5 (5.00%) high mild
  12 (12.00%) high severe
Clone/Std/1000          time:   [6.3861 µs 6.4250 µs 6.4696 µs]
                        change: [-78.285% -77.879% -77.522%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  7 (7.00%) high mild
  7 (7.00%) high severe
Clone/Exp/10000         time:   [9.9145 µs 9.9766 µs 10.046 µs]
                        change: [-76.750% -76.419% -75.948%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 16 outliers among 100 measurements (16.00%)
  7 (7.00%) high mild
  9 (9.00%) high severe
Clone/Std/10000         time:   [69.813 µs 71.244 µs 73.330 µs]
                        change: [-66.723% -66.052% -65.409%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high severe

The benchmark code is here: