Consistently inconsistent performance? Task always takes either 15ms or 50ms when run repeatedly

I have the following program:

use std::time::Instant;

fn main() {
    for _ in 0..10 {
        let start = Instant::now();
        for _ in 0..100 {
            let mut items = vec![0; 1024 * 256];

            for i in 0..items.len() {
                items[i] = 1;
            }
        }
        let elapsed = Instant::now() - start;
        println!("{:?}", elapsed);
    }
}

When I run it, I get times of either ~14.5ms or ~50ms, but the times are consistent within each run, i.e. if I launch the program twice, one run might show 14.5ms for every iteration while the other shows 50ms for every iteration. Things I have looked into:

  • I have cranked up the iterations on the outer loop to 10,000 and the effect stays consistent - whatever time shows up when the application starts is the time it will keep hitting as it continues.
  • I have closed all other applications on my computer to ensure they don't interfere and the effect still shows.
  • I have run the binary directly from target/ to ensure cargo isn't changing it between runs.
  • I have disabled hyperthreading and removed all but one stick of RAM and the effect still shows.
  • There is no correlation between fast/slow runs and which core the program is running on (according to htop).
  • This only happens with a debug build at optimization level 3. With no optimization, if the effect is still there, it is drowned out by noise. With a release build, the effect does not show up.
  • perf shows that, in the slow case, a cmp instruction is running 86% of the time and an add instruction is running 10% of the time. In the fast case, the same cmp instruction is running 80% of the time and the same add instruction is running 11% of the time.

Info:

  • OS is Ubuntu 20.04.2
  • CPU is AMD Ryzen 5 3600 (12) @ 3.600GHz
  • Rust version is 1.52.0-nightly

Sample "fast" run:

20.249347ms
14.946827ms
15.592126ms
15.469634ms
15.595895ms
15.700627ms
15.18378ms
15.081419ms
15.23662ms
15.354563ms
15.475324ms
15.688966ms
15.443524ms
15.920099ms
15.12749ms
15.18383ms
15.121179ms
15.1809ms
15.425873ms
15.572275ms
15.160489ms
23.224846ms
14.911027ms
15.053817ms
15.459823ms
15.251822ms
15.684346ms
15.515915ms
15.26012ms
15.290671ms
16.564018ms
15.896948ms
14.908487ms
15.878409ms
15.861049ms
14.896607ms
15.739077ms
16.284444ms
15.067088ms
15.667027ms
15.812199ms
15.318181ms
15.980461ms
15.657966ms
15.475034ms
15.489454ms
15.554925ms
15.600436ms
16.218023ms
15.863329ms
15.87535ms
15.825108ms
15.1664ms
16.184113ms
15.663907ms
15.880679ms
15.008787ms
16.589609ms
23.366318ms
15.001907ms
15.18311ms
19.220753ms
15.1867ms
15.432154ms
16.873502ms
14.674743ms
14.765175ms
14.777884ms
14.797756ms
14.865805ms
14.911876ms
14.723284ms
14.697333ms
14.776004ms
15.043877ms
15.14965ms
14.850205ms
15.602765ms
14.844195ms
14.926297ms
15.507103ms
15.25596ms
15.153439ms
15.246681ms
14.944817ms
15.413412ms
15.729576ms
15.706237ms
16.101811ms
15.18059ms
15.589735ms
15.253971ms
15.567565ms
15.27175ms
26.397537ms
14.995638ms
15.642586ms
16.200123ms
15.345301ms
14.677693ms
15.149829ms
21.745096ms
15.288681ms
15.597016ms
15.629945ms
15.121729ms
15.621485ms
15.564305ms
16.353355ms
16.034581ms

Sample "slow" run:

52.256082ms
47.592484ms
50.968555ms
50.984696ms
52.573536ms
50.770153ms
43.068017ms
47.915938ms
43.032007ms
44.560616ms
52.408444ms
53.863111ms
52.589966ms
53.163403ms
50.406799ms
54.030184ms
53.894742ms
52.362922ms
52.313553ms
54.124534ms
52.068789ms
54.089154ms
52.076679ms
51.468612ms
50.50714ms
52.257011ms
53.105012ms
51.006516ms
53.211423ms
54.381718ms
50.893135ms
49.470986ms
54.007163ms
52.481154ms
51.572123ms
54.346257ms
52.387704ms
50.136515ms
50.203384ms
54.235326ms
49.555918ms
52.787127ms
50.293756ms
44.935129ms
49.654259ms
52.91538ms
54.969694ms
54.512049ms
56.898899ms
48.484264ms
53.092131ms
52.067778ms
54.246196ms
54.165065ms
51.964927ms
52.308851ms
53.79712ms
52.309061ms
52.015257ms
51.593512ms
52.935439ms
51.953866ms
53.88526ms
51.38511ms
52.130588ms
54.108263ms
48.777298ms
48.344721ms
49.090792ms
49.696338ms
45.473746ms
54.418648ms
50.116674ms
50.202755ms
45.073971ms
52.826778ms
50.949824ms
45.696538ms
46.238945ms
47.903816ms
48.609675ms
53.308464ms
53.031829ms
53.889511ms
54.351236ms
51.019335ms
52.780546ms
50.107864ms
51.717913ms
50.494289ms
53.008639ms
50.575049ms
52.19422ms
46.086803ms
53.310303ms
52.524953ms
53.482225ms
54.141443ms
52.811996ms
52.164508ms
53.754308ms
50.721071ms
53.394544ms
52.197339ms
50.516998ms
53.246762ms
54.014482ms
52.531543ms
52.469533ms
51.431919ms

Just a wild guess, maybe I'm totally off... Maybe it depends on where the vec lands relative to page boundaries or something like that? Perhaps this makes a difference due to a different ability to do caching. You might want to print the address of the vec's allocation (e.g. by allocating a same-sized vec at the very beginning of the program, printing its address, and dropping it again). Maybe there's a correlation between the lowest few bits/bytes of that address and whether it's a fast or slow run.

Doesn't seem to be the case:

use std::time::Instant;

/*
Slow:
0x7fdb344c1010
0x7fb2a865d010
0x7f4dd7fcd010
0x7fb2300b4010
0x7efc8ccad010
0x7f9a463b7010
0x7f50f2d97010
0x7ff9af554010
0x7ff81e382010
Fast:
0x7f9cfd1d4010
0x7fc50f490010
0x7fd4a3032010
0x7efe01687010
0x7f95875f7010
*/

fn main() {
    let mut items = vec![0; 1024 * 256];
    println!("{:?}", &items[0] as *const i32);
    for _ in 0..10_000 {
        let start = Instant::now();
        for _ in 0..100 {
            for i in 0..items.len() {
                items[i] = 1;
            }
        }
        let elapsed = Instant::now() - start;
        println!("{:?}", elapsed);
    }
}

Could you try moving the body of the main function into a separate function and calling that function multiple times?
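Something along these lines is what I mean (just a sketch; the function name and the outer repeat count are arbitrary):

use std::time::Instant;

fn bench() {
    for _ in 0..10 {
        let start = Instant::now();
        for _ in 0..100 {
            let mut items = vec![0; 1024 * 256];
            for i in 0..items.len() {
                items[i] = 1;
            }
        }
        println!("{:?}", Instant::now() - start);
    }
}

fn main() {
    // call the benchmark function several times from main
    for _ in 0..3 {
        bench();
    }
}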

Also, just a very wild guess: maybe what you are seeing is whether or not the OS has to clear a page for your process while performing the allocation? This could be tested by moving the allocation outside the time measurement.

Edit: the last sentence might actually not be true, because the OS could hand you an address as if the page were already given, but only actually initialize it on the first read/write.

Edit2: is the behavior still observable if you remove the writes to the array? My guess would be yes, since without any reads the writes should be optimized out at O3 anyway.
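By removing the writes I mean something like this (same program, just with the inner write loop deleted; only a sketch):

use std::time::Instant;

fn main() {
    for _ in 0..10_000 {
        let start = Instant::now();
        for _ in 0..100 {
            // allocate only; the per-element writes from the original are gone,
            // so at opt-level=3 the whole allocation may be removed entirely
            let _items = vec![0; 1024 * 256];
        }
        println!("{:?}", Instant::now() - start);
    }
}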

Both moving to a separate function and moving the allocation outside the measurement produce the same result as before. Removing the writes results in times dropping to near zero (20-30 nanoseconds).

Very very wild idea, but could you print the address of the main function?...

Regarding the writes: interesting behavior. I thought LLVM was very aggressive when it comes to removing writes to memory that is never read again. Maybe the bounds checks inhibit the optimization in this case? Either an iterator version or an unsafe version might help here.
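For the unsafe route, I am thinking of something like this (untested sketch; raw pointer writes so the bounds checks disappear):

use std::time::Instant;

fn main() {
    for _ in 0..10_000 {
        let mut items = vec![0; 1024 * 256];
        let start = Instant::now();
        for _ in 0..100 {
            let ptr = items.as_mut_ptr();
            for i in 0..items.len() {
                // SAFETY: i is always less than items.len(), so the write stays in bounds
                unsafe { *ptr.add(i) = 1 };
            }
        }
        println!("{:?}", Instant::now() - start);
    }
}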

Doesn't seem to show a pattern:

use std::time::Instant;

/*
Slow:
0x55e69b11d210
0x56038f688210
0x555a9fb56210
0x559ecdeee210
0x55cf0be99210
0x558597281210
Fast:
0x562e47469210
0x562c70454210
0x562f6b3f0210
0x55a45c98d210
0x55d9de9e2210
*/

fn main() {
    println!("{:p}", main as fn());
    for _ in 0..10_000 {
        let mut items = vec![0; 1024 * 256];
        let start = Instant::now();
        for _ in 0..100 {
            for i in 0..items.len() {
                items[i] = 1;
            }
        }
        let elapsed = Instant::now() - start;
        println!("{:?}", elapsed);
    }
}

The iterator version is consistently much faster and stops displaying the variation:

use std::time::Instant;

fn main() {
    for _ in 0..10_000 {
        let mut items = vec![0; 1024 * 256];
        let start = Instant::now();
        for _ in 0..100 {
            for item in &mut items {
                *item = 1;
            }
        }
        let elapsed = Instant::now() - start;
        println!("{:?}", elapsed);
    }
}

I see the same generated code on Godbolt with -C opt-level=3 -C debug-assertions=y.

And at opt-level=2 the outer loop increment just gets moved from the top of the loop to the bottom.

Is it the same on your computer? I'm not sure I hit all the rustc flags to mirror your setup.

-g is also required, but that doesn't seem to cause the examples to diverge.

Rather than printing out a handful of presumably representative addresses and squinting to see a pattern in them, try disabling ASLR entirely and see if the randomness goes away.

Are you sure this is not caused by CPU throttling? Maybe you see fast runs when the CPU is cool and not throttled, and slow runs when it is hot and throttled?

That is why the OP should use a better benchmarking framework.

In my experience, fancy benchmarking tools like criterion or hyperfine are not effective at discovering throttling. What is effective is running s-tui (https://github.com/amanusk/s-tui), a terminal-based CPU stress and monitoring utility, during the benchmark to see whether the CPU is actually running at full speed.
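If you prefer to log it from inside the program itself, something like this works on Linux (assuming the cpufreq driver exposes scaling_cur_freq on your system); printing it next to each elapsed time would show whether the slow runs coincide with a lower clock:

use std::fs;

/// Current frequency of cpu0 in kHz, if the cpufreq sysfs entry is available.
fn cpu0_freq_khz() -> Option<u64> {
    fs::read_to_string("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq")
        .ok()?
        .trim()
        .parse()
        .ok()
}

fn main() {
    println!("cpu0 frequency: {:?} kHz", cpu0_freq_khz());
}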

The thing Iā€d try here is perf stat: the summary it prints often catches the underlying reason for benchmark results.

https://man7.org/linux/man-pages/man1/perf-stat.1.html
