black_box reordering causes weird 5x performance differences

The two functions shown below are nearly identical, except for the order of let mut state = vec![1., 0.]; and test::black_box(123);. But one takes 126 µs while the other takes just 22 µs. Why?

This also happens if, instead of black_box, I call a function that is fast and side-effect-free but not inlined because it lives in another crate.

This effect goes away if I replace

let x = state[0];
let v = state[1];
let diff = vec![v, -x];

with let diff = vec![state[1], -state[0]];

pub fn slow() -> Vec<f64> {
    let mut state = vec![1., 0.];
    test::black_box(123);
    for _ in 0..10000 {
        let x = state[0];
        let v = state[1];
        let diff = vec![v, -x];
        for (diff, state) in diff.into_iter().zip(state.iter_mut()) {
            *state = *state + diff * 0.01;
        }
    }
    state
}
pub fn fast() -> Vec<f64> {
    test::black_box(123);
    let mut state = vec![1., 0.];
    for _ in 0..10000 {
        let x = state[0];
        let v = state[1];
        let diff = vec![v, -x];
        for (diff, state) in diff.into_iter().zip(state.iter_mut()) {
            *state = *state + diff * 0.01;
        }
    }
    state
}

black_box is basically an optimization barrier, trying to keep anything before it from being known to anything after it. So it makes total sense that the compiler will optimize better in the second case, where it's not blocking any meaningful information.
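
For example (a minimal, hypothetical sketch using std::hint::black_box, which behaves like test::black_box): without the barrier the computation can be folded into a constant at compile time, while with it the input stays opaque and the result has to be produced at run time.

use std::hint::black_box;

// Without black_box, the sum over a known range can be folded to the
// constant 4950 at compile time.
pub fn folded() -> u64 {
    (0u64..100).sum()
}

// With black_box, the upper bound is opaque to the optimizer, so the result
// has to be computed at run time (as a loop or a closed-form expression).
pub fn not_folded() -> u64 {
    (0u64..black_box(100)).sum()
}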

But also, I'm not sure there's anything meaningful going on here. Vectors that only ever have 2 elements are not something that should be used in perf-sensitive code. What are you actually trying to do? And this is a microbenchmark, so are you certain you're measuring something meaningful? (It's extremely hard to microbenchmark meaningfully.)

  1. We don't pass state as an argument into black_box, so the optimizer knows that state cannot change.
  2. This still does not explain how let diff = vec![state[1], -state[0]]; is faster than
let x = state[0];
let v = state[1];
let diff = vec![v, -x];
  1. A vector with 2 elements can be faster than an array, a tuple, or two separate variables.
  2. I know that this is kind of a micro-benchmark, but it's an implementation of Euler's method, and a 5x performance difference is weird, even in a microbenchmark.

I think you're overestimating the control that black_box has. One common implementation is inline assembly marked as "this writes to arbitrary stuff", and with that implementation it doesn't matter what you pass to it, the call invalidates all knowledge (or at least most -- as always with hints there's no guarantee about what it'll specifically do).
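
For illustration, a hand-rolled sketch of such an asm-based black_box (not the actual std/test/criterion implementation): the empty asm block takes a pointer to the value, and because it is not marked nomem or readonly, the optimizer has to assume it may read or write any reachable memory.

use std::arch::asm;

// Sketch only: an empty asm block that receives a pointer to the value.
// Since it may (as far as the compiler knows) read or write arbitrary
// memory, knowledge about memory contents is discarded across the call,
// regardless of which value was passed in.
pub fn my_black_box<T>(dummy: T) -> T {
    let ptr: *const T = &dummy;
    unsafe {
        asm!("", in(reg) ptr, options(nostack));
    }
    dummy
}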

Please explain why you think this to be true.

(Note that vec! is the "dynamic array" meaning of vector, not the "simd" meaning.)

The problem with microbenchmarks is that it's not that weird to measure that much of a performance difference spuriously. For example, there was an example a while back where the exact same code got faster or slower depending on uncalled code elsewhere in the program, because it changed the alignment of the code in the binary.

Isn't using assembly to write to anything you don't have a &mut to UB?

Because I had some microbenchmark that said so. (It is very weird that a vec outperforms an array).

I'm not sure if I can find the code, but (I think that) I recently measured that iterating over a vec with N elements is faster than just having N variables and copy-pasting the code.

My other question still stands, how is let diff = vec![state[1], -state[0]]; different from

let x = state[0];
let v = state[1];
let diff = vec![v, -x];

I'd guess it's because the single-line version uses state to define diff inline/directly, while the other option reads from state into two new locals x and v, then uses them to create diff, and then drops them. But the Rust compiler does a lot of its own optimizations, so that may not be the full story.

Yes. It's so weird (for a 2-element array) that it makes me suspect a measurement issue.

Please share a benchmark where criterion shows the 2-element vector being faster than a 2-element array. That way we can either find the benchmark problem or open a rustc issue about it.

Well, for one, the order in which you index things can matter. Here's a quick demo: https://rust.godbolt.org/z/ndjKcrf7a
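
For context, one thing indexing order affects is bounds checks. A hypothetical sketch of the kind of difference involved (not necessarily exactly what the godbolt demo shows):

// Indexing element 0 first and element 1 second needs two bounds checks.
pub fn two_bounds_checks(state: &[f64]) -> [f64; 2] {
    let x = state[0]; // checks len >= 1
    let v = state[1]; // checks len >= 2, a second, separate check
    [v, -x]
}

// Indexing element 1 first proves the slice has at least two elements,
// so the later access to element 0 needs no further check.
pub fn one_bounds_check(state: &[f64]) -> [f64; 2] {
    let v = state[1]; // checks len >= 2
    let x = state[0]; // already known to be in bounds
    [v, -x]
}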

To be fair, this is a microbenchmark, but (sometimes) a vec outperforms an array:

i5-4590

vec                     time:   [26.030 us 26.095 us 26.157 us]  
array                   time:   [34.850 us 34.926 us 35.001 us] 

i5-4590 with export RUSTFLAGS="-C target-cpu=native"

vec                     time:   [23.127 us 23.169 us 23.211 us] 
array                   time:   [23.449 us 23.501 us 23.559 us]                   

i5-6300U with export RUSTFLAGS="-C target-cpu=native"

vec                     time:   [38.091 us 38.532 us 39.031 us]                 
array                   time:   [39.694 us 40.104 us 40.632 us]                   

i5-6300U

vec                     time:   [35.050 us 35.474 us 35.999 us]                 
array                   time:   [35.067 us 35.539 us 36.164 us]     
fn vec_dgl(state: &Vec<f64>) -> Vec<f64> {
    let x = state[0];
    let v = state[1];
    vec![v, -x]
}
pub fn vec_euler(x: f64, v: f64) -> (f64, f64) {
    let mut state = vec![x, v];
    let dt = 0.01;
    for _ in 0..10000 {
        let diff = vec_dgl(&state);
        // diff.into_iter().zip(state.iter_mut()).for_each(|(diff, state)| *state = *state + diff * dt); // same performance; https://doc.rust-lang.org/std/iter/trait.Iterator.html#method.for_each says this is equivalent to using a for loop on the iterator
        for (diff, state) in diff.into_iter().zip(state.iter_mut()) {
            *state = *state + diff * dt;
        }
    }
    (state[0], state[1])
}

fn array_dgl(state: [f64; 2]) -> [f64; 2] {
    let x = state[0];
    let v = state[1];
    [v, -x]
}
pub fn array_euler(x: f64, v: f64) -> (f64, f64) {
    let mut state = [x, v];
    let dt = 0.01;
    for _ in 0..10000 {
        let diff = array_dgl(state);
        for (diff, state) in diff.into_iter().zip(state.iter_mut()) {
            *state = *state + diff * dt;
        }
    }
    (state[0], state[1])
}
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use rust::*;

pub fn criterion_benchmark(c: &mut Criterion) {
    c.bench_function("vec", |b| b.iter(|| vec_euler(black_box(1.), black_box(0.))));
    c.bench_function("array", |b| b.iter(|| array_euler(black_box(1.), black_box(0.))));
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

Interesting. When both versions take their arguments by reference, the performance is the same:

@@ -16,7 +16,7 @@ pub fn vec_euler(x: f64, v: f64) -> (f64, f64) {
     (state[0], state[1])
 }
 
-fn array_dgl(state: [f64; 2]) -> [f64; 2] {
+fn array_dgl(state: &[f64; 2]) -> [f64; 2] {
     let x = state[0];
     let v = state[1];
     [v, -x]
@@ -25,7 +25,7 @@ pub fn array_euler(x: f64, v: f64) -> (f64, f64) {
     let mut state = [x, v];
     let dt = 0.01;
     for _ in 0..10000 {
-        let diff = array_dgl(state);
+        let diff = array_dgl(&state);
         for (diff, state) in diff.into_iter().zip(state.iter_mut()) {
             *state = *state + diff * dt;
         }

Results on my x86_64 machine, after that change:

vec                     time:   [20.599 us 20.734 us 20.870 us]                 
array                   time:   [20.611 us 20.733 us 20.864 us]                   

This might be one of those cases where Rust is generating too many memcpys, and LLVM does not manage to optimize them away.


Can confirm
