black_box reordering causes weird 5x performance differences

The two functions shown below are nearly identical, except for the order of let mut state = vec![1., 0.]; and test::black_box(123);. But one takes 126 µs while the other takes just 22 µs. Why?

This also happens if, instead of black_box, I call a function that is fast and side-effect-free but not inlined because it lives in another crate.

This effect goes away if I replace

let x = state[0];
let v = state[1];
let diff = vec![v, -x];

with let diff = vec![state[1], -state[0]];

pub fn slow() -> Vec<f64> {
    let mut state = vec![1., 0.];
    test::black_box(123);
    for _ in 0..10000 {
        let x = state[0];
        let v = state[1];
        let diff = vec![v, -x];
        for (diff, state) in diff.into_iter().zip(state.iter_mut()) {
            *state = *state + diff * 0.01;
        }
    }
    state
}
pub fn fast() -> Vec<f64> {
    test::black_box(123);
    let mut state = vec![1., 0.];
    for _ in 0..10000 {
        let x = state[0];
        let v = state[1];
        let diff = vec![v, -x];
        for (diff, state) in diff.into_iter().zip(state.iter_mut()) {
            *state = *state + diff * 0.01;
        }
    }
    state
}

black_box is basically an optimization barrier, trying to keep anything before it from being known to anything after it. So it makes total sense that the compiler will optimize better in the second case, where it's not blocking any meaningful information.
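
For example (a minimal, hypothetical sketch using std::hint::black_box, which behaves like test::black_box): without the barrier the computation can be folded into a constant at compile time, while with it the input stays opaque and the result has to be produced at run time.

use std::hint::black_box;

// Without black_box, the sum over a known range can be folded to the
// constant 4950 at compile time.
pub fn folded() -> u64 {
    (0u64..100).sum()
}

// With black_box, the upper bound is opaque to the optimizer, so the result
// has to be computed at run time (as a loop or a closed-form expression).
pub fn not_folded() -> u64 {
    (0u64..black_box(100)).sum()
}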

But also, I'm not sure there's anything meaningful going on here. Vectors that only ever have 2 elements are not something that should be used in perf-sensitive code. What are you actually trying to do? And this is a microbenchmark, so are you certain you're measuring something meaningful? (It's extremely hard to microbenchmark meaningfully.)

  1. We don't pass state as an argument into black_box, so the optimizer knows that state cannot change.
  2. This still does not explain how let diff = vec![state[1], -state[0]]; is faster than
let x = state[0];
let v = state[1];
let diff = vec![v, -x];
  1. A vector with 2 elements can be faster than an array, a tuple, or two separate variables.
  2. I know that this is kind of a micro-benchmark, but it's an implementation of Euler's method, and a 5x performance difference is weird, even in a microbenchmark.

I think you're overestimating the control that black_box has. One common implementation is inline assembly marked as "this writes to arbitrary stuff", and with that implementation it doesn't matter what you pass to it, the call invalidates all knowledge (or at least most -- as always with hints there's no guarantee about what it'll specifically do).
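
For illustration, a hand-rolled sketch of such an asm-based black_box (not the actual std/test/criterion implementation): the empty asm block takes a pointer to the value, and because it is not marked nomem or readonly, the optimizer has to assume it may read or write any reachable memory.

use std::arch::asm;

// Sketch only: an empty asm block that receives a pointer to the value.
// Since it may (as far as the compiler knows) read or write arbitrary
// memory, knowledge about memory contents is discarded across the call,
// regardless of which value was passed in.
pub fn my_black_box<T>(dummy: T) -> T {
    let ptr: *const T = &dummy;
    unsafe {
        asm!("", in(reg) ptr, options(nostack));
    }
    dummy
}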

Please explain why you think this to be true.

(Note that vec! is the "dynamic array" meaning of vector, not the "simd" meaning.)

The problem with microbenchmarks is that it's not that weird to measure that much of a performance difference spuriously. For example, there was an example a while back where the exact same code got faster or slower depending on uncalled code elsewhere in the program, because it changed the alignment of the code in the binary.

Isn't using assembly to write to anything you don't have a &mut to UB?

Because I had some microbenchmark that said so. (It is very weird that a vec outperforms an array).

I'm not sure if I can find the code, but (I think that) I recently measured that iterating over a vec with N elements is faster than just having N variables and copy-pasting the code.

My other question still stands, how is let diff = vec![state[1], -state[0]]; different from

let x = state[0];
let v = state[1];
let diff = vec![v, -x];

I'd guess it's because the single-line version uses state to define diff inline/directly, while the other option reads from state into two new locals x and v, then uses them to create diff, and then drops them. But the Rust compiler does a lot of its own optimizations, so that may not be the full story.

Yes. It's so weird (for a 2-element array) that it makes me suspect a measurement issue.

Please share a benchmark where criterion shows the 2-element vector being faster than a 2-element array. That way we can either find the benchmark problem or open a rustc issue about it.

Well, for one, the order in which you index things can matter. Here's a quick demo: https://rust.godbolt.org/z/ndjKcrf7a
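
For context, one thing indexing order affects is bounds checks. A hypothetical sketch of the kind of difference involved (not necessarily exactly what the godbolt demo shows):

// Indexing element 0 first and element 1 second needs two bounds checks.
pub fn two_bounds_checks(state: &[f64]) -> [f64; 2] {
    let x = state[0]; // checks len >= 1
    let v = state[1]; // checks len >= 2, a second, separate check
    [v, -x]
}

// Indexing element 1 first proves the slice has at least two elements,
// so the later access to element 0 needs no further check.
pub fn one_bounds_check(state: &[f64]) -> [f64; 2] {
    let v = state[1]; // checks len >= 2
    let x = state[0]; // already known to be in bounds
    [v, -x]
}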

To be fair, this is a microbenchmark, but (sometimes) a vec outperforms an array:

i5-4590

vec                     time:   [26.030 us 26.095 us 26.157 us]  
array                   time:   [34.850 us 34.926 us 35.001 us] 

i5-4590 with export RUSTFLAGS="-C target-cpu=native"

vec                     time:   [23.127 us 23.169 us 23.211 us] 
array                   time:   [23.449 us 23.501 us 23.559 us]                   

i5-6300U with export RUSTFLAGS="-C target-cpu=native"

vec                     time:   [38.091 us 38.532 us 39.031 us]                 
array                   time:   [39.694 us 40.104 us 40.632 us]                   

i5-6300U

vec                     time:   [35.050 us 35.474 us 35.999 us]                 
array                   time:   [35.067 us 35.539 us 36.164 us]     
fn vec_dgl(state: &Vec<f64>) -> Vec<f64> {
    let x = state[0];
    let v = state[1];
    vec![v, -x]
}
pub fn vec_euler(x: f64, v: f64) -> (f64, f64) {
    let mut state = vec![x, v];
    let dt = 0.01;
    for _ in 0..10000 {
        let diff = vec_dgl(&state);
        // diff.into_iter().zip(state.iter_mut()).for_each(|(diff, state)| *state = *state + diff * dt); // same performance; https://doc.rust-lang.org/std/iter/trait.Iterator.html#method.for_each says this is equivalent to using a for loop on the iterator
        for (diff, state) in diff.into_iter().zip(state.iter_mut()) {
            *state = *state + diff * dt;
        }
    }
    (state[0], state[1])
}

fn array_dgl(state: [f64; 2]) -> [f64; 2] {
    let x = state[0];
    let v = state[1];
    [v, -x]
}
pub fn array_euler(x: f64, v: f64) -> (f64, f64) {
    let mut state = [x, v];
    let dt = 0.01;
    for _ in 0..10000 {
        let diff = array_dgl(state);
        for (diff, state) in diff.into_iter().zip(state.iter_mut()) {
            *state = *state + diff * dt;
        }
    }
    (state[0], state[1])
}
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use rust::*;

pub fn criterion_benchmark(c: &mut Criterion) {
    c.bench_function("vec", |b| b.iter(|| vec_euler(black_box(1.), black_box(0.))));
    c.bench_function("array", |b| b.iter(|| array_euler(black_box(1.), black_box(0.))));
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

Interesting. When both versions take their arguments by reference, the performance is the same:

@@ -16,7 +16,7 @@ pub fn vec_euler(x: f64, v: f64) -> (f64, f64) {
     (state[0], state[1])
 }
 
-fn array_dgl(state: [f64; 2]) -> [f64; 2] {
+fn array_dgl(state: &[f64; 2]) -> [f64; 2] {
     let x = state[0];
     let v = state[1];
     [v, -x]
@@ -25,7 +25,7 @@ pub fn array_euler(x: f64, v: f64) -> (f64, f64) {
     let mut state = [x, v];
     let dt = 0.01;
     for _ in 0..10000 {
-        let diff = array_dgl(state);
+        let diff = array_dgl(&state);
         for (diff, state) in diff.into_iter().zip(state.iter_mut()) {
             *state = *state + diff * dt;
         }

Results on my x86_64 machine, after that change:

vec                     time:   [20.599 us 20.734 us 20.870 us]                 
array                   time:   [20.611 us 20.733 us 20.864 us]                   

This might be one of those cases where Rust is generating too many memcpys, and LLVM does not manage to optimize them away.


Can confirm
