`copy_from_slice` benchmarking very slow

Hi, I'm new to Rust and I've been playing around with how to copy parts of arrays. I've come up with the following three possibilities:

use std::convert::TryInto;

fn main() {
    let foo = [1, 2, 3, 4, 5, 6, 7, 8];

    println!("{:?}", try_into(&foo));
    println!("{:?}", explicit(&foo));
    println!("{:?}", copy_from_slice(&foo));
}

fn try_into(foo: &[u8; 8]) -> [u8; 4] {
    let bar: [u8; 4] = foo[0..4].try_into().unwrap();
    bar
}

fn explicit(foo: &[u8; 8]) -> [u8; 4] {
    let bar = [foo[2], foo[3], foo[4], foo[5]];
    bar
}

fn copy_from_slice(foo: &[u8; 8]) -> [u8; 4] {
    let mut bar = [0u8; 4];
    bar.copy_from_slice(&foo[4..8]);
    bar
}

I didn't really like any of those (if I've missed alternatives I'd be happy to hear them), so I ran them through a benchmark to see how they performed:

#[bench]
fn bench_try_into(b: &mut Bencher) {
    let foo = [1, 2, 3, 4, 5, 6, 7, 8];
    b.iter(|| {
        for _ in 0..10000 {
            black_box(try_into(&foo));
        }
    });
}

#[bench]
fn bench_explicit(b: &mut Bencher) {
    let foo = [1, 2, 3, 4, 5, 6, 7, 8];
    b.iter(|| {
        for _ in 0..10000 {
            black_box(explicit(&foo));
        }
    });
}

#[bench]
fn bench_copy_from_slice(b: &mut Bencher) {
    let foo = [1, 2, 3, 4, 5, 6, 7, 8];
    b.iter(|| {
        for _ in 0..10000 {
            black_box(copy_from_slice(&foo));
        }
    });
}
Fully runnable code:
#![feature(test)]

use std::convert::TryInto;

fn main() {
    let foo = [1, 2, 3, 4, 5, 6, 7, 8];

    println!("{:?}", try_into(&foo));
    println!("{:?}", explicit(&foo));
    println!("{:?}", copy_from_slice(&foo));
}

fn try_into(foo: &[u8; 8]) -> [u8; 4] {
    let bar: [u8; 4] = foo[0..4].try_into().unwrap();
    bar
}

fn explicit(foo: &[u8; 8]) -> [u8; 4] {
    let bar = [foo[2], foo[3], foo[4], foo[5]];
    bar
}

fn copy_from_slice(foo: &[u8; 8]) -> [u8; 4] {
    let mut bar = [0u8; 4];
    bar.copy_from_slice(&foo[4..8]);
    bar
}

#[cfg(test)]
mod tests {
    extern crate test;
    use super::*;
    use test::{Bencher, black_box};
    
    #[bench]
    fn bench_try_into(b: &mut Bencher) {
        let foo = [1, 2, 3, 4, 5, 6, 7, 8];
        b.iter(|| {
            for _ in 0..10000 {
                black_box(try_into(&foo));
            }
        });
    }
    
    #[bench]
    fn bench_explicit(b: &mut Bencher) {
        let foo = [1, 2, 3, 4, 5, 6, 7, 8];
        b.iter(|| {
            for _ in 0..10000 {
                black_box(explicit(&foo));
            }
        });
    }
    
    #[bench]
    fn bench_copy_from_slice(b: &mut Bencher) {
        let foo = [1, 2, 3, 4, 5, 6, 7, 8];
        b.iter(|| {
            for _ in 0..10000 {
                black_box(copy_from_slice(&foo));
            }
        });
    }
}

I was very surprised by the results. try_into and explicit seem to be identical. copy_from_slice, however, is approximately 15 times(!) as slow on my machine:

$ rustup run nightly cargo bench
    Finished bench [optimized] target(s) in 0.00s
     Running target/release/deps/test_crate-fdf4af69eb357b62

running 3 tests
test tests::bench_copy_from_slice ... bench:      47,334 ns/iter (+/- 1,071)
test tests::bench_explicit        ... bench:       3,176 ns/iter (+/- 226)
test tests::bench_try_into        ... bench:       3,192 ns/iter (+/- 242)

Is there an obvious mistake I'm making? And if not, what is the reason that copy_from_slice is so much slower than the other two options? Its documentation states that it uses memcpy, so I would have expected only small differences.

All of your functions do different things, so I changed them all to have the same behavior, and they all generate exactly the same asm.

use std::convert::TryInto;

pub fn try_into(foo: &[u8; 8]) -> [u8; 4] {
    let bar: [u8; 4] = foo[4..8].try_into().unwrap();
    bar
}

pub fn explicit(foo: &[u8; 8]) -> [u8; 4] {
    let bar = [foo[4], foo[5], foo[6], foo[7]];
    bar
}

pub fn copy_from_slice(foo: &[u8; 8]) -> [u8; 4] {
    let mut bar = [0u8; 4];
    bar.copy_from_slice(&foo[4..8]);
    bar
}
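
There's also a fourth way to spell this, using the TryFrom impl for array references (a sketch on my part, not something from the post above; I'd expect it to compile to the same thing):

use std::convert::TryFrom;

pub fn try_from_ref(foo: &[u8; 8]) -> [u8; 4] {
    // &[u8; 4] implements TryFrom<&[u8]>, so we can borrow the
    // sub-array in place and copy it out with a single dereference.
    *<&[u8; 4]>::try_from(&foo[4..8]).unwrap()
}

With optimizations on, all of these should boil down to a single 4-byte load; on x86_64 I'd expect something like mov eax, dword ptr [rdi + 4] followed by ret. That also means there is almost no real work for the benchmarks to measure.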

For operations this short, microbenchmarks are really bad at telling you what's actually going on; it's better to analyze the assembly. For example, your fully runnable benchmark on my machine gives these three results from just shuffling the order of the benchmarks (the middle benchmark is always about 2x slower than the other two):

running 3 tests
test tests::bench_copy_from_slice ... bench:       5,589 ns/iter (+/- 451)
test tests::bench_explicit        ... bench:      11,080 ns/iter (+/- 423)
test tests::bench_try_into        ... bench:       5,590 ns/iter (+/- 739)
running 3 tests
test tests::bench_copy_from_slice ... bench:       5,674 ns/iter (+/- 1,169)
test tests::bench_explicit        ... bench:       8,323 ns/iter (+/- 9,960)
test tests::bench_try_into        ... bench:      11,002 ns/iter (+/- 1,396)
running 3 tests
test tests::bench_copy_from_slice ... bench:      11,057 ns/iter (+/- 741)
test tests::bench_explicit        ... bench:       5,652 ns/iter (+/- 5,443)
test tests::bench_try_into        ... bench:       5,629 ns/iter (+/- 1,778)
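
One more caveat (my observation, not something verified in this thread): in the benchmarks above only the result goes through black_box, so the compiler still sees &foo as a compile-time constant and may fold part of the work away. Passing the input through black_box as well makes the measurement a bit more honest, e.g.:

#[bench]
fn bench_try_into(b: &mut Bencher) {
    let foo = [1, 2, 3, 4, 5, 6, 7, 8];
    b.iter(|| {
        for _ in 0..10000 {
            // black_box on the input keeps the compiler from
            // specializing the copy to a known constant array.
            black_box(try_into(black_box(&foo)));
        }
    });
}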

Hmm, guess that makes sense. How do I control order of execution?

I just swapped the order of the benchmark functions in the source file.

I tried benchmarking this with criterion and got this:

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

     Running target\release\deps\my_benchmark-f8eb2ab076a3d495.exe
bench_explicit          time:   [12.600 us 12.650 us 12.711 us]
Found 11 outliers among 100 measurements (11.00%)
  2 (2.00%) high mild
  9 (9.00%) high severe

bench_copy_from_slice   time:   [12.558 us 12.602 us 12.660 us]
Found 13 outliers among 100 measurements (13.00%)
  2 (2.00%) high mild
  11 (11.00%) high severe

bench_try_into          time:   [12.690 us 12.802 us 12.930 us]
Found 14 outliers among 100 measurements (14.00%)
  1 (1.00%) high mild
  13 (13.00%) high severe
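
For reference, a minimal criterion setup for this looks roughly like the following (a sketch; the crate name test_crate is an assumption, and the three functions need to be pub so the bench target can see them):

// benches/my_benchmark.rs
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use test_crate::{copy_from_slice, explicit, try_into}; // hypothetical crate name

fn copy_benches(c: &mut Criterion) {
    let foo = [1u8, 2, 3, 4, 5, 6, 7, 8];
    c.bench_function("bench_explicit", |b| {
        b.iter(|| {
            for _ in 0..10000 {
                black_box(explicit(black_box(&foo)));
            }
        })
    });
    c.bench_function("bench_copy_from_slice", |b| {
        b.iter(|| {
            for _ in 0..10000 {
                black_box(copy_from_slice(black_box(&foo)));
            }
        })
    });
    c.bench_function("bench_try_into", |b| {
        b.iter(|| {
            for _ in 0..10000 {
                black_box(try_into(black_box(&foo)));
            }
        })
    });
}

criterion_group!(benches, copy_benches);
criterion_main!(benches);

plus a [[bench]] entry with harness = false for it in Cargo.toml.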

Changing the order of the benchmarks seems to have no effect in my case. But it's hard to argue against identical ASM, which I can reproduce on my machine. I'll look into criterion, and maybe I'll take a look at what cargo bench is doing under the hood.

Thank you.

Um, with #[bench], source order shouldn't matter; it runs the benchmarks in alphabetical order. So it's just another kind of demonstration of how results vary between runs(?)


I have tried again by prefixing the benchmarks with a, b and c and trying all combinations. I still see copy_from_slice being consistently 15 times as slow, no matter the order.
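
(Concretely, something like this, with the a/b/c prefixes permuted for each combination and the bodies left unchanged:)

#[bench]
fn a_copy_from_slice(b: &mut Bencher) {
    let foo = [1, 2, 3, 4, 5, 6, 7, 8];
    b.iter(|| {
        for _ in 0..10000 {
            black_box(copy_from_slice(&foo));
        }
    });
}

// ...and likewise b_explicit and c_try_into.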

Switching to criterion on the other hand gives me basically the same result for all three functions, as expected.

When I have some time I'll try to investigate the behavior of #[bench] more closely. After letting this sink in some more, it definitely seems fishy to me how consistent the timings are across many runs. If we consider the results to be worthless because there's too little work in the benchmark, then I'd expect them to be more random!
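
My rough mental model of what b.iter does (a sketch, not libtest's actual code) is something like this:

use std::time::Instant;

fn iter_sketch<F: FnMut()>(mut f: F) -> u64 {
    // libtest actually picks the iteration count adaptively and
    // reports a summarized ns/iter; this only shows the averaging.
    let n: u64 = 10_000;
    let start = Instant::now();
    for _ in 0..n {
        f();
    }
    start.elapsed().as_nanos() as u64 / n
}

Averaging over that many iterations would make the reported numbers very stable even if what's being measured is mostly fixed overhead, so consistency alone doesn't mean the comparison is meaningful.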
