Strange result of a copying benchmark

Cerber-Ursi · May 14, 2020, 4:48pm

I was experimenting with microbenchmarking and ran into somewhat strange behavior. It looks like I've made a mistake somewhere, but I can't see where, so maybe someone more experienced will be able to help.

The issue is with copying u32s into slices. This can be done in two general ways, depending on the endianness, so I tried to check whether there is any difference between them:

#![feature(test)]
extern crate test;
use test::{black_box, Bencher};

fn loop_copy_with(mut src: u32, dst: &mut [u8], f: fn(u32) -> [u8; 4]) {
    // Without this loop, it seems that I've measured only noise.
    for _i in 0..1_000_000 {
        dst.copy_from_slice(&f(src));
        // This is added so that the compiler won't optimize the loop into a single iteration.
        src = black_box(src);
    }
}

fn to_slice(src: u32, dst: &mut [u8]) {
    loop_copy_with(src, dst, u32::to_ne_bytes);
}

fn to_slice_be(src: u32, dst: &mut [u8]) {
    loop_copy_with(src, dst, u32::to_be_bytes);
}

fn to_slice_le(src: u32, dst: &mut [u8]) {
    loop_copy_with(src, dst, u32::to_le_bytes);
}

fn bench_with(b: &mut Bencher, f: fn(u32, &mut [u8])) {
    let src = black_box(0xfeedcafe);
    let mut slice = [0u8; 4];
    let dst = black_box(&mut slice);
    b.iter(|| {
        f(src, dst)
    });
}

#[bench]
pub fn copy_to_slice(b: &mut Bencher) {
    bench_with(b, to_slice);
}

#[bench]
pub fn copy_to_slice_be(b: &mut Bencher) {
    bench_with(b, to_slice_be);
}

#[bench]
pub fn copy_to_slice_le(b: &mut Bencher) {
    bench_with(b, to_slice_le);
}

And here's the sample result:

test copy::copy_to_slice                      ... bench:   1,958,617 ns/iter (+/- 26,243)
test copy::copy_to_slice_be                   ... bench:   1,238,975 ns/iter (+/- 37,325)
test copy::copy_to_slice_le                   ... bench:   1,961,901 ns/iter (+/- 556,931)

The target architecture is little-endian, so it isn't surprising that the little-endian result matches the native-endian one. But why is the big-endian variant, with its additional byte-swapping, faster? Is this a glitch in benchmarking (and if so, can it be fixed), or does this look like a real difference?
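
For reference, here is a minimal sketch of what the three conversions produce for the value used in the benchmark. The helper name `encodings` is hypothetical and only for illustration. The native-endian result depends on the target, so it is only printed, not assumed.

```rust
/// Hypothetical helper: returns the (native, big, little) endian
/// encodings of `x`, in that order.
fn encodings(x: u32) -> ([u8; 4], [u8; 4], [u8; 4]) {
    (x.to_ne_bytes(), x.to_be_bytes(), x.to_le_bytes())
}

fn main() {
    let (ne, be, le) = encodings(0xfeedcafe);
    // Big-endian stores the most significant byte first,
    // little-endian stores the least significant byte first.
    println!("native: {:02x?}", ne);
    println!("big:    {:02x?}", be);
    println!("little: {:02x?}", le);
    // The big-endian encoding equals the little-endian encoding
    // of the byte-swapped value.
    assert_eq!(be, 0xfeedcafe_u32.swap_bytes().to_le_bytes());
}
```

On a little-endian target the native and little-endian arrays are identical, so only the big-endian variant needs the extra swap.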

mbrubeck · May 14, 2020, 6:45pm

I don't have a good explanation for this, but I did get an interesting result when experimenting with the code. I can reproduce the performance difference in your original code, but when I changed it slightly, the performance difference disappeared:

fn loop_copy_with(dst: &mut [u8], f: fn(u32) -> [u8; 4]) {
    // Without this loop, it seems that I've measured only noise.
    for i in 0..1_000_000 {
        dst.copy_from_slice(&f(i));
        black_box(&dst);
    }
}

fn to_slice(dst: &mut [u8]) {
    loop_copy_with(dst, u32::to_ne_bytes);
}

fn to_slice_be(dst: &mut [u8]) {
    loop_copy_with(dst, u32::to_be_bytes);
}

fn to_slice_le(dst: &mut [u8]) {
    loop_copy_with(dst, u32::to_le_bytes);
}

Results:

test copy_to_slice    ... bench:   1,165,651 ns/iter (+/- 134,631)
test copy_to_slice_be ... bench:   1,149,998 ns/iter (+/- 129,459)
test copy_to_slice_le ... bench:   1,139,345 ns/iter (+/- 238,416)

Perhaps there was some subtle effect of code layout or instruction ordering in the original benchmark that produced an outsized difference in this tight loop.
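
One way to cross-check a result like this outside the nightly `test` harness is a small hand-rolled timing loop on stable Rust. The following is only a sketch, not the code either of us ran: the names `copy_loop` and `best_of` are hypothetical, and it assumes a toolchain where `std::hint::black_box` is available (stable since Rust 1.66). It reports the fastest of several runs, which tends to be less sensitive to scheduling noise than a mean, though it can still be affected by code layout.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical variant of the loop above: the loop counter is passed
/// through `black_box` and the written bytes are made observable.
fn copy_loop(iters: u32, dst: &mut [u8], f: fn(u32) -> [u8; 4]) {
    for i in 0..iters {
        dst.copy_from_slice(&f(black_box(i)));
        black_box(&mut *dst);
    }
}

/// Runs `copy_loop` `runs` times and returns the shortest elapsed time
/// (zero if `runs` is 0).
fn best_of(runs: usize, iters: u32, f: fn(u32) -> [u8; 4]) -> Duration {
    let mut dst = [0u8; 4];
    (0..runs)
        .map(|_| {
            let start = Instant::now();
            copy_loop(iters, &mut dst, f);
            start.elapsed()
        })
        .min()
        .unwrap_or_default()
}

fn main() {
    let cases: [(&str, fn(u32) -> [u8; 4]); 3] = [
        ("ne", u32::to_ne_bytes),
        ("be", u32::to_be_bytes),
        ("le", u32::to_le_bytes),
    ];
    for (name, f) in cases {
        println!("{}: {:?}", name, best_of(10, 1_000_000, f));
    }
}
```

Build with optimizations (`--release`) for the numbers to mean anything. If the three timings come out close here as well, that would be consistent with the original gap being a layout or ordering artifact, but it would not prove it.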

