I was experimenting with microbenchmarking and ran into somewhat strange behavior. It looks like I made a mistake somewhere, but I can't see where, so maybe someone more experienced will be able to help.
The issue is with copying u32s into byte slices. This can be done in a few ways depending on endianness, so I tried to check whether there is any measurable difference between them:
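Just to be explicit about what each conversion produces, here is the byte order for the constant used in the benchmark below (plain sanity-check code, separate from the benchmark itself):

```rust
fn main() {
    let x: u32 = 0xfeedcafe;
    // Little-endian: least significant byte first.
    assert_eq!(x.to_le_bytes(), [0xfe, 0xca, 0xed, 0xfe]);
    // Big-endian: most significant byte first.
    assert_eq!(x.to_be_bytes(), [0xfe, 0xed, 0xca, 0xfe]);
    // Native order matches one of the two, depending on the target.
}
```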
#![feature(test)]
extern crate test;

use test::{black_box, Bencher};

fn loop_copy_with(mut src: u32, dst: &mut [u8], f: fn(u32) -> [u8; 4]) {
    // Without this loop, it seems that I measured only noise.
    for _i in 0..1_000_000 {
        dst.copy_from_slice(&f(src));
        // Keeps the compiler from collapsing the loop into a single iteration.
        src = black_box(src);
    }
}

fn to_slice(src: u32, dst: &mut [u8]) {
    loop_copy_with(src, dst, u32::to_ne_bytes);
}

fn to_slice_be(src: u32, dst: &mut [u8]) {
    loop_copy_with(src, dst, u32::to_be_bytes);
}

fn to_slice_le(src: u32, dst: &mut [u8]) {
    loop_copy_with(src, dst, u32::to_le_bytes);
}

fn bench_with(b: &mut Bencher, f: fn(u32, &mut [u8])) {
    let src = black_box(0xfeedcafe);
    let mut slice = [0u8; 4];
    let dst = black_box(&mut slice);
    b.iter(|| f(src, dst));
}

#[bench]
pub fn copy_to_slice(b: &mut Bencher) {
    bench_with(b, to_slice);
}

#[bench]
pub fn copy_to_slice_be(b: &mut Bencher) {
    bench_with(b, to_slice_be);
}

#[bench]
pub fn copy_to_slice_le(b: &mut Bencher) {
    bench_with(b, to_slice_le);
}
And here is a sample result:
test copy::copy_to_slice ... bench: 1,958,617 ns/iter (+/- 26,243)
test copy::copy_to_slice_be ... bench: 1,238,975 ns/iter (+/- 37,325)
test copy::copy_to_slice_le ... bench: 1,961,901 ns/iter (+/- 556,931)
The target architecture is little-endian, so it is no surprise that the native-endian variant performs the same as the little-endian one. But why is the big-endian variant, which does an additional byte swap, faster? Is this a glitch in the benchmark (and if so, can it be fixed), or does this look like a real difference?
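For completeness, the claim about the target being little-endian can be verified directly, so that is not the source of the surprise (a small sanity check, not part of the benchmark):

```rust
fn main() {
    let x: u32 = 0xfeedcafe;
    if cfg!(target_endian = "little") {
        // On a little-endian target the native order is the little-endian one,
        // so to_ne_bytes should be a plain copy with no byte swap.
        assert_eq!(x.to_ne_bytes(), x.to_le_bytes());
    } else {
        assert_eq!(x.to_ne_bytes(), x.to_be_bytes());
    }
}
```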