Strange result of a copying benchmark

I was experimenting with microbenchmarking operations and ran into somewhat strange behavior. It looks like I made a mistake somewhere, but I can't see where, so maybe someone more experienced will be able to help.

The issue is with copying u32s into slices. This can be done in two general ways, depending on the endianness, so I tried to check whether there is any difference between them:

// Requires a nightly toolchain with #![feature(test)] enabled at the crate root.
extern crate test;
use test::{black_box, Bencher};

fn loop_copy_with(mut src: u32, dst: &mut [u8], f: fn(u32) -> [u8; 4]) {
    // Without this loop, it seems that I measure only noise.
    for _i in 0..1_000_000 {
        // This is added so that the compiler won't optimize the loop into a single iteration.
        src = black_box(src);
        // Convert the value to bytes and copy them into the destination slice.
        dst.copy_from_slice(&f(src));
    }
}

fn to_slice(src: u32, dst: &mut [u8]) {
    loop_copy_with(src, dst, u32::to_ne_bytes);
}

fn to_slice_be(src: u32, dst: &mut [u8]) {
    loop_copy_with(src, dst, u32::to_be_bytes);
}

fn to_slice_le(src: u32, dst: &mut [u8]) {
    loop_copy_with(src, dst, u32::to_le_bytes);
}

fn bench_with(b: &mut Bencher, f: fn(u32, &mut [u8])) {
    let src = black_box(0xfeedcafe);
    let mut slice = [0u8; 4];
    let dst = black_box(&mut slice);
    b.iter(|| {
        f(src, dst)
    });
}

#[bench]
pub fn copy_to_slice(b: &mut Bencher) {
    bench_with(b, to_slice);
}

#[bench]
pub fn copy_to_slice_be(b: &mut Bencher) {
    bench_with(b, to_slice_be);
}

#[bench]
pub fn copy_to_slice_le(b: &mut Bencher) {
    bench_with(b, to_slice_le);
}

And here's a sample result:

test copy::copy_to_slice                      ... bench:   1,958,617 ns/iter (+/- 26,243)
test copy::copy_to_slice_be                   ... bench:   1,238,975 ns/iter (+/- 37,325)
test copy::copy_to_slice_le                   ... bench:   1,961,901 ns/iter (+/- 556,931)

The target architecture is little-endian, so it isn't surprising at all that the little-endian variant performs the same as the native-endian one. But why is the big-endian variant, with its additional byte swapping, faster? Is this a glitch in benchmarking (and if so, can it be fixed), or does this look like a real difference?
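To make the "additional byte-swapping" concrete, here is a minimal sketch of what the conversions produce for the benchmark's input value. The helper `be_via_swap` is a hypothetical name introduced only for this illustration; it is not part of the benchmark.

```rust
// Hypothetical helper (not in the benchmark): builds the big-endian
// byte order by swapping the bytes first and then storing little-endian.
fn be_via_swap(x: u32) -> [u8; 4] {
    x.swap_bytes().to_le_bytes()
}

fn main() {
    let src: u32 = 0xfeedcafe;

    // Little-endian: least significant byte first.
    assert_eq!(src.to_le_bytes(), [0xfe, 0xca, 0xed, 0xfe]);
    // Big-endian: most significant byte first.
    assert_eq!(src.to_be_bytes(), [0xfe, 0xed, 0xca, 0xfe]);
    // Big-endian output is the little-endian output of the byte-swapped value.
    assert_eq!(src.to_be_bytes(), be_via_swap(src));

    // On a little-endian target, native-endian is the same as little-endian.
    #[cfg(target_endian = "little")]
    assert_eq!(src.to_ne_bytes(), src.to_le_bytes());
}
```

So on this target the big-endian variant should, if anything, do one byte swap more per iteration than the other two.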

I don't have a good explanation for this, but I did get an interesting result when experimenting with the code. I can reproduce the performance difference in your original code, but when I changed it slightly, the performance difference disappeared:

fn loop_copy_with(dst: &mut [u8], f: fn(u32) -> [u8; 4]) {
    // Without this loop, it seems that I've measured only noise.
    for i in 0..1_000_000 {
        // Use the loop counter as the value to copy.
        dst.copy_from_slice(&f(i));
    }
}

fn to_slice(dst: &mut [u8]) {
    loop_copy_with(dst, u32::to_ne_bytes);
}

fn to_slice_be(dst: &mut [u8]) {
    loop_copy_with(dst, u32::to_be_bytes);
}

fn to_slice_le(dst: &mut [u8]) {
    loop_copy_with(dst, u32::to_le_bytes);
}


test copy_to_slice    ... bench:   1,165,651 ns/iter (+/- 134,631)
test copy_to_slice_be ... bench:   1,149,998 ns/iter (+/- 129,459)
test copy_to_slice_le ... bench:   1,139,345 ns/iter (+/- 238,416)

Perhaps there was some subtle effect of code layout or instruction ordering in the original benchmark that produced an outsized difference in this tight loop.
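One way to probe whether the difference comes from the harness is to time the same loop with a second, independent setup. The following is only a rough sketch on stable Rust, assuming `std::time::Instant` and `std::hint::black_box` are acceptable stand-ins for the `test` crate; `time_copy` is a hypothetical helper, and I have not checked whether this reproduces the gap.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

// Hypothetical helper: runs `iters` conversions of a fixed value through `f`,
// copying each result into a 4-byte buffer, and returns the elapsed time
// together with the final buffer contents.
fn time_copy(f: fn(u32) -> [u8; 4], iters: u32) -> (Duration, [u8; 4]) {
    let src: u32 = 0xfeedcafe;
    let mut dst = [0u8; 4];
    let start = Instant::now();
    for _ in 0..iters {
        // Hide the input from the optimizer on every iteration.
        let bytes = f(black_box(src));
        dst.copy_from_slice(&bytes);
        // Treat the buffer as observed so the store is not removed.
        black_box(&dst);
    }
    (start.elapsed(), dst)
}

fn main() {
    // Repeat the measurement a few times to see how stable the numbers are.
    for round in 0..3 {
        let (ne, _) = time_copy(u32::to_ne_bytes, 1_000_000);
        let (be, _) = time_copy(u32::to_be_bytes, 1_000_000);
        let (le, _) = time_copy(u32::to_le_bytes, 1_000_000);
        println!("round {round}: ne={ne:?} be={be:?} le={le:?}");
    }
}
```

If the three variants come out roughly equal here, that would point toward a layout or ordering effect in the original benchmark rather than a real cost difference between the conversions.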

