Zero cost wrapper type?

#1

Hello,
I was just wondering if it’s truly possible to create a zero-overhead wrapper type (alright, one of my friends challenged me lmao). My best attempt is here, but it’s still significantly slower. Any optimization ideas?
Thanks,
ndrewxie
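
The playground link above isn’t reproduced here, but the wrapper being benchmarked is presumably a newtype along these lines (a sketch only; the #[repr(transparent)] attribute is my addition and isn’t confirmed anywhere in the thread):

#[repr(transparent)]
struct UsizeWrapper {
    data: usize,
}

impl UsizeWrapper {
    #[inline(always)]
    pub fn from(dat: usize) -> Self {
        Self { data: dat }
    }

    #[inline(always)]
    pub fn get_data(&self) -> usize {
        self.data
    }
}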

#2
Wrapper took 2340364
Usize took 2349386
Difference is -0.38%

My only change was to loop 300,000 times.

#3

I made the wrapper faster.

Wrapper took 2880
Usize took 3701
Difference is -22.18%

My best guess is that ThreadRng does some initialization on its first call, and that cost was being attributed to the wrapper because you call it first; swapping the call order then attributes it to the non-wrapped value instead.
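
One way to check this in a hand-rolled harness is to touch the RNG once before starting any timer, so the lazy initialization falls outside both measurements. A minimal sketch, with a loop count and labels of my own choosing rather than the original playground’s:

use rand::Rng;
use std::time::Instant;

fn main() {
    let mut rng = rand::thread_rng();
    // Warm up ThreadRng so its one-time initialization is not charged
    // to whichever measurement happens to run first.
    let _: usize = rng.gen();

    let start = Instant::now();
    let mut sum: usize = 0;
    for _ in 0..300_000 {
        sum = sum.wrapping_add(rng.gen::<usize>() + 5);
    }
    println!("Usize took {:?} (checksum {})", start.elapsed(), sum);
    // The wrapped version would be timed the same way after this point.
}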

1 Like
#4

Don’t write your own test harness with elapsed; it’s too easy to get misleading results that way. Use Bencher or something else that will adjust the number of iterations and calculate the standard deviation, so that you can avoid noisy results.
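
On stable Rust, Criterion is one example of “something else”; a sketch of what such a harness might look like (the benchmark names and layout are mine, not from the thread):

// e.g. benches/wrapper.rs, with `harness = false` set for this bench
// target in Cargo.toml.
use criterion::{criterion_group, criterion_main, Criterion};
use rand::Rng;
use std::hint::black_box;

struct UsizeWrapper {
    data: usize,
}

impl UsizeWrapper {
    #[inline(always)]
    fn from(dat: usize) -> Self {
        Self { data: dat }
    }

    #[inline(always)]
    fn get_data(&self) -> usize {
        self.data
    }
}

fn bench_wrapper_vs_usize(c: &mut Criterion) {
    // Criterion chooses the iteration count and reports mean and
    // standard deviation, so the two cases can be compared directly.
    c.bench_function("wrapper", |b| {
        let mut rng = rand::thread_rng();
        b.iter(|| black_box(UsizeWrapper::from(rng.gen()).get_data() + 5))
    });
    c.bench_function("usize", |b| {
        let mut rng = rand::thread_rng();
        b.iter(|| black_box(rng.gen::<usize>() + 5))
    });
}

criterion_group!(benches, bench_wrapper_vs_usize);
criterion_main!(benches);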

3 Likes
#5

Echoing what others have said, your benchmark method greatly favors whatever doesn’t have to initialize ThreadRng.

The playground can output the generated code, and if you ask this playground to emit the asm, you can see that the result is the same. In fact, the linker simply merges the two functions.
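
The linked playground isn’t reproduced here, but the comparison presumably amounts to something like the sketch below: two public, non-mangled functions whose emitted asm can be compared side by side (the function names are mine).

use rand::Rng;
use rand::rngs::ThreadRng;

pub struct UsizeWrapper {
    data: usize,
}

impl UsizeWrapper {
    #[inline(always)]
    pub fn from(dat: usize) -> Self {
        Self { data: dat }
    }

    #[inline(always)]
    pub fn get_data(&self) -> usize {
        self.data
    }
}

// Once the wrapper is inlined away, identical asm should be emitted
// for both of these.
#[no_mangle]
pub fn wrapped(rng: &mut ThreadRng) -> usize {
    UsizeWrapper::from(rng.gen()).get_data() + 5
}

#[no_mangle]
pub fn unwrapped(rng: &mut ThreadRng) -> usize {
    rng.gen::<usize>() + 5
}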

#6

Using Bencher instead, the code could look like this:

#![feature(test)]
extern crate test;

use rand::Rng;
use test::{black_box, Bencher};

struct UsizeWrapper {
    data: usize,
}

impl UsizeWrapper {
    #[inline(always)]
    pub fn from(dat: usize) -> Self {
        Self { data: dat }
    }

    #[inline(always)]
    pub fn get_data(&self) -> usize {
        self.data
    }
}

// The wrapper is constructed and unwrapped on every iteration.
#[bench]
fn bench_wrapper(b: &mut Bencher) {
    let mut rng = rand::thread_rng();
    b.iter(|| {
        for _ in 0..300 {
            let y = UsizeWrapper::from(rng.gen());
            black_box(y.get_data() + 5);
        }
    });
}

// The same work on a plain usize, without the wrapper.
#[bench]
fn bench_usize(b: &mut Bencher) {
    let mut rng = rand::thread_rng();
    b.iter(|| {
        for _ in 0..300 {
            let y: usize = rng.gen();
            black_box(y + 5);
        }
    });
}

// Baseline: just the RNG call, with no addition afterwards.
#[bench]
fn bench_nothing(b: &mut Bencher) {
    let mut rng = rand::thread_rng();
    b.iter(|| {
        for _ in 0..300 {
            black_box(rng.gen::<usize>());
        }
    });
}

On my machine, the version that skips the addition after generating a random number is about 1 percent faster, which suggests roughly 99 percent of the time is spent generating the random numbers. So even when using a better tool to time the code, it still matters what you time, and this benchmark is completely dominated by the random number generator.

#7

Strange, I moved the random number generation outside of the loop (the values are now precomputed and pushed into a vector). Even after that, though, whichever test runs first is still slower. Code is here. Any ideas why?
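
The linked code isn’t shown here, but the precomputation presumably looks something like this sketch (the loop count, labels, and structure are my guesses, not the original playground):

use rand::Rng;
use std::hint::black_box;
use std::time::Instant;

struct UsizeWrapper {
    data: usize,
}

impl UsizeWrapper {
    #[inline(always)]
    fn from(dat: usize) -> Self {
        Self { data: dat }
    }

    #[inline(always)]
    fn get_data(&self) -> usize {
        self.data
    }
}

fn main() {
    // Precompute the random inputs so the RNG sits outside both timed loops.
    let mut rng = rand::thread_rng();
    let inputs: Vec<usize> = (0..300_000).map(|_| rng.gen()).collect();

    let start = Instant::now();
    for &x in &inputs {
        black_box(UsizeWrapper::from(x).get_data() + 5);
    }
    println!("Wrapper took {:?}", start.elapsed());

    let start = Instant::now();
    for &x in &inputs {
        black_box(x + 5);
    }
    println!("Usize took {:?}", start.elapsed());
}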

#8

This could be due to dynamic frequency scaling: the CPU ramps up its clock rate once it has been under load for a while, so whichever test runs first executes at a lower average frequency.
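
One way to reduce that effect in a hand-rolled harness (my suggestion, not from the thread) is an untimed warm-up loop before the first measurement, so the clock has already ramped up:

use std::hint::black_box;

// Untimed warm-up: burn some cycles so the CPU's frequency governor has
// already raised the clock before the first timed loop starts.
fn warm_up_cpu() {
    let mut acc: usize = 0;
    for i in 0..5_000_000usize {
        // black_box keeps the compiler from replacing the loop with a
        // closed-form result.
        acc = acc.wrapping_add(black_box(i));
    }
    black_box(acc);
}

Calling warm_up_cpu() once before the first timed region should make the two measurements run at a more comparable frequency.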

#9

aaah, makes sense. Thanks!