Why is `Box<dyn Fn()>` as fast as a normal `fn`?

To my knowledge, Box<dyn Fn()> is a fat pointer, and each time it is called, it looks up its vtable. So calling a Box<dyn Fn()> should be slower than calling an fn().
But in this test, I found the two methods have the same performance.
Here is the code:

pub struct ForTest(pub Vec<f32>);

pub trait DynMethod {
    fn iter_func_dyn<'a>(&'a self) -> Box<dyn Fn(usize) -> bool + 'a>;
    fn iter_all_dyn(&self) -> Vec<bool>;
}

pub trait FuncMethod {
    fn iter_func_func(&self, i: usize) -> bool;
    fn iter_all_func(&self) -> Vec<bool>;
}

impl DynMethod for ForTest {
    fn iter_func_dyn<'a>(&'a self) -> Box<dyn Fn(usize) -> bool + 'a> {
        let data = &self.0;
        Box::new(move |i| data[i] > 10.)
    }
    fn iter_all_dyn(&self) -> Vec<bool> {
        let f = self.iter_func_dyn();
        let mut res = vec![];
        for i in 0..self.0.len() {
            res.push(f(i));
        }
        res
    }
}

impl FuncMethod for ForTest {
    fn iter_func_func(&self, i: usize) -> bool {
        self.0[i] > 10.
    }
    fn iter_all_func(&self) -> Vec<bool> {
        let mut res = vec![];
        for i in 0..self.0.len() {
            res.push(self.iter_func_func(i))
        }
        res
    }
}

macro_rules! t {
    ($($tt:tt)+) => {
        let timer = std::time::Instant::now();
        $($tt)+;
        println!("{:?}", timer.elapsed());
    }
}

fn main() {
    let data = vec![11f32; 1_000_000];
    let for_test = ForTest(data);
    t!(for_test.iter_all_dyn());
    t!(for_test.iter_all_func());
}

34.638314ms
30.731192ms

I suspect the compiler has done something to optimize the code. If that's true, in what situations can the compiler do this, and in what situations can it not?

LLVM has a devirtualization pass. Like many other passes, you shouldn't rely on it always working or always not working for any particular case.
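A minimal sketch of a case where it usually fires (an illustration, not a guarantee): when the optimizer can see the concrete type behind the pointer, it can replace the vtable lookup with a direct call and inline it.

fn answer() -> i32 {
    // The concrete closure type is visible at the call site, so the
    // optimizer can often replace the vtable lookup with a direct call
    // and inline the closure body (the Box allocation may still remain).
    let f: Box<dyn Fn(i32) -> i32> = Box::new(|x| x + 1);
    f(41)
}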

5 Likes

Here's the assembly diff for the two versions of the code:

The relevant difference is in the iter_all_* code: although the iter_all_dyn code is longer, that's seemingly largely because the _dyn version needs to allocate a Box<dyn Fn>. If you compare the loop bodies, between the jump labels, there are no external calls and they are mostly the same, with only a few extra fetches in the dyn case. Those are probably about grabbing the captured locals out of the Box<dyn Fn> whose function it devirtualized and inlined, though (interestingly) it did not remove the allocation itself.

In any case, the listed performance difference is more than 10%, which would normally be statistically significant, but you should try using hyperfine to verify on your machine. For me I get:

$ hyperfine ".\target\release\rust-scratch-2 {N}" -L N dyn,func --warmup 2
Benchmark 1: .\target\release\rust-scratch-2 dyn
  Time (mean ± σ):       9.9 ms ±   1.7 ms    [User: 1.2 ms, System: 2.2 ms]
  Range (min … max):     7.5 ms …  15.7 ms    163 runs

Benchmark 2: .\target\release\rust-scratch-2 func
  Time (mean ± σ):       9.9 ms ±   1.6 ms    [User: 2.3 ms, System: 2.9 ms]
  Range (min … max):     7.3 ms …  15.3 ms    159 runs

Summary
  .\target\release\rust-scratch-2 dyn ran
    1.00 ± 0.24 times faster than .\target\release\rust-scratch-2 func

That is, they're basically identical, with a lot of noise. Bumping the data size up by 100x:

$ hyperfine ".\target\release\rust-scratch-2 {N}" -L N dyn,func --warmup 2
Benchmark 1: .\target\release\rust-scratch-2 dyn
  Time (mean ± σ):     200.5 ms ±   3.9 ms    [User: 82.6 ms, System: 36.6 ms]
  Range (min … max):   196.4 ms … 209.3 ms    13 runs

Benchmark 2: .\target\release\rust-scratch-2 func
  Time (mean ± σ):     193.9 ms ±   4.8 ms    [User: 77.8 ms, System: 26.5 ms]
  Range (min … max):   186.8 ms … 203.7 ms    15 runs

Summary
  .\target\release\rust-scratch-2 func ran
    1.03 ± 0.03 times faster than .\target\release\rust-scratch-2 dyn

makes the noise go down, but they're still basically identical.
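For reference, the benchmark binary dispatches on its first argument, roughly like this (a sketch matching the hyperfine invocation above, not the exact code):

// Hypothetical harness: pick the implementation from argv[1] so each
// variant can be timed in isolation by hyperfine.
fn main() {
    let data = vec![11f32; 1_000_000]; // bumped 100x for the later runs
    let for_test = ForTest(data);
    let res = match std::env::args().nth(1).as_deref() {
        Some("dyn") => for_test.iter_all_dyn(),
        _ => for_test.iter_all_func(),
    };
    std::hint::black_box(res); // keep the result from being optimized away
}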


edit: I added some std::hint::black_box()s to prevent the inlining around the actual f in both cases (lifting the func case out as let f = std::hint::black_box(|i| self.iter_func_func(i))), and the performance difference is now statistically significant, though still small and about what I'd expect, at ~25%:

$ hyperfine ".\target\release\rust-scratch-2 {N}" -L N dyn,func --warmup 2
Benchmark 1: .\target\release\rust-scratch-2 dyn
  Time (mean ± σ):     263.0 ms ±   6.8 ms    [User: 113.0 ms, System: 35.0 ms]
  Range (min … max):   254.4 ms … 277.2 ms    11 runs

Benchmark 2: .\target\release\rust-scratch-2 func
  Time (mean ± σ):     208.9 ms ±   3.5 ms    [User: 85.3 ms, System: 27.4 ms]
  Range (min … max):   204.2 ms … 214.6 ms    14 runs

Summary
  .\target\release\rust-scratch-2 func ran
    1.26 ± 0.04 times faster than .\target\release\rust-scratch-2 dyn
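Concretely, the func side of that change looks like this (the same body as the original iter_all_func, with the closure lifted through black_box):

fn iter_all_func(&self) -> Vec<bool> {
    // black_box is opaque to the optimizer, so the call through `f` can
    // no longer be inlined, matching the opacity of the dyn version.
    let f = std::hint::black_box(|i| self.iter_func_func(i));
    let mut res = vec![];
    for i in 0..self.0.len() {
        res.push(f(i));
    }
    res
}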
6 Likes

Thanks for your detailed research process; it is difficult but really helpful for growing a deep understanding of Rust.
Is it inlining that makes the two methods perform at the same level? If so, can I say: if I can ensure a Box<dyn Fn()> is inlined, then I can be sure it has no impact on performance? I'm using Box<dyn Fn()> a lot, so I am trying to make sure none of them will slow down my project.

No, it's devirtualization.

Don't worry about it. Virtual calls are usually not the bottleneck. Only optimize after measuring.

10 Likes

It's important to compare like-for-like functionality: for example, a Box<dyn Fn()> is more general than a fn(Foo), because the latter has only one possible state parameter, Foo, while the former can capture anything.
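As a quick illustration of that difference (a minimal sketch):

fn over_ten(x: f32) -> bool { x > 10. }

fn main() {
  let f: fn(f32) -> bool = over_ten; // fine: no captured state
  let threshold = 10.;
  // let g: fn(f32) -> bool = move |x| x > threshold; // error: captures `threshold`
  let g: Box<dyn Fn(f32) -> bool> = Box::new(move |x| x > threshold); // works
  assert!(f(11.) && g(11.));
}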

As an example of a fairer comparison, if you have something like the classic geometry hit-test example:

type HitTest = Box<dyn Fn(f32, f32) -> bool>;

fn make_hit_circle(cx: f32, cy: f32, r: f32) -> HitTest {
  Box::new(move |x, y| {
    let dx = cx - x;
    let dy = cy - y;
    dx * dx + dy * dy < r * r
  })
}

fn make_hit_rect(minx: f32, miny: f32, maxx: f32, maxy: f32) -> HitTest {
  Box::new(move |x, y| minx <= x && miny <= y && x <= maxx && y <= maxy)
}
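For illustration, these might be used like this (hypothetical values):

let tests: Vec<HitTest> = vec![
  make_hit_circle(0., 0., 1.),
  make_hit_rect(-1., -1., 1., 1.),
];
let hit = tests.iter().any(|t| t(0.5, 0.5));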

You might be faster using enums instead:

enum Shape {
  Circle { cx: f32, cy: f32, r: f32 },
  Rect { minx: f32, miny: f32, maxx: f32, maxy: f32 },
}

// or in an impl Shape {}, it's exactly the same when compiled.
fn hit(shape: &Shape, x: f32, y: f32) -> bool {
  match shape {
    Shape::Circle { cx, cy, r } => {
      let (dx, dy) = (cx - x, cy - y);
      dx * dx + dy * dy < r * r
    }
    Shape::Rect { minx, miny, maxx, maxy } => {
      *minx <= x && *miny <= y && x <= *maxx && y <= *maxy
    }
  }
}

but now you need to know all of the shapes up front. And if these are stored in a Vec<Shape>, you might get even faster by storing the distinct types separately:

struct Circle { cx: f32, cy: f32, r: f32 }
struct Rect { minx: f32, miny: f32, maxx: f32, maxy: f32 }

struct Shapes {
  circles: Vec<Circle>,
  rects: Vec<Rect>,
}

fn hit_any(shapes: &Shapes, x: f32, y: f32) -> bool {
  for circle in &shapes.circles {
    let (dx, dy) = (circle.cx - x, circle.cy - y);
    if dx * dx + dy * dy < circle.r * circle.r {
      return true;
    }
  }
  for rect in &shapes.rects {
    if rect.minx <= x && rect.miny <= y && x <= rect.maxx && y <= rect.maxy {
      return true;
    }
  }
  false
}

This is because the arrays can now be stored more compactly, without a tag per item and without the space wasted in the Circle variant (it's smaller than the Rect variant), and also because the CPU is running the same code over and over without branching on the shape type, which means it can predict the result much better.
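As a rough illustration of the compactness point (exact sizes are layout-dependent, so take the comments as typical values, not guarantees):

fn main() {
  // Every Shape value pays for the largest variant plus a discriminant;
  // the split vectors store each type at exactly its own size.
  println!("Circle: {} bytes", std::mem::size_of::<Circle>()); // 12 (3 x f32)
  println!("Rect:   {} bytes", std::mem::size_of::<Rect>());   // 16 (4 x f32)
  println!("Shape:  {} bytes", std::mem::size_of::<Shape>());  // typically 20: payload + tag
}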

As a note, I wanted to verify this just to make sure I'm not repeating nonsense. I've been on a bit of a performance tear investigating this stuff today, and after far too much work generating a quarter-decent benchmark, I ended up with this result:

$ hyperfine ".\target\release\rust-scratch-2 {N}" -L N boxed,enum,soa --warmup 2
Benchmark 1: .\target\release\rust-scratch-2 boxed
  Time (mean ± σ):      3.310 s ±  0.036 s    [User: 1.262 s, System: 0.013 s]
  Range (min … max):    3.216 s …  3.345 s    10 runs

Benchmark 2: .\target\release\rust-scratch-2 enum
  Time (mean ± σ):      3.131 s ±  0.010 s    [User: 1.228 s, System: 0.009 s]
  Range (min … max):    3.109 s …  3.143 s    10 runs

Benchmark 3: .\target\release\rust-scratch-2 soa
  Time (mean ± σ):      2.927 s ±  0.007 s    [User: 0.981 s, System: 0.007 s]
  Range (min … max):    2.914 s …  2.936 s    10 runs

Summary
  .\target\release\rust-scratch-2 soa ran
    1.07 ± 0.00 times faster than .\target\release\rust-scratch-2 enum
    1.13 ± 0.01 times faster than .\target\release\rust-scratch-2 boxed

That is, using Box<dyn Fn()> is about 7% slower than using enum Shape {}, which is itself about 7% slower than the final "struct of arrays" pattern. The moral of the story is: don't worry too much about this. The performance difference for normal code is pretty tiny, and it is likely to be dwarfed by better architecture for your actual domain (e.g. here, using a spatial collection like a hash grid or quad-tree), which should give order-of-magnitude improvements.

4 Likes

As I mention in more detail in How to avoid passing generics parameters through the whole system? - #3 by scottmcm, the main cost of indirect dispatch -- whether dyn Fn or fn -- is that it blocks inlining, and thus keeps LLVM from removing redundancies.

Because of that, it's not really a big surprise to me that there's not that much of a difference between the two ways of doing virtual dispatch. Calling through a function pointer vs calling through a pointer to a function pointer just isn't going to be that different.
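In a sketch (hypothetical helpers, just to show the two call shapes):

// One indirect call each; the dyn version just does one extra load to
// fetch the function pointer out of the vtable before calling it.
fn call_fn_ptr(f: fn(i32) -> i32, x: i32) -> i32 {
    f(x) // call directly through the function pointer
}

fn call_dyn(f: &dyn Fn(i32) -> i32, x: i32) -> i32 {
    f(x) // load the fn pointer from the vtable, then call through it
}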

Also, Rust puts extra !invariant.load metadata on vtables (https://rust.godbolt.org/z/6PWM8ToaM), to help LLVM know that the vtable doesn't change, and thus it can optimize it extra-aggressively.

3 Likes