Closure `impl Fn` performance

I'm aware that function pointers with dynamic dispatch can reduce the performance of programs. Whilst the details of this are processor-specific, it seems generally agreed that these calls usually incur at least some extra cost.

Conversely, Rust's (relatively) new impl Fn(...) return types seem to be implemented using static dispatch, so I'm interested in whether they incur the same cost (or indeed any cost at all compared to regular function calls). As a simple but contrived example, let's say I use a function that returns a closure...

fn foo(x: i32) -> impl Fn(i32) -> i32 {
    move |y: i32| {
        x + y
    }
}

fn main() {
    println!("{}", foo(3)(4));
}

...will I suffer worse performance than doing something like...

fn bar(x: i32, y: i32) -> i32 {
    x + y
}

fn main() {
    println!("{}", bar(3, 4));
}

...and if not, does this apply in the general case? Thanks!

impl Fn is essentially the same as a generic T: Fn() parameter, except that it can also appear in return position. So we can look at the general case of T: Fn() when thinking about this kind of thing.

There are four types of non-mutating multiple-call function pointers/closures:

  • dyn Fn(): you seem to have already discovered that this is the least efficient of the four, since there is the runtime cost of dynamic dispatch. This happens because dyn hides everything about the underlying type, exposing only the appropriate vtable (the list of associated functions for that type).
  • fn() is a function pointer. This obscures the underlying function from the optimizer, so the call sometimes can't be optimized away.
  • fn() { <name> } is the unique type of the specific function named <name>. This lets the compiler optimize as if you'd called the function directly, without any function-pointer indirection.
  • Closures with state. These are tricky, so I'll point you to @KrishnaSannasi's great article about closures.

Since these all implement Fn, each gets optimized differently depending on how it's used, meaning that using impl Fn doesn't really change the way the optimizer behaves, since it only ever observes these four cases.


Short answer: no worse performance.

I'm not sure there is a general case. If you intend to call a closure multiple times, it can even be quicker, for example if the argument captured once by the closure would otherwise have to pass through validation on every call in the plain-function version.


The constant-folded case is probably not so helpful to compare.

But with non-constant inputs it looks to be the same.

BTW, I had to invert the argument order in f, otherwise the compiler noticed that the functions are completely identical and optimized f out completely! :slight_smile:

The impl Fn version of foo() essentially becomes

/// `Fn(i32) -> i32` unsugared
trait MyFn {
    fn call (self: &'_ Self, _: i32) -> i32;
}

fn foo (x: i32) -> impl MyFn
{
    return Captured { x };
    // where
    struct Captured { x: i32 }
    impl MyFn for Captured {
        fn call (self: &'_ Self, y: i32) -> i32
        {
            self.x + y
        }
    }
}

fn main ()
{
    println!("{}", foo(3).call(4));
}

Except that with the real Fn traits the method is not invoked as .call(...) but simply as (...), since that's the sugar the Fn traits provide.

And now, the question is whether the code using MyFn and the code that computes the sum directly are "equivalent":

First, if we look at the foo() function, it takes an i32 and returns a struct whose single field contains that input, and nothing more. So foo() is just a "cast", effectively a no-op.

Similarly, if we look at the .call() function, it is really just the .add() function for an i32 (since Self ≈ i32):

use ::core::{convert::identity, ops::Add};

fn main ()
{
    //                    foo         call
    //             vvvvvvvvvvvvvvv    vvv
    println!("{}", identity::<i32>(3).add(4));
    // i.e., after inlining no-op functions:
    println!("{}", 3 + 4);
}

So with a minimum amount of optimization (an unoptimized build being the counterexample), the "no-op" functions can be inlined and thus skipped, making both pieces of code equivalent.


This is all really interesting, thank you for all the insight!