Can dynamic dispatch be optimized away?

It seems that the compiler is able to remove dynamic dispatch when calling a method on a trait object if it is called for only one type implementing this trait. This is what I infer from a benchmark I wrote here (using criterion).
Basically if you uncomment at least one of the commented code block the performance decrease a lot.
I am correct? Is there more to this?

Yeah this is depending on "const unfolding" (the dynamic vtable pointer could be replaced with a constant one for each call done on a specific type), and in this instance, on whether call_dyn gets inlined or not; you may want to play with that


Can it? Yes of course. When is it? This is up the whims of LLVM, which has many optimizations that may or may not apply in each situation for complicated reasons. Honestly in this case I would have guessed that it is optimized away no matter how many impls you have since the function is so small, and hence likely inlined.


It's always possible to de-virtualize indirect calls and turn them into jumps by basically matching on the vtable pointer itself:

if vtable_ptr == &VTABLE_FOR_MyTrait_String {
    <String as MyTrait>::foo(instance_ptr);
} else if vtable_ptr == &VTABLE_FOR_MyTrait_u32 {
    <u32 as MyTrait>::foo(instance_ptr);
} else if … {
} else {
    // fall back to dynamic dispatch if no known type matched

And then the concrete, specialized functions can even be inlined, and branch prediction can kick in, or dead code elimination can remove the code altogether, and the whole call can then be as efficient as if it were dispatched statically.

Don't count on it though – optimization is hard, and it usually involves ad-hoc heuristic. This means that as little as adding a single line to a function can cause an arbitrarily-specified threshold inside the optimizer to be crossed. The compiler then sometimes just throws its hands up in the air and goes "nobody ain't no time for that", and ceases to optimize.


I'm not sure if it's always possible to enumerate all vtables at compile-time, since the crate being compiled might not always know a full set of trait implementations.

To clarify, I meant that it can always be done as a "best effort" solution. The final else arm would just fall back to actual dynamic dispatch.


This is pretty much what you do in Profile Guided Optimisation. You run a special version of your binary which records information about the functions being called, then the compiler can use that information to generate vtable_ptr checks for all the really common cases, falling back to dynamic dispatch otherwise.