Monomorphization vs Dynamic Dispatch

Going through the book that @steveklabnik himself mentioned, I stumbled upon its explanations of monomorphization and dynamic dispatch, as follows:

Monomorphization also comes at a cost: all those instantiations of your type need to be compiled separately, which can increase compile time if the compiler cannot optimize them away. Each monomorphized function also results in its own chunk of machine code, which can make your program larger. And because instructions aren’t shared between different instantiations of a generic type’s methods, the CPU’s instruction cache is also less effective as it now needs to hold multiple copies of effectively the same instructions.

and

Dynamic dispatch cuts compile times, since it’s no longer necessary to compile multiple copies of types and methods, and it can improve the efficiency of your CPU instruction cache. However, it also prevents the compiler from optimizing for the specific types that are used. With dynamic dispatch, all the compiler can do ... is insert a call to the function through the vtable—it can no longer perform any additional optimizations as it does not know what code will sit on the other side of that function call. Furthermore, every method call on a trait object requires a lookup in the vtable, which adds a small amount of overhead over calling the method directly.

concluding with:

Dynamic dispatch often allows you to write cleaner code that leaves out generic parameters and will compile more quickly, all at a (usually) marginal performance cost, so it’s usually the better choice for binaries.

My question is: just how marginal is that performance cost in reality?
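For concreteness, the two shapes of code being compared look roughly like this (a minimal sketch; Shape and Circle are made-up names, not from the book):

trait Shape {
    fn area(&self) -> f64;
}

struct Circle { r: f64 }

impl Shape for Circle {
    fn area(&self) -> f64 {
        std::f64::consts::PI * self.r * self.r
    }
}

// Static dispatch: monomorphized, so a separate copy of this function is
// generated for each concrete T, and the call to area() can be inlined.
fn print_area_static<T: Shape>(shape: &T) {
    println!("{}", shape.area());
}

// Dynamic dispatch: a single copy of this function; area() is looked up
// through the vtable at runtime.
fn print_area_dyn(shape: &dyn Shape) {
    println!("{}", shape.area());
}

fn main() {
    let c = Circle { r: 1.0 };
    print_area_static(&c);
    print_area_dyn(&c);
}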

3 Likes

I have spent most of the day thinking about this, thanks for the quotes. My conclusion is to use dyn dispatch unless the dyn dispatch calls (and any associated heap allocations) are frequent and there is a real benefit from the optimisations that monomorphization offers. I guess you also have to consider how many monomorphized copies will be made. So make pragmatic engineering decisions.

Here's someone who experimented with it. Their results are probably worst-case in a sense, as almost all the program does is dispatch. If you're dispatching in a hot loop, it's going to matter a lot more than dispatching occasionally into methods that do a lot of work, need to block, etc.

But when it comes to optimization, pretty much every suggestion has an asterisk on it: Always measure your specific use case to be sure. There are also ways to mitigate some of the general trade-offs.


For example, if you have something like

fn f<T: AsRef<str>>(s: T) {
    let s = s.as_ref();
    // ...rest of body...
}

This may result in many copies of f. Or, LLVM may be smart enough to deduplicate the rest of the function body, and you'll have some sort of switch at the top with the body following. Or, it might be smart enough to do the transformation before calling the function, so you only have one instance of f. And if LLVM isn't smart enough, you can do it yourself in cases like these (put the rest of the body in a non-generic function and call it from f).
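For illustration, that manual version might look something like this (f_inner is just a hypothetical name for the non-generic helper):

fn f<T: AsRef<str>>(s: T) {
    // Thin generic shim: only this conversion is monomorphized per T.
    f_inner(s.as_ref());
}

// Non-generic: a single copy of the real body gets compiled, no matter
// how many types f is instantiated with.
fn f_inner(s: &str) {
    // ...rest of body...
}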

Similar caveats apply to dynamic dispatch; sometimes compilers are smart enough to "devirtualize" calls and see through a vtable. Or, as in that article, remove some of the indirection of vtables in some circumstances. (And they had to work around some optimizations just to get compiler output that measured what they wanted to measure.)

8 Likes

Thanks for the link! Although I do understand it's not representative of all the cases, these numbers at least provide some general guidelines:

The position of the decision, which implementation to use, has a significant impact on performance: If the decision is placed outside the loop, we observed an increase in runtime by factor 1.2, if it is placed in the loop, the increase was of factor 3.4.

Another rule of thumb seems to be: avoid dynamic dispatch when iterating over a sequence of objects if you can - that's definitely good to know.
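As a rough sketch of what "decision outside vs inside the loop" can look like (sum_outside and sum_inside are made-up functions, not from the article):

// The choice is made once, outside the loop: each branch is a fully
// static loop that the optimizer can inline and vectorize.
fn sum_outside(items: &[i32], square: bool) -> i64 {
    if square {
        items.iter().map(|&x| (x as i64) * (x as i64)).sum()
    } else {
        items.iter().map(|&x| x as i64).sum()
    }
}

// The choice is hidden behind a trait object, so every element pays
// for an indirect call inside the loop.
fn sum_inside(items: &[i32], op: &dyn Fn(i32) -> i64) -> i64 {
    items.iter().map(|&x| op(x)).sum()
}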

And of course, measuring the specific case is always the best option when it's needed.

Easier said than done at times, isn't it?

1 Like

It's a super-difficult question to answer in general, because most of the runtime cost is not actually in the dynamic call itself, but in the opportunity cost of other things. (Especially since, if in practice the call usually goes to the same place, modern desktop branch predictors will make it essentially the same as a static call.)

Basically, it's that it hides context. If that context was important, it can matter. If it wasn't important, then it doesn't matter.

Let's imagine an image processing application for an example:

  • If you did a dyn call to read every pixel, it'd probably matter -- both because the cost of reading the pixel is low already, and also because it would likely prevent optimizations from removing or simplifying bounds checks based on context.
  • But you probably should do a dyn call for each filter -- blur, edge detect, etc. Because there the context isn't important, being able to compile them separately is helpful, and registering/dispatching them based on something like a HashMap<String, Box<dyn Filter>> is quite handy.

So like threading or allocation or many other things, it's all about knowing what your granularity is. You don't want to read every byte of a big file through a dyn Iterator<Item = u8>, but you also don't want to bother trying to statically-dispatch every single endpoint in a web server.
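For instance, the filter-registry idea above might look roughly like this (Filter and Blur are hypothetical names, just to illustrate the granularity):

use std::collections::HashMap;

// Hypothetical Filter trait: the dyn call happens once per filter run,
// not once per pixel.
trait Filter {
    fn apply(&self, image: &mut [u8]);
}

struct Blur;

impl Filter for Blur {
    fn apply(&self, image: &mut [u8]) {
        // Heavy per-image work, statically compiled inside this impl.
        for px in image.iter_mut() {
            *px = px.wrapping_add(1); // placeholder for real blur math
        }
    }
}

fn main() {
    let mut filters: HashMap<String, Box<dyn Filter>> = HashMap::new();
    filters.insert("blur".to_string(), Box::new(Blur));

    let mut image = vec![0u8; 1024];
    if let Some(f) = filters.get("blur") {
        f.apply(&mut image); // one vtable lookup for the whole image
    }
}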

8 Likes

My take would be to use whichever is easier to code, unless you're building a library crate, in which case always use static dispatch, because you have no way of knowing whether your code may someday be performance-critical to someone.

1 Like

It's hard to generalize. There are also cases where monomorphisation reduces code size:

  • a generic call to a simple getter, like vec.len(), is going to optimize down to a single field read, and that's less code than a dyn invocation (see the sketch after this list).

  • if a generic function is called only once, it's just going to be a regular function call, and the compiler will be able to eliminate unused/constant parts of that function. It's going to be the same or smaller than the same function invoked through a layer of abstraction.
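A rough illustration of the first point (Len is a made-up trait; the actual code size depends on what the optimizer can inline):

// Monomorphized: the call to len() on a concrete Vec typically inlines
// to a single field read at the call site.
fn static_len<T>(v: &Vec<T>) -> usize {
    v.len()
}

// Through a trait object: the call site only sees a vtable pointer, so
// it has to emit an indirect call instead of a field read.
trait Len {
    fn length(&self) -> usize;
}

impl<T> Len for Vec<T> {
    fn length(&self) -> usize {
        self.len()
    }
}

fn dyn_len(v: &dyn Len) -> usize {
    v.length()
}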

There are double-edged interactions when using dyn and Future.

  • (impl Future).await inlines the awaited future into the outer future that is awaiting it. The upside is that the state machines of both are likely to be inlined and merged together, which usually makes poll cheaper and use less stack space. The downside is that the Future object itself is going to be larger, since it carries the combined state of both. If you accidentally make one combined Future for your entire program, it's going to be hugely bloated and may even cause a stack overflow at the moment of spawning it (not polling).

  • (dyn Future).await typically makes the outer awaiting Future struct smaller compared to (impl Future).await, because it stores only a Box pointer in its state rather than merging in all of the awaited Future's state. OTOH, the lack of state merging means the awaited future's poll won't be inlined. This increases stack usage during poll, and makes the cost of polling O(n^2), where n is the number of nested dyn Futures awaited.
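A minimal sketch of the size difference (not from the post; the futures are created but never polled, so no runtime is needed):

use std::future::Future;
use std::pin::Pin;

async fn inner() -> u32 {
    42
}

// impl Future: inner's state machine is merged into this future,
// so this future's size includes inner's state.
async fn awaits_impl() -> u32 {
    inner().await
}

// dyn Future: only a Box pointer is stored in this future's state,
// but every poll of the inner future goes through the vtable.
async fn awaits_dyn() -> u32 {
    let boxed: Pin<Box<dyn Future<Output = u32>>> = Box::pin(inner());
    boxed.await
}

fn main() {
    // Compare the sizes of the two outer futures.
    println!("impl: {} bytes", std::mem::size_of_val(&awaits_impl()));
    println!("dyn:  {} bytes", std::mem::size_of_val(&awaits_dyn()));
}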

8 Likes

You can use cargo-bloat to see if generic instances of functions add up to a problematic size.
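For example (assuming cargo-bloat is installed):

cargo bloat --release            # largest functions in the release binary
cargo bloat --release --crates   # size contribution per crate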

2 Likes
