Code size and compilation time tips for async?

I've read broad claims that async fns are much more expensive to compile than fns. This makes sense because they must do work at least equivalent to moving all their arguments into a closure and returning the closure (that is, returning a not-yet-started Future). I've even found a substantial improvement in code size by de-asyncifying a set of functions that turned out to never have any need for it.

But in general, are there noteworthy code-size/compile-performance techniques for async fns?

I've thought of some dubious ideas:

  • Would making async fns call inner fns to do non-async parts of the work improve anything over the same code inline? (I'd hope the compiler can see this simple case itself.)
  • I could watch carefully for the pattern async { doesnt_need_awaiting() } and make sure to replace it with std::future::ready(doesnt_need_awaiting()), where the change in evaluation order is acceptable. But that doesn't come up very often.

Are there any better ideas? Articles? What's the min-sized-rust of async?


I would expect a lint to exist for this. In the JS world, eslint will complain if you use an async function without awaiting anything.

There is redundant_async_block, but that only covers async { foo().await }, and unused_async, but that only covers async fn. Note that the semantics of the two cases I showed are different: they differ in when doesnt_need_awaiting() is called (and potentially on which thread).

Also, let's please not dig too far into that. I'm looking for tricks that I haven't already thought of (and yes, I've looked at all the available Clippy lints).


You're probably doing all of this already, but just in case.

  1. I try to avoid mixing generic code & async. So I will do things like:
    fn spawn_local(f: Pin<Box<dyn Future<Output = ()>>>) { /* ... */ }

Yes, sure, there is an extra heap allocation and a tiny performance hit. However, I don't mind paying this every time I spawn a task.

  2. I believe async fns get transformed into state machines (based on .await) and then represented as an enum. This sounds like a transform whose running time is at least linear in both (1) # of .awaits and (2) size of function. I can't do anything about the # of .awaits, but I can reduce the overall function size by refactoring all the non-await code out into separate functions. I think this is similar to (but perhaps more explicit than) your inner-fn approach above.

  3. I never explicitly benchmarked this, but IIRC, compile times reported by cargo build --timings improved by roughly 50% as I reduced async usage.

  4. I try very hard to keep all async-await at the "top level", i.e., instead of letting async recursively pollute functions, I do something like:

spawn_local(Box::pin(async move {
  loop {
    let inputs = blah.await;
    let outputs = non_async_process(inputs);
    // ...
  }
}));

I don't know a nicer way to state this than "don't let async pollute everything; try to 'push up' async to the top".


If you take a look at the current implementation of wasm_bindgen_futures::spawn_local, it is:

pub fn spawn_local<F>(future: F)
where
    F: Future<Output = ()> + 'static,

so they already box the future immediately (in general, any async executor's spawn must do something like that to type-erase the future), and your wrapper adds another Box without significantly reducing the amount of monomorphized code.

I can't do anything about the # of .awaits, but I can reduce the overall function size by refactoring out all the non-await function into separate functions.

That's one of the ideas I mentioned in my original post. I want to know whether it actually helps in practice; it seems like a trivial transformation that the compiler should see through however you write the code. (The compiler does already analyze which let variables need to live in the future; it has to, or "non-Send value is used across an await" errors would instead be "value is used in this future" errors, and future types would be quite unnecessarily large.)


Unfortunately it is not so trivial: Tracking issue for more precise coroutine captures · Issue #69663 · rust-lang/rust · GitHub


I'm conceding this point. In light of this evidence, my 'optimization' is pretty silly.

I wrap async calls on less-hot paths in Box::pin. This prevents the callee's state from being inlined into the parent Future, which reduces the size of the state machine. Smaller Futures need less stack space and less copying, and probably generate smaller code for their poll methods, although I haven't checked whether that's a net positive for code size given the extra allocation code.

// lots of code
let result = Box::pin(async_call()).await;
// lots more code

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.