Compilation model for Rust async function calls across crates?

My understanding is that in Rust, coroutine frames are stack-allocated by default. In other words, when you call an async function foo without awaiting it yet, you create a temporary on the stack sized to hold the local variables of foo and of any other async functions it calls, plus the data needed to track which await point each one is currently frozen at, transitively all the way down.

In order to do this, the compiler needs to know how much space the local variables take up at the call site. This is different from a regular (sync) function, where making a call only requires knowing the signature (so the arguments can be passed in the right registers according to the calling convention/ABI) and the address to jump to. This means that in languages like C/C++, the function signature that goes in the header is sufficient information to begin compiling a call to a sync function, even if the sync function itself hasn't been compiled yet. So two libraries (two distinct compilation units) can be compiled in parallel even if they depend on one another and call each other.

But in Rust, it seems like the amount of space needed for locals could vary based on optimization passes (e.g. eliminating unused local variables, which shrinks the required frame size). This means Rust would really need the callee to be completely compiled before compiling the caller. And typically in Rust crates are the compilation unit boundary (assuming codegen-units=1), so this would mean that if crate A calls an async function in crate B, crate B must be compiled first, not in parallel with A. Is this correct?


This is true, but I think it is not a good way to think about it because it’s not derived from the actual fundamentals of what is happening. Instead, think about it this way:

  • async fns are syntax sugar for fns that return values of anonymous Future types.
    • (When generators and other kinds of coroutines are supported, the same thing will happen except for using a different trait.)
  • This anonymous Future type is the “coroutine frame”, and local variables of the coroutine are the fields of that type.
  • When a large value is returned from a function, the default outcome (in the absence of any optimization that changes this) is that it is placed in the caller’s stack frame.
  • Therefore, Futures that are coroutines typically end up in the caller’s stack frame temporarily.

There is no special answer to “where are future/coroutine values put”; it is exactly the same as any other value in Rust.

In order to do this the compiler needs to know how much space the local variables take up at the call site.

I would say rather: the compiler needs to know what the return type of the function is. Once it knows the type, it will know the size of values of that type. The fact that the return type is an anonymous Future type doesn’t make the process different from compiling other function calls.

And typically in Rust crates are the compilation unit boundary (assuming codegen-units=1), so this would mean that if crate A calls an async function in crate B, crate B must be compiled first, not in parallel with A. Is this correct?

In order for the function in A to call the function in B, the return type of the function in B must be available. However, cargo+rustc supports pipelined compilation: the compilation of A may start once the compilation of B has emitted its metadata (information about types, traits, and other items), and need not wait for B to finish code generation. So, even with codegen-units=1, you may achieve quite significant parallelism between A and B (depending on how long the codegen phase for B is relative to the whole compilation of A).

The command cargo build --timings will generate a report with a dependency graph that illustrates this parallelism.

All of this is true regardless of whether the functions are async or not. The only difference is that if they were not async (or if they returned a type-erased, heap-allocated Box<dyn Future> of a fixed size), then you could choose to declare them as extern functions, not declare any dependency between A and B, and just rely on getting the signature right in both crates. Then you've achieved complete parallelism between A and B by using the C paradigm for compilation and linking (each compilation unit requires pre-written “header” information for the others) instead of the Rust one. However, this is not usually done, or practical to do, since it prevents using many Rust features, like generics.


I appreciate the first-principles approach to how to think about it; it makes perfect sense that async fn return values are not special: they're just following how returns of other types work.

But how does this deal with the fact that the size can't be known until after codegen? Surely how much stack space is required actually depends on whether the backend detects and eliminates unused locals, or reuses space for locals whose lifetimes don't overlap? In other words, the size of the compiler-defined future type can't be known until after codegen, AFAICT.

But how does this deal with the fact that the size can't be known until after codegen?

The codegen backend doesn’t get involved in choosing which variables need to go in a coroutine type at all. And yes, this means that it works less well than ordinary stack-variable allocation, because the optimizations applied in those earlier stages are neither as numerous nor as powerful as those in the backend. In practice, if you have a let variable in an async fn, it's probably going to show up in the generated future type even if it doesn't have to. For example:

```rust
async fn foo() {
    let x = 1; // this is in the coroutine...
    println!("{x}");
    std::future::ready(()).await; // ...because it is carried across this await
    println!("{x}");
}

fn main() {
    // size = 8 (1 discriminant byte + 1 i32, with padding), even though
    // 1 byte is all that is strictly needed
    println!("{}", std::mem::size_of_val(&foo()));
}
```

This means that it can be a useful manual optimization to move code out of async fns into ordinary fns (when it is the code between two awaits), to avoid the caveats of the current implementation of the async transform. (There are also cases that aren’t compiler deficiencies but are required by the language: for example, dropping a large value earlier, whether by drop() or by an enclosing {} block, can be particularly useful in async code, because everything else has to be dropped at the end of scope by the rules of the language, and the end of scope of an async fn may come much later than that of a fn. Such early drops also affect whether the future is Send.)


Disclaimer: I don't have expert knowledge of how compiler-generated futures are produced; my understanding is based on RFCs, reading compiler issues, and the like.

You can consider the future to be a compiler-generated enum, where the variants are the states of a state machine, and each variant captures the values it requires as fields. Then the answer is mostly the same as "how does any other enum fit": it's a layout optimization problem.

There are at least two caveats. The first is that this really just moves the codegen optimization to before the struct definition: you want to minimize captures, perhaps allow richer state flow than a plain enum, etc. But this is an optimization to decrease size, not a necessary step in order to have some size at all.

The second, though, is IMO what your question is really about:

It is true that the notional enum depends on the function body, not the function signature. That is also true of all Rust functions that return impl Trait (which is also what async fn does "under the hood"). And this does mean certain functions need to be analyzed to some extent before their callers can be fully compiled.

Incidentally, it's not just the size that "leaks": auto traits like Send, Sync, and Unpin also leak, and can be relied upon by the caller. So other things, like trait-bound checks, also need to know the type defined by the function body.

What I've written is still not the whole story. Because an async fn captures the state of the entire function body, it can't handle recursive definitions (the type would become infinitely sized, even though the future at runtime would not be). But you can sometimes work around this by writing a non-async fn(..) -> impl Future<..> with more fine-grained async {} blocks.

Edit: see the next comment; the compiler has gotten smarter in this area.

This can happen with -> impl Trait too, but takes more gymnastics so is probably less common.

A playground demonstrating some recursions and see also.


You don't need to resort to non-async fn or type erasure any more as of Rust 1.77; the compiler is now capable of type-checking recursive async fns. You only need to introduce allocation/indirection to prevent the actual infinite size: Box::pin(foo(false)).await in your example code.
