Can someone explain how runtimes for async and await work on an assembly level?

Hello,

I am following the chapter Futures and the Async Syntax in the book The Rust Programming Language
and have a question about runtimes:
when async or await is used, the code starts a procedure that does not require CPU time, and the compiler needs to emit instructions to manage this procedure. What is meant by the state of the async block? What does it encompass? And how is the result of the procedure returned at the assembly level? Thanks!

Each await point—that is, every place where the code uses the await keyword—represents a place where control is handed back to the runtime. To make that work, Rust needs to keep track of the state involved in the async block so that the runtime can kick off some other work and then come back when it’s ready to try advancing the first one again. This is an invisible state machine, as if you’d written an enum like this to save the current state at each await point:
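The enum itself isn't reproduced above, so here is a hedged sketch of the kind of enum the quote means. Every name below is invented (RequestFuture and Response are placeholder types, not what the book or the compiler actually generates); the point is just that there is one variant per await point, holding the locals that must survive suspension there:

```rust
// Placeholder types standing in for the awaited future and its result.
struct RequestFuture;
struct Response;

enum ExampleStateMachine {
    // not polled yet: only the original inputs exist
    Start { url: String },
    // suspended at the first `.await`: the in-flight request lives here
    WaitingOnRequest { url: String, request: RequestFuture },
    // suspended at the second `.await`
    WaitingOnBody { response: Response },
    // finished
    Done,
}
```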

The state of an async block is basically all the variables you create and capture in it. This data needs to be stored as part of the future, because when you hit an .await point, the execution of the future might get suspended. If the future gets suspended and then continues execution at a later point, you need access to that state, i.e. it can't have been dropped in the time the future was suspended.
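As a minimal illustration (the function names are invented for this example), a local created before an `.await` and used after it has to live inside the future's state rather than on the stack:

```rust
// `some_other_future` stands in for any awaited operation.
async fn some_other_future() {}

async fn example() -> usize {
    let data = vec![1, 2, 3];  // created before the await point
    some_other_future().await; // the future may be suspended (and the stack frame gone) here...
    data.len()                 // ...so `data` must be kept inside the future's state
}
```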

Here's a good write-up, IMO, on how futures work under the hood:


thanks!

It's long and very detailed, but 99% of the time a simple "an async function is a function with a call frame not on the stack" is sufficient.

At least if your goal is a rough understanding of how the result of the procedure is returned at the assembly level.

Start with normal code: a function with a stack, a stack pointer and, importantly, a frame pointer. How can we turn that function into a stoppable and resumable coroutine?

Well… it's easy, at least conceptually: just rip the stack frame out and put it somewhere… not on the stack, just in some region of memory… and call that region a “future”. Bam: done, now you have a function that can be called later, stopped and resumed. In addition to our “future” (the former “function stack frame”) you only need a pointer to the “currently executing instruction” alongside that “future” (or, perhaps, as a slot in that future).
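A rough sketch of that idea, assuming a made-up function with two locals (this is a hand-written stand-in for illustration, not what rustc actually emits):

```rust
// Where the function could be resumed from.
enum ResumePoint {
    Start,
    AfterFirstAwait,
    Finished,
}

// The function's former stack slots become fields of the "future",
// plus one slot recording where to resume.
struct TornOutFrame {
    counter: u32,           // former local variable
    buffer: Vec<u8>,        // former local variable
    resume_at: ResumePoint, // the "currently executing instruction" slot
}
```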

When you hit an await, the “currently executing instruction” is recorded and execution is transferred to the “executor”; when the executor wants to resume the execution, it can just jump to that address.
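To make the hand-off concrete, here is a minimal busy-polling “executor” sketch. It leans on the futures crate's no-op waker only to satisfy the API; a real executor would park until woken instead of spinning:

```rust
use std::future::Future;
use std::pin::pin;
use std::task::{Context, Poll};

/// Keeps resuming the future until it reports Ready.
fn busy_block_on<F: Future>(fut: F) -> F::Output {
    let waker = futures::task::noop_waker(); // no real wake-up plumbing needed here
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        // every call to `poll` "jumps back in" at the recorded resume point
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(out) => return out,
            Poll::Pending => continue, // spin; a real executor would wait for a wake-up
        }
    }
}
```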

Of course, to have actually usable async you need many more details. You need some way of notifying the executor about the need to “wake” some coroutine (otherwise your only hope is to poll them as often as you can to see if, maybe, they would do some useful work… not very useful if your goal is a working program and not a replacement for a space heater). You need some low-level API that allows you to ask the OS to do work asynchronously (some OSes have the required mechanisms, but most only have a handful of useful syscalls, mostly networking-related, which means in practice async is much less useful than people think). And so on (there are lots of details; read the article if you want to know about them).
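Here is a hand-rolled sketch of that wake-up plumbing, loosely modelled on the classic timer example (all names invented): the future stores the Waker it was handed, returns Pending, and a helper thread calls wake later so the executor knows this coroutine is worth polling again.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Waker};
use std::thread;
use std::time::Duration;

/// Completes after `dur`; a helper thread flips the flag and wakes the executor.
struct SleepFuture {
    shared: Arc<Mutex<(bool, Option<Waker>)>>, // (completed, stored waker)
}

impl SleepFuture {
    fn new(dur: Duration) -> Self {
        let shared = Arc::new(Mutex::new((false, None::<Waker>)));
        let for_thread = Arc::clone(&shared);
        thread::spawn(move || {
            thread::sleep(dur);
            let mut guard = for_thread.lock().unwrap();
            guard.0 = true;
            // notify the executor that this future can make progress now
            if let Some(waker) = guard.1.take() {
                waker.wake();
            }
        });
        SleepFuture { shared }
    }
}

impl Future for SleepFuture {
    type Output = ();

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        let mut guard = self.shared.lock().unwrap();
        if guard.0 {
            Poll::Ready(())
        } else {
            // remember how to notify the executor, then suspend
            guard.1 = Some(cx.waker().clone());
            Poll::Pending
        }
    }
}
```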

But the core idea is just that simple: “rip out” the call frame of the function from the stack, put it in the “future” and add a “currently executing instruction” slot to it.

P.S. I just wish they would make coroutines without all that async machinery available on stable at some point. Because most OSes are not asynchronous, pretending that everything around you is asynchronous just adds complications without many real benefits… but alas, async is currently a hot buzzword, while coroutines were “hot” decades ago. That's why we got async first, and the much more useful (and simpler!) coroutine mechanism without all that extra baggage is still unstable.

This is not how Rust's async works at the assembly level, though. What you're describing is how a stackful coroutine is implemented, but Rust's async is a stackless coroutine.


No. I'm describing precisely how stackless coroutines work. They don't use the stack; there's only one “torn out frame”, and that's why a stackless coroutine can't call another stackless coroutine.

I would say it's more of a stackless coroutine (so very close to what I describe) plus a clever trick that embeds one stackless coroutine into another stackless coroutine to create an illusion of a stackful coroutine: when one async function calls another, Rust combines them into a bigger coroutine, thus creating an illusion that there are stackful coroutines in play.
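A hand-written stand-in for what that embedding looks like (purely illustrative, these are not the compiler's real types): when the outer async function awaits the inner one, the outer state simply contains the inner state as a field, so one flat object covers the whole “call stack”.

```rust
struct InnerState {
    step: u32,         // inner's locals and its resume point, flattened here
}

struct OuterState {
    inner: InnerState, // inner's entire "frame" is embedded while it is awaited
    resume_at: u8,     // outer's own resume point
}
```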

That's why there is no recursion with async, for example (without special tricks that explicitly introduce yet another, separate future).
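One such special trick, as a short example: boxing the recursive call so it becomes a separate, heap-allocated future instead of being embedded in (and infinitely growing) the caller's own state.

```rust
use std::future::Future;
use std::pin::Pin;

// A recursive "async fn" has to return an explicitly boxed future; otherwise the
// outer state machine would have to embed itself and its size would be infinite.
fn countdown(n: u32) -> Pin<Box<dyn Future<Output = ()>>> {
    Box::pin(async move {
        if n > 0 {
            countdown(n - 1).await; // recursion via a separate, boxed future
        }
    })
}
```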

If what you said is right (the post above yours disagrees with you), your explanation is sufficient for me to understand how async works.

From what I understand there were some issues with implementing the approach I described exactly as I described it (some LLVM limitations, apparently), but I don't really know of anything that a compiler doing what I described would do significantly differently from the compiler that exists today (with the obvious difference that the generated code would be slightly different, but with the same observable effects).

If you read the description (and code) in the article that @jofas linked, you'll see that the main difference lies in the fact that, instead of storing the program counter directly, Rust stores the ID of the basic block that called await and has a jump table in the encoded state machine. I have no idea why it was done in that less efficient way, but it acts exactly like a coroutine created by “tearing out” a normal call frame from the stack would act.
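A hand-written approximation of that shape (hedged: real compiler output differs in detail, and all names here are invented): the enum discriminant plays the role of the stored “basic block ID”, and poll dispatches on it with a match, which is the “jump table”.

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

// Each variant is a "basic block ID" plus the locals live at that point;
// there is no saved instruction pointer anywhere.
enum Countdown {
    Start { remaining: u32 },
    Suspended { remaining: u32 },
    Done,
}

impl Future for Countdown {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        // dispatch on the stored state ID instead of jumping to a saved address
        match *self {
            Countdown::Start { remaining } | Countdown::Suspended { remaining }
                if remaining > 0 =>
            {
                *self = Countdown::Suspended { remaining: remaining - 1 };
                cx.waker().wake_by_ref(); // ask to be resumed again
                Poll::Pending
            }
            _ => {
                *self = Countdown::Done;
                Poll::Ready(())
            }
        }
    }
}
```

Driving `Countdown::Start { remaining: 3 }` with any executor polls it a few times, advancing the stored state each time, until it reports Ready.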

Maybe they were concerned that storing an arbitrary address and then jumping to it would be problematic WRT CFI, or maybe they were afraid of anti-virus software, or maybe LLVM is just not flexible enough to generate something like that (perhaps it's impossible to force it to never assume that the stack pointer also points to the current stack frame?)… or maybe they needed to do what they did to be able to do the aforementioned “merging of many stackless coroutines into one big coroutine to emulate stackful coroutines”… IDK the exact reason… but the end result is: instead of “simply tearing the frame out of the stack”… they do exactly the same thing, but manually.


sounds convincing :slight_smile: