Performance of FuturesUnordered vs. spawning in tokio

To execute futures concurrently, I can either use tokio::spawn or FuturesUnordered. My question is: how do these two compare in performance?

It seems that spawning is more expensive since it creates more tasks, but I also found this issue describing some footguns with FuturesUnordered.
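
Roughly, the two approaches I have in mind (with `do_work` as a placeholder for real work; this assumes the `futures` and `tokio` crates):

```rust
use futures::stream::{FuturesUnordered, StreamExt};

// Placeholder async job standing in for real work.
async fn do_work(i: u32) -> u32 {
    i * 2
}

#[tokio::main]
async fn main() {
    // Option 1: one Tokio task per future.
    let handles: Vec<_> = (0..100u32).map(|i| tokio::spawn(do_work(i))).collect();
    for handle in handles {
        let _ = handle.await;
    }

    // Option 2: drive all the futures concurrently inside the current task.
    let mut futs: FuturesUnordered<_> = (0..100u32).map(do_work).collect();
    while let Some(_result) = futs.next().await {}
}
```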

1 Like

The internals of FuturesUnordered are relatively similar to the internals of Tokio, and the cost of spawning on each is not that different.

2 Likes

This is really interesting. I've been wondering what scheduling policies Tokio and other async runtimes in Rust use. Could you describe these policies in more detail?

Are futures polled in random order? FIFO order? No guaranteed order? What thought has gone into these policies so far?

1 Like

Tokio does not guarantee anything about the polling order, and it can change at any time, but I can describe the current implementation:

Each worker thread in the runtime has a local queue of tasks with a maximum capacity of 256 notifications. The runtime also has a global shared queue with unbounded capacity, and finally each worker has a LIFO slot.

If you wake up a task from outside the runtime, the notification always goes to the global queue. If you wake up a task from one of the runtime's worker threads, the runtime will put the notification in the LIFO slot if the slot is empty, unless the task woke itself up, in which case the LIFO slot is skipped. Otherwise it will try to put the notification at the back of the local queue. If the local queue is full, the new notification and half of the tasks in the local queue are pushed to the global queue.

When the runtime needs to find a new task to poll, it will usually take from the local queue and, if that is empty, from the global queue; however, it will occasionally swap this order to ensure that tasks in the global queue are eventually polled. If neither queue has anything, it will attempt to steal tasks from the local queues of other workers. Once it has found a task to poll, it sets up a coop budget and polls that task. After polling it, it will keep polling whatever task is in the LIFO slot until the coop budget is used up or the LIFO slot is empty. Once the coop budget is used up, the LIFO slot task is moved to the local run queue, and the whole process starts over.
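
As a rough illustration, the queueing and task-selection rules above look something like this. This is a simplified toy model, not Tokio's actual code; the constants are only indicative, and the coop budget and LIFO-slot draining are left out for brevity:

```rust
use std::collections::VecDeque;

type TaskId = u64;
const LOCAL_QUEUE_CAP: usize = 256;

// One worker's queues in this toy model.
struct Worker {
    lifo_slot: Option<TaskId>,
    local_queue: VecDeque<TaskId>,
}

// Routing a wakeup, following the description above.
fn wake(
    worker: Option<&mut Worker>,
    global_queue: &mut VecDeque<TaskId>,
    task: TaskId,
    woke_itself: bool,
) {
    match worker {
        // Woken from outside the runtime: always the global queue.
        None => global_queue.push_back(task),
        Some(w) => {
            if w.lifo_slot.is_none() && !woke_itself {
                // Empty LIFO slot, and the task did not wake itself: use the slot.
                w.lifo_slot = Some(task);
            } else if w.local_queue.len() < LOCAL_QUEUE_CAP {
                // Otherwise, back of the local queue.
                w.local_queue.push_back(task);
            } else {
                // Local queue full: spill the new notification plus half of the
                // local queue to the global queue.
                global_queue.push_back(task);
                for _ in 0..LOCAL_QUEUE_CAP / 2 {
                    if let Some(t) = w.local_queue.pop_front() {
                        global_queue.push_back(t);
                    }
                }
            }
        }
    }
}

// Picking the next task to poll, following the description above.
fn next_task(
    w: &mut Worker,
    global_queue: &mut VecDeque<TaskId>,
    peers: &mut [Worker],
    tick: u64,
) -> Option<TaskId> {
    // Every so often, check the global queue first so it is not starved.
    let from_queues = if tick % 61 == 0 {
        global_queue.pop_front().or_else(|| w.local_queue.pop_front())
    } else {
        w.local_queue.pop_front().or_else(|| global_queue.pop_front())
    };
    // If both queues are empty, try stealing from another worker's local queue.
    from_queues.or_else(|| peers.iter_mut().find_map(|p| p.local_queue.pop_back()))
}

fn main() {
    let mut global = VecDeque::new();
    let mut worker = Worker { lifo_slot: None, local_queue: VecDeque::new() };

    wake(None, &mut global, 1, false);              // external wake -> global queue
    wake(Some(&mut worker), &mut global, 2, false); // in-runtime wake -> LIFO slot
    wake(Some(&mut worker), &mut global, 3, false); // slot occupied -> local queue

    assert_eq!(worker.lifo_slot, Some(2));
    assert_eq!(next_task(&mut worker, &mut global, &mut [], 1), Some(3));
}
```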

In all of this, spawning a new task behaves identically to waking up an existing task.

The LIFO slot is a very effective optimization for things that send messages over channels, as it means the receiver gets to run very soon after the send.
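
For example, in a generic oneshot request/response pattern (not code from this thread), the wake triggered by the send is the one that benefits:

```rust
use tokio::sync::oneshot;

#[tokio::main]
async fn main() {
    let (tx, rx) = oneshot::channel();

    tokio::spawn(async move {
        // This send wakes the receiving task; with the scheduler described above,
        // that wake typically lands in the worker's LIFO slot, so the receiver
        // gets polled shortly afterwards.
        let _ = tx.send("hello");
    });

    let msg = rx.await.unwrap();
    println!("{msg}");
}
```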

5 Likes

Interesting. I'll point to a related piece of prior work by von Behren et al.: Capriccio; in particular, see Section 4 on resource-aware scheduling. (Capriccio was an asynchronous, user-level cooperative threading system from before "async" in today's sense existed.)

In this approach, the runtime system tracks blocking points (which are .await points in async terminology) by identifying the calling context via its stack trace, and the scheduler then decides which blocking points' tasks should be scheduled next based on resource usage.

The idea is that in order to maintain high throughput (and avoid tail latencies), you generally want to prefer threads (futures, in async terminology) that are "close" to the end of their processing pipeline (say, completing an HTTP transaction). The scheduler has no upfront knowledge of which blocking points it should prefer in this way, so it heuristically determines them by looking at a task's resource usage. For instance, a task's resource usage decreases when it closes a file descriptor, which lets the scheduler infer that the task is near the end of its pipeline. This is claimed to lead to an admission-control effect where futures/tasks that have already started processing are preferred over those that represent new clients or requests.
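
As a toy illustration of that heuristic (not Capriccio's actual algorithm; the resource deltas here are made up), a priority function might look like:

```rust
// Toy sketch: prefer blocking points whose tasks are shedding resources,
// since those tasks are likely near the end of their pipeline.
#[derive(Clone, Copy)]
struct BlockingPoint {
    open_fds_delta: i64, // change in file descriptors held across this point
    memory_delta: i64,   // change in allocated bytes across this point
}

fn priority(bp: BlockingPoint) -> i64 {
    // More negative resource deltas => higher priority.
    -(bp.open_fds_delta + bp.memory_delta)
}

fn main() {
    let closing = BlockingPoint { open_fds_delta: -1, memory_delta: -4096 };
    let accepting = BlockingPoint { open_fds_delta: 1, memory_delta: 4096 };
    // Tasks that are finishing up are scheduled ahead of tasks accepting new work.
    assert!(priority(closing) > priority(accepting));
}
```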

There are of course a number of differences. First off, Tokio is multi-threaded, so it must also deal with load balancing among workers, whereas Capriccio used only a single core; second, it's not clear how well the blocking graphs would be identifiable in today's workloads and whether they would lend themselves to the analysis done by Capriccio (for instance, it's not clear how persistent connections or even QUIC would be recognized).

Still, rather than relying on a "coop budget", the idea of the scheduler learning which blocking points (await points) it should prefer is intriguing. It would need to be compared to simpler approaches, such as letting the application hint which futures the scheduler should prefer.

4 Likes
