Don't fully understand tokio multithreaded runtime benefits

Hello there!

I don't fully understand the benefits of Tokio's multithreaded runtime. With this type of runtime, it looks like Tokio runs each new async task on a new thread, if there are available threads (ones that have finished their previous work and are ready to take on a new portion of work),
and the default number of threads equals the number of CPU cores.

But if I manually limit the number of threads to just 1 (by specifying #[tokio::main(worker_threads = 1)]) and then measure the elapsed time of executing, for example, 500 async tasks, where each task just sleeps for 1 second, logs something to stdout, and returns some value, I see that the elapsed time for these 500 tasks is (+/-) the same as when I run the same 500 tasks using 10 threads.
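Here's roughly what my test looks like (a minimal sketch, assuming Tokio with the macros, rt-multi-thread, and time features; the task count and sleep duration are just the numbers from my experiment):

use std::time::{Duration, Instant};

// Switching worker_threads between 1 and 10 gives (+/-) the same elapsed
// time, because the tasks only sleep and do no real CPU work.
#[tokio::main(flavor = "multi_thread", worker_threads = 1)]
async fn main() {
    let start = Instant::now();

    let handles: Vec<_> = (0..500)
        .map(|i| {
            tokio::spawn(async move {
                tokio::time::sleep(Duration::from_secs(1)).await;
                println!("task {i} done");
                i
            })
        })
        .collect();

    for handle in handles {
        handle.await.unwrap();
    }

    println!("elapsed: {:?}", start.elapsed());
}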

I know that Tokio provides the possibility to run blocking tasks using spawn_blocking, but, as far as I know, it uses a separate thread pool for scheduling/running these blocking tasks.

And, in general, it looks like a multithreaded approach to executing some number of non-blocking I/O tasks doesn't provide any performance benefit.
Maybe I'm missing something?

2 Likes

Well, real-world applications will still do some actual (short-duration) computation. They are not there just to sleep. You don't really want to spawn_blocking all computation, since that has a cost and is cumbersome.

2 Likes

There are tradeoffs between the single-threaded and multi-threaded strategies for an async runtime.

I don't think it's necessarily true that "There is no benefit unless you do some IO, in particular network IO." Here's my reasoning: If you are 100% I/O-bound, with zero CPU (ideal / not realistic scenario), then you can do an infinite amount of work with a single thread.

But there is some amount of inherent CPU work even in I/O-heavy applications, to process and move data around, etc. before sending/receiving the bytes via the OS. And multi-threaded execution allows multiple CPU cores to share the load.

But a multi-threaded runtime comes with some overhead in terms of synchronization, the need to make things thread-safe, and added complexity in the scheduler. So it's likely that in many simple cases the single-threaded Tokio runtime will be faster than the multi-threaded one, but you'll have to test your specific application to see which is best.
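If you want to compare the two directly, both flavors can be built explicitly via tokio::runtime::Builder (a rough sketch, not tuned for any particular workload):

use tokio::runtime::Builder;

fn main() {
    // Single-threaded flavor: everything runs on the thread calling block_on.
    let single = Builder::new_current_thread()
        .enable_all()
        .build()
        .unwrap();

    // Multi-threaded flavor: a work-stealing pool, here capped at 4 workers.
    let multi = Builder::new_multi_thread()
        .worker_threads(4)
        .enable_all()
        .build()
        .unwrap();

    single.block_on(async { /* your workload */ });
    multi.block_on(async { /* your workload */ });
}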

2 Likes

My question was just about combining the async model with the multi-threading model. For example, in Python the event loop never schedules a new async task on a separate thread implicitly, and the thread pool attached to the event loop can only be used explicitly: when you need to run some blocking I/O task without blocking the whole event loop, you call the appropriate method (like tokio::spawn_blocking). So I mentally model async as a single-threaded mechanism where (almost) all activity happens on only one thread :slight_smile:

Sorry, could you please explain why tokio::spawn_blocking is not so good?
And, if I want to run CPU-bound task(s), can I use spawn_blocking for this purpose?

In Tokio, the event loop can have multiple threads for running async tasks. When you start an async task via tokio::spawn, it goes into the current thread's "ready to run" queue[1], but if another worker thread is idle, it can "steal" tasks from other workers to make progress in parallel.

That way, you can have as many async tasks running in parallel as you have CPU cores to run them on; if one core is busy, another thread can take tasks away from it, and run them at the same time. This isn't worth the bother in Python, because of the GIL.
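As a rough illustration of what that buys you (busy_work here is a made-up stand-in for short bursts of CPU work, not anything from Tokio):

fn busy_work(n: u64) -> u64 {
    // Made-up CPU-bound helper: just burns cycles so the parallelism is visible.
    (0..n).fold(0, |acc, x| acc.wrapping_add(x.wrapping_mul(x)))
}

#[tokio::main] // defaults to the multi-threaded runtime
async fn main() {
    let handles: Vec<_> = (0..8)
        .map(|_| tokio::spawn(async { busy_work(10_000_000) }))
        .collect();

    for handle in handles {
        handle.await.unwrap();
    }
    // With several idle workers, these tasks get stolen and run in parallel;
    // with worker_threads = 1 they run one after another on the same thread.
}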

The thread pool used for spawn_blocking is awkwardly sized - it's meant for cases where the thread makes a blocking I/O call, and thus can't yield the thread back to the event loop. As a result, it's too big if you're trying to use it to run lots of CPU-bound work (since it has many more threads than you'll have CPUs), while still too small if you mix it with something like file I/O (which Tokio normally implements as blocking calls on this same thread pool).

The recommendation if your workload has significant CPU-bound work to delegate to a thread pool is to use Rayon to provide the thread pool, and if needed, something like tokio_rayon to integrate Tokio and Rayon. Or, if you just need a thread to do work on, use std::thread to provide a thread of your own.
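One way to do that integration by hand, if you'd rather not pull in tokio_rayon, is a oneshot channel (a sketch; heavy_computation is a placeholder for your CPU-bound work):

use tokio::sync::oneshot;

// Placeholder for some CPU-bound work you want off the async workers.
fn heavy_computation(input: u64) -> u64 {
    (0..input).sum()
}

// Run the computation on Rayon's thread pool and await the result from async code.
async fn compute_on_rayon(input: u64) -> u64 {
    let (tx, rx) = oneshot::channel();
    rayon::spawn(move || {
        // If the receiver was dropped, there's nobody to tell; ignore the error.
        let _ = tx.send(heavy_computation(input));
    });
    rx.await.expect("rayon task panicked before sending a result")
}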


  1. If the current thread's "ready to run" queue is already full, it can instead go into a global "ready to run" queue that's shared. ↩︎

1 Like

Thanks for the explanation!

Yes, in Python we have the GIL, so threads cannot run in parallel (while they execute Python bytecode), but they can execute I/O-bound tasks concurrently. The async model is an alternative way to process I/O-bound tasks concurrently, but with less context-switch overhead and in a more controllable manner.

In Rust, it looks like Tokio mixes these two approaches together, which confuses me, because:

  • we have thread context-switch overhead;
  • we don't get (am I wrong?) much benefit from parallelism on a multi-core machine for an application that does I/O 90% of the time, where the application threads just wait for the results of I/O operations for most of their execution time.

Parallelism of CPU work is supported. If you don't have a lot of CPU work, you may not be able to take advantage of all the CPUs, but that is always true. What is the drawback of that?

2 Likes

I didn't say that this is a drawback; I'm just trying to understand the benefits, which should help me use Tokio in an optimal way :slight_smile:

Also, as far as I can see, the futures lib, for example, doesn't use multi-threading to run many async tasks concurrently (I tried FuturesUnordered), and I'm curious: will Tokio be more performant in a real-world application that does a lot of I/O and doesn't have a lot of CPU-intensive activity?

1 Like

More performant than what?


I'll try to be more clear.

You're saying that you're trying to understand the benefits of tokio with multiple threads, and you think for some reason that it only has benefits if there is "a lot of I/O and doesn't have a lot of CPU-intensive activities". I don't know why you think that, since the more CPU work is done, the more benefit there is from multiple threads (assuming multiple cores as well).

So I'm guessing that you're comparing tokio with multiple threads to something else, and I don't know what that something else is.

1 Like

Just a nit: You can configure the size of the blocking thread pool.
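For example, with the runtime builder (a sketch; the numbers are arbitrary, and the default blocking-pool cap is much larger):

use tokio::runtime::Builder;

fn main() {
    let runtime = Builder::new_multi_thread()
        .worker_threads(4)          // async worker threads
        .max_blocking_threads(16)   // cap on the spawn_blocking thread pool
        .enable_all()
        .build()
        .unwrap();

    runtime.block_on(async { /* ... */ });
}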

You can trivially do on many threads what you can on a single thread. The runtime will spawn threads if it sees fit. There's no need for additional synchronization between tasks because async tasks are already required to be thread-safe at the type level, even in a single-threaded setting, since runtimes can potentially be multi-threaded.

What's wrong with any of that?

1 Like

I meant using tokio::spawn vs. using FuturesUnordered on Tokio's multi-threaded runtime.

With FuturesUnordered, you're only executing the contained futures when the container (FuturesUnordered) is polled. This has some foot-guns where the futures do I/O with wall-clock timeouts applying, and you fail to poll the FuturesUnordered frequently enough.

With spawning tasks onto Tokio's thread pool, the goal is to give you the best of all worlds; if your workload is completely I/O bound, and only one task is ready to run at a time, it runs single-threaded, and you don't pay any migration overhead to move tasks between threads. If, however, your workload would benefit from running the CPU components of your tasks on different CPU threads, then Tokio will migrate some tasks onto another thread transparently for you.

Thus, if you're using the multithreaded runtime, but your workload fits nicely in a single thread, only one worker thread is actually running, and the other workers are parked; as soon as you have a worker thread that's busy, and a second task that's ready to run, Tokio will move the second task from the busy worker to a parked worker and wake it up, so that both tasks can run in parallel. Python can't execute code in parallel due to the GIL, so it can't do this, even though there are performance benefits from doing so.

The key insight behind async is that threads aren't a great abstraction if you want the thread to spend most of its time asleep; async tasks are designed to be cheap to put to sleep and wake up later. But threads are the best abstraction we have if you want to use multiple CPU cores at once, so if you have enough work for multiple CPU cores to benefit you, you want to use threads for the tasks that aren't sleeping.

5 Likes

Thanks! I had read the original GitHub issue about the FuturesUnordered foot-guns and the related articles. It looks like the described problem comes into play when using a buffered stream of futures, because the .buffered(N) call on the stream limits the number of futures that can be started concurrently to N.
Also, I saw that all these examples use for_each({...}) (instead of .for_each_concurrent(limit, {...})), and that is the second bottleneck.
In general, when I asked about tokio::spawn vs FuturesUnordered performance in my previous comment, I meant the more general use case of FuturesUnordered, where a buffered stream limited to some N doesn't come into play at all.

This is a misunderstanding - the problem comes into play when you don't poll the FuturesUnordered frequently enough, it's just that the most obviously sensible way to get into this mess is with something like .buffered or .buffer_unordered, since that's "clearly" the right way to ask for N things concurrently, but no more.

But you can get into the mess with just FuturesUnordered: consider the following code:

use std::net::SocketAddr;
use futures::stream::{FuturesUnordered, StreamExt};
use tokio::net::TcpStream;

// Illustrative only: TlsConnection, TlsError and the TLS helper methods are
// stand-ins, not real TcpStream APIs.
async fn establish_tls_connection(host: SocketAddr) -> Result<TlsConnection, TlsError> {
    let socket = TcpStream::connect(host).await?;
    let server_cert = socket.tls_hello().await?;
    let master_secret = socket.client_key_exchange(server_cert).await?;
    socket.establish_ciphering(master_secret).await
}

async fn bad(hosts: &[SocketAddr]) -> Result<(), TlsError> {
    let mut tls_sockets: FuturesUnordered<_> =
        hosts.iter().copied().map(establish_tls_connection).collect();

    while let Some(socket) = tls_sockets.next().await {
        do_something_that_sometimes_takes_10_minutes(socket?).await?;
    }

    Ok(())
}

This has the problem, despite just using FuturesUnordered. If the remote hosts have a timeout of (say) 90 seconds to establish a TLS connection before they close the socket, we can easily have a situation where some of the hosts are awaiting in establish_tls_connection, but ready to be polled, while bad is not polling tls_sockets because it's awaiting in do_something_that_sometimes_takes_10_minutes. Because bad is not polling tls_sockets, the remote host timeouts can expire for no obvious reason - CPU is idle, it's just that tls_sockets started a future that it then didn't poll to completion (because it had a result to return from next), and the while loop didn't then come back into tls_sockets.next().await for a long wall-clock time.

It's a really subtle foot-gun, because it depends on two things that individually don't look nasty:

  1. You have an async fn like establish_tls_connection that has a wall-clock time limit imposed on it, but that doesn't spawn anything.
  2. You have a while let loop or similar construct draining the FuturesUnordered as fast as possible, but where "as fast as possible" involves a lot of hidden wall-clock time.

There are workarounds - for example, in this case, you could change the map(…) to map(|host| async move { tokio::spawn(establish_tls_connection(host)).await.expect("establish_tls_connection task panicked") }), so that the FuturesUnordered just contains lightweight futures awaiting JoinHandles, and doesn't do the spawn until the last minute - but it's not always possible to use a workaround.
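Spelled out against the example above, that workaround looks roughly like this (same illustrative types as before); the spawned connection tasks now make progress even while the loop is stuck inside do_something_that_sometimes_takes_10_minutes:

async fn less_bad(hosts: &[SocketAddr]) -> Result<(), TlsError> {
    // Each connection attempt is its own task, so the runtime polls it
    // directly instead of waiting for this FuturesUnordered to be polled.
    let mut tls_sockets: FuturesUnordered<_> = hosts
        .iter()
        .copied()
        .map(|host| async move {
            tokio::spawn(establish_tls_connection(host))
                .await
                .expect("establish_tls_connection task panicked")
        })
        .collect();

    while let Some(socket) = tls_sockets.next().await {
        do_something_that_sometimes_takes_10_minutes(socket?).await?;
    }

    Ok(())
}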

2 Likes

Thank you for the detailed explanation!

As I can see, the example you provided is similar to using buffered(N) where N = the count of hosts, and the common problem is:
the next socket cannot be retrieved because, even though the appropriate future has already been woken up, it is blocked at that moment by the awaiting of do_something_that_sometimes_takes_10_minutes() called for the previous socket, right?

And it looks like if I rewrite the second part of your example (polling the establish_tls_connection futures in the while loop and calling do_something_that_sometimes_takes_10_minutes(socket) on each loop iteration) with something like:

tls_sockets
    .try_for_each_concurrent(hosts.len(), |socket| {
        do_something_that_sometimes_takes_10_minutes(socket)
    })
    .await;

this common problem will be solved.

Please correct me if I'm wrong.

The try_for_each_concurrent code has a different problem - the intent of the while let loop is to run all of the establish_tls_connections in parallel, while only running one do_something_that_sometimes_takes_10_minutes at a time. try_for_each_concurrent, however, will run do_something_that_sometimes_takes_10_minutes in parallel, which can cause resource starvation issues (e.g. if do_something_that_sometimes_takes_10_minutes takes up 32 GiB RAM per call, and you have 64 GiB RAM, you're in a lot of trouble when you try to run 3 in parallel).

If do_something_that_sometimes_takes_10_minutes was cheap apart from wall-clock time, this would be a good workaround; it's equivalent to changing the original bad to:

async fn bad(hosts: &[SocketAddr]) -> Result<(), TlsError> {
    let mut tls_sockets: FuturesUnordered<_> =
        hosts.iter()
             .copied()
             .map(|host| async move {
                  do_something_that_sometimes_takes_10_minutes(
                      establish_tls_connection(host).await?
                  ).await
              })
             .collect();

    while let Some(result) = tls_sockets.next().await {
        result?;
    }

    Ok(())
}

When you write it this way, it's now clear why there's no issue - the body of the while let loop does nothing but propagate errors, so it can't take a lot of wall-clock time - but you've traded that for many more concurrent calls to a possibly expensive function.

1 Like

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.