Why shouldn't I use Tokio for my High-CPU workloads?

I have a workload that looks like this:

  • Roughly 10,000 tasks, defined as async functions.
  • Each task may take anywhere from 10 nanoseconds to 10 seconds
  • Tasks are defined as black boxes by users of the library
  • These tasks are heavily CPU-bound. Think systems in a game engine.

I am not concerned with latency on any one task, just with finishing all work in the shortest total time.

The Tokio docs (and many other posts) say that Tokio isn't suitable for this type of workload, because CPU-heavy tasks will block other tasks from running. However, this seems to contradict this Tokio blog post, which describes Tokio as a work-stealing scheduler: if there are free cores with no tasks to run, they should steal work from busy processors.

So why the constant recommendation not to use Tokio? Is it because you normally have other latency-sensitive tasks that you want to keep making progress (ie. a web server)? Given that I don't care about latency, it seems like Tokio's behavior is exactly what I want.

I've looked at Rayon, but it doesn't seem to be an async runtime, so it's not clear I'd be able to execute arbitrary async functions on its threadpool. Although the tasks are cpu-bound, they do just a little bit of async waiting, usually on other tasks in the system.

4 Likes

I don't have the knowledge to give a great breakdown on why not to use Tokio, but as far as async threadpool implementations go, you may want to look at bevy_tasks, which Bevy uses for its async threadpools. IIRC it gave Bevy better performance than Rayon, which Bevy used to use, and it supports async.
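Just to give a feel for the API, a minimal sketch of using its task pool might look like this (going from memory, so treat the exact builder/scope calls as assumptions and check the bevy_tasks docs for your version):

    // A minimal sketch, from memory -- double-check the bevy_tasks docs for
    // the exact API on the version you use.
    use bevy_tasks::TaskPoolBuilder;

    fn main() {
        // Build a pool with a fixed number of worker threads.
        let pool = TaskPoolBuilder::new().num_threads(4).build();

        // `scope` spawns futures onto the pool and blocks until they all
        // finish, collecting their results.
        let results = pool.scope(|s| {
            for i in 0..8u64 {
                s.spawn(async move { i * i });
            }
        });

        println!("{:?}", results);
    }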

I think your description of your workload is incomplete. If your tasks are async functions, then they would call .await somewhere (otherwise, there's really no point in making them async). When they do, then this creates dependencies either between the tasks, or between the task and some external elements (like network I/O, etc.). Before debating whether tokio makes sense for you, I think you'd need to share a better understanding of your tasks.

If they are 100% CPU-bound and have no dependencies on each other, why make them async? In that case, you could probably run them better with Rayon's spawn.
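For instance, something like this (just a sketch of the idea, with a made-up compute function):

    // Sketch: purely CPU-bound, independent jobs on Rayon's work-stealing pool.
    fn expensive_compute(n: u64) -> u64 {
        // Stand-in for real work.
        (0..1_000_000u64).fold(n, |acc, i| acc.wrapping_mul(31).wrapping_add(i))
    }

    fn main() {
        // rayon::scope spawns the jobs onto the global pool and waits for all
        // of them to finish before returning.
        rayon::scope(|s| {
            for n in 0..10_000u64 {
                s.spawn(move |_| {
                    let _ = expensive_compute(n);
                });
            }
        });
    }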

When you have a runtime for running async tasks, the last thing you want is to block other tasks from running, since the goal is that your system is always responsive.

As for the scheduler, I don't see any contradiction.

Interesting -- I hadn't heard of bevy_tasks. The immaturity of bevy makes me a touch hesitant, but it seems pretty close to my use case. Thanks!

1 Like

Yeah, I can see that. My relatively uninformed perception is that it shouldn't be changing drastically and isn't the topic of a lot of redesign as far as I've seen, but I could totally be wrong, so don't quote me on that. :upside_down_face:

1 Like

Unfortunately, it's a little bit of both. Here are some examples:

  • A trivial task that flattens a Vec<Vec<i32>> to a Vec<i32>
  • A moderate compute task that awaits on 15 other tasks as inputs.
  • A moderate compute task that awaits on 15 other tasks, but interleaved with computation (sketched below).
  • A moderate task that loads an image asset from disk.
  • A heavy task that calls out to another process (ie. asking Blender for a 3D asset)
  • A heavy task that does geometry computation (ie. smooth a mesh)

So while most tasks are CPU-heavy, occasionally there will be one that waits on some I/O, reading from disk, etc. I'm working on a programming language aimed at game development, so I can't guarantee what the tasks do.
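For concreteness, here's roughly what I imagine the "moderate compute that awaits on other tasks" case looking like (a sketch only; the types and helper names are made up):

    // Sketch only: a task that awaits several input tasks, interleaving CPU
    // work between the awaits. Names and types are made up for illustration.
    use tokio::task::JoinHandle;

    async fn moderate_task(inputs: Vec<JoinHandle<Vec<f32>>>) -> Vec<f32> {
        let mut acc = Vec::new();
        for handle in inputs {
            // Wait on one dependency...
            let data = handle.await.expect("input task panicked");
            // ...then do CPU-bound work on it before awaiting the next one.
            acc.extend(data.iter().map(|x| x * 0.5));
        }
        acc
    }

    #[tokio::main]
    async fn main() {
        // A few fake input tasks that just produce data.
        let inputs: Vec<JoinHandle<Vec<f32>>> = (0..3)
            .map(|i| tokio::spawn(async move { vec![i as f32; 4] }))
            .collect();
        println!("{} values", moderate_task(inputs).await.len());
    }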

In my case, I actually don't care about the system being responsive, just that it finishes as fast as possible. If anything, responsiveness means I'm wasting CPU cycles. :stuck_out_tongue:

I've written a full blog post dedicated to this question:

Blocking the thread has the consequence of preventing other tasks from making progress, but if this doesn't bother you, then you should go ahead and use Tokio with blocking tasks.
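For reference, offloading a blocking or CPU-heavy chunk onto Tokio's blocking thread pool with spawn_blocking looks roughly like this (a minimal sketch; heavy_compute is just a placeholder):

    // A minimal sketch: run a CPU-heavy chunk as a Tokio blocking task.
    // `heavy_compute` stands in for real work.
    fn heavy_compute(n: u64) -> u64 {
        (0..10_000_000u64).fold(n, |acc, i| acc.wrapping_add(i))
    }

    #[tokio::main]
    async fn main() {
        let handle = tokio::task::spawn_blocking(|| heavy_compute(42));
        let result = handle.await.expect("blocking task panicked");
        println!("result = {result}");
    }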

12 Likes

I think this description of your expected workloads is so varied and wide-ranging that it's impossible to make any sensible suggestion.

We could consider some extreme examples:

  1. 10,000 tasks that only take 10ns to complete.

Here I suspect that the overhead of scheduling them as tasks or threads, sync or async, would dwarf the work itself. I can't really accept that something that takes 10ns to complete is "CPU intensive". Would it not be better just to put those jobs into a queue and run them one at a time?

  2. 10,000 tasks that take 10 seconds each to complete.

This sounds like "CPU intensive". Completing them all amounts to 10,000 × 10 s = 100,000 s, or nearly 28 hours of CPU time, so the overhead of scheduling them is negligible compared to their execution time. It won't matter whether they are sync or async.

  3. What about something in between? An 8-core machine, 8 tasks that take 10 seconds each and 9,992 that take 10ns.

In this case getting them all done will take somewhat over 10 seconds. Potentially 9,992 of those 10ns tasks are blocked from running while the 8 ten-second tasks are running, no matter whether they are async or sync.

What kind of game engine do you have in mind? Often games have to get their work done between frame refreshes. About 16ms. This does not sound like that.

A missing piece of your requirements is any mention of whether or how all the tasks are communicating or making use of shared data, and how much. That will throw another spanner in the works.

Anyway, some wise soul summed up the sync vs async choice as something like:

sync is for when you have a lot of computation to do.
async is for when you have a lot of waiting to do.

2 Likes

In that case I don't see a problem. As Alice said, you can spawn Tokio blocking tasks :slight_smile:.

If you can possibly indicate to your language that "this is an IO task" or "this is a compute task", then that could work well with the design of the Bevy Tasks crate, which allows creating thread pools with different numbers of workers for different types of work. For instance, you could make 10 compute threads and 3 IO threads, and then by spawning compute work on the compute threads you could make sure that long-lived CPU-bound tasks won't stall the threads that are driving IO-related work.

Edit: I think you could probably do this with Tokio actually, but I know that Bevy uses different pools for different kinds of work.
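A rough sketch of that "separate pools" idea using two Tokio runtimes (worker counts are arbitrary, and it assumes tokio with the full feature set):

    // A rough sketch of the "separate pools" idea with two Tokio runtimes.
    // The worker counts are arbitrary; tune them for your machine.
    use tokio::runtime::Builder;

    fn main() {
        let compute = Builder::new_multi_thread()
            .worker_threads(10)
            .thread_name("compute")
            .build()
            .unwrap();
        let io = Builder::new_multi_thread()
            .worker_threads(3)
            .enable_all() // timer + I/O drivers only needed on the I/O pool
            .thread_name("io")
            .build()
            .unwrap();

        // CPU-bound work goes to the compute pool...
        let c = compute.spawn(async { (0..1_000_000u64).sum::<u64>() });
        // ...while I/O-ish work goes to the I/O pool.
        let f = io.spawn(async { tokio::fs::read_to_string("Cargo.toml").await });

        // JoinHandles are just futures, so both can be awaited from either runtime.
        io.block_on(async {
            println!("sum = {}", c.await.unwrap());
            println!("read ok = {}", f.await.unwrap().is_ok());
        });
    }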

1 Like

Thanks for the thoughts! I've read that article, actually, and this post is somewhat in response to it. I definitely understand that idle blocking is bad, which is what the article covers (waiting on a DB, sleeping, etc.). The article didn't say much about what to do when the blocking is actually doing valuable work.

Spawning blocking tasks seems wasteful because it may cause a thread to be allocated per task, even if it's a small one.

I think my main confusion has been solved, though. Most of these articles talk about 'tasks not making progress', which I took to mean 'tasks not making progress while cores are idle'. What it really means is 'cores may be fully saturated, so new tasks will be stuck until old tasks are finished'. That's perfectly fine for my use case, so it sounds like Tokio should be just fine.

6 Likes

So the system I'm working on allows you to recompute an entire game session, all at once. A game frame is defined by a directed graph of 100-1000 tasks of varying types. You might ask "What is the game state on frame 8277?", and it would recompute the entirety of the game playback up to that frame. This is an expensive operation, naturally, and will likely take at least on the order of seconds. So I want to make sure it saturates the cores as much as possible.

Tasks don't communicate, other than data dependencies (ie. inputs on other tasks). In this way, the task structure is closer to something like Tensorflow or MapReduce.

The challenge with this system is that games can vary wildly in the types of operations they perform. Some I expect to use a lot of small operations with lots of tasks, and some might use a couple of heavy tasks (ie. a big pathfinding calculation every frame, etc). I'm trying to pick a sensible default, for now. If this works well, I'll likely go back and tune based on real-world data. At the moment I just want to make sure Tokio isn't going to leave cores empty, if there's real work to be done.

1 Like

Interesting... unfortunately I don't really expect things to be that clean. For example, an operation that loads a noise texture and then moves particles based on that texture: I/O or compute? It's kind of both. This is why I'm excited to use async, which seems to let me do I/O without worrying about thread allocation! :slight_smile:

1 Like

Yeah, I really like async for the control flow aspect of it. It just gives you really powerful ways to compose your workload as tasks without having to worry about exactly how to spawn your threads.

1 Like

I would next identify which of these tasks can be implemented in an async, non-blocking fashion, which may be difficult depending on the specific API/library used.

Then you could try this: place all the potentially blocking tasks onto separate threads (with Tokio's spawn_blocking) and use the async threadpool for the rest. Do not use spawn_blocking for your CPU-heavy tasks.

Then benchmark your CPU utilization/load average. If you're way above the number of CPUs, reduce the number of workers tokio uses for its async tasks.
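For example, one knob for that is the worker_threads setting (a sketch; pick the number from your benchmarks rather than this arbitrary value):

    // Cap the number of async worker threads Tokio uses. The value 4 is
    // arbitrary here; choose it based on your benchmarks.
    #[tokio::main(flavor = "multi_thread", worker_threads = 4)]
    async fn main() {
        // ... spawn and await your tasks as usual ...
    }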

I've been meaning to comment on this post. It's technically not wrong, of course, but I somewhat question the custom definition of "blocking" it provides. Traditionally (going back to the MULTICS scheduler), "blocked" means not schedulable. When an async task uses the CPU and fails to yield, that is not really blocking. Redefining the term in the context of an async engine may not be super helpful, because the strategy you apply when your engine's latency increases due to CPU-bound tasks that fail to yield is often different from the one you apply when your CPU utilization dives due to tasks performing blocking I/O.

Notably, if you have a large number of tasks that fail to yield for significant periods of time, you probably can't just throw them into the spawn_blocking thread pool, as this affects your overall CPU capacity (since CPU-bound "blocking" threads will still compete with the hopefully flowing-along, desired-to-be-low-latency async tasks). This may be the case in this thread's poster's scenario.

I was just about to make a similar post. There is some ambiguity in the terminology used to classify types of threads.

I don't think there is any problem with the usage of "blocking" in that blog post. But it does require understanding the context. One has to look at it from the point of view of the async run time.

From that context "blocking thread" is one that will block execution of other async threads. It can do this by a) Doing a lot of intense calculation for a long time without yielding with await, b) By calling a synchronous I/O function that will take a while to complete. Both of these will block any scheduling of threads on the core the block thread is running on.

I know little of MULTICS, but I have worked with, and sometimes created, cooperative schedulers* quite a few times. I'm sure we would have used the term "blocking" to describe threads that hang up the cooperative scheduler, in the same way it is used when discussing blocking of the async scheduler in the likes of Tokio.

Also, it helps to notice that the async runtime schedules "tasks", not "threads", so as to distinguish what the async runtime schedules from typical POSIX-like threads.

So I don't think there is any redefinition of terms going on here.

[*] Cooperative schedulers were great. Small, simple and fast. No need for complex mutexes and such. Ideal for the slow processors with limited memory of embedded systems back in the day.

Edit: I notice Tokio has a yield_now() function that can be used to prevent a task that otherwise has nothing to await on from blocking the async scheduler. So one can do this:

    use tokio::task;

    task::spawn(async move {
        loop {
            // Do some small part of a long calculation...
            // ...
            // ...then yield, allowing other async tasks to run.
            task::yield_now().await;
        }
    });

Begins to feel like the good old cooperative scheduler days :slight_smile:

Yeah, this sounds pretty close to my plan! Thanks, and good point about libraries that aren't async efficient. Hopefully that will improve with time, though.

My primary confusion has been that for most systems, responsiveness/latency is an incredibly important metric.

In my case, I actually want the threads to block, to avoid unnecessary task switching, because I only care about throughput. But it seems that's pretty clearly an unusual need.

I'll read up more on bevy_tasks. It's interesting that they went with a different approach over tokio. I wonder why, given that Tokio does seem fairly suitable for their needs, too.