I was thinking that another aspect of async code performance which might be interesting to benchmark is coroutine creation vs thread creation. For servers with short-lived connections, that's an important figure of merit, and reuse of I/O threads, while technically feasible, is sufficiently tricky to implement correctly and efficiently to make async tasks look almost simple in comparison.
NPTL (Native POSIX Threads for Linux, the present thread implementation in glibc) actually implements thread stack re-use in the library. One of the reasons NPTL beat the competing N:M userspace pthread library IBM was developing was that NPTL ended up being faster at creating threads than the N:M library, even though lightweight thread creation was supposed to be the (more complex) N:M design's point of strength.
Maybe async is even faster still! But it's important to actually measure.
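To make the comparison concrete, here is a minimal std-only sketch of one half of that measurement. Note the caveat baked into it: constructing a future only builds its state machine, it is not a full task spawn onto an executor, so this is a lower bound on the async side, not a complete benchmark. The function names are mine, not from any existing harness.

```rust
use std::time::{Duration, Instant};

// Rough sketch, not a rigorous benchmark: compare the cost of creating and
// joining N OS threads with the cost of merely constructing N futures.
fn time_thread_creation(n: usize) -> Duration {
    let t0 = Instant::now();
    let handles: Vec<_> = (0..n).map(|_| std::thread::spawn(|| ())).collect();
    for h in handles {
        h.join().unwrap();
    }
    t0.elapsed()
}

fn time_future_creation(n: usize) -> Duration {
    let t0 = Instant::now();
    // Constructing an async block only builds its state machine in memory;
    // nothing runs until an executor polls it.
    let futs: Vec<_> = (0..n).map(|i| async move { i }).collect();
    let elapsed = t0.elapsed();
    drop(futs);
    elapsed
}

fn main() {
    println!("spawn+join 1000 threads: {:?}", time_thread_creation(1_000));
    println!("construct 1000 futures:  {:?}", time_future_creation(1_000));
}
```

A fair version would also spawn the futures onto a runtime (e.g. with tokio) so the executor's bookkeeping is included on the async side.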
Meanwhile, I tried my own suggestions of increasing epoll concurrency and per-task CPU load just to see how it would affect async performance.
On my laptop's "Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz", which turns out to have a performance profile pretty close to the Xeon chip mentioned above...
- Splitting the bucket brigade into 8 chains is enough to make the async version perform as well as the forcefully serialized threaded version (with `taskset -c 0`); splitting it into more chains does not change the picture anymore, so I suspect this is just amortization of epoll overhead.
- Adding trivial CPU busywork (just a loop performing volatile reads on the buffer and making sure that it has the expected value) in each task shows that CPU work execution is quite a bit more efficient in the async version.
- Strangely enough, the gap between the two versions kept increasing indefinitely as I increased the CPU load, until the benchmark became unbearably long-running (without touching the iteration count, at least; it's still possible to go larger by tuning it down, of course).
- For 65536 volatile reads + tests on each of the 512 tasks, I get 11ms/iteration in the async version (which is pretty much optimal if you consider that this CPU has a sequential turbo frequency of 3.6 GHz and the execution time without the busy loop was ~2ms/iteration) whereas the threaded version spends 16ms/iteration.
- I'm not sure why this is happening; it's certainly not what I expected. I thought the gap would widen, then narrow again as the context-switching overhead of threads amortized. Maybe this is somehow linked to oversubscription of the kernel scheduler, or maybe I just didn't take the experiment far enough?
- Once enough CPU work is added, the tokio rt-threaded implementation actually starts parallelizing decently well, though not as well as the pure threaded version: at 3.7ms/iteration for the tokio version and 4.7ms/iteration for the threaded version, the gap between the two has narrowed a bit. I suspect that is because the tokio scheduler either 1/ needs more optimization work or 2/ should use more of my system's threads (which it doesn't, most likely out of distrust of hyperthreading, which as we've seen is justified on IO- or syscall-bound workloads).
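For reference, here is a hypothetical reconstruction of the kind of per-task busywork described in these bullets (not the OP's actual code): a loop of volatile reads over the buffer, with a check of the expected value, so the compiler cannot optimize the loop away.

```rust
// Volatile-read busy loop: read_volatile forces a real memory access on
// every iteration, so the work cannot be elided by the optimizer.
fn busy_work(buf: &[u8], expected: u8, passes: usize) -> usize {
    let mut reads = 0;
    for _ in 0..passes {
        for byte in buf {
            let v = unsafe { std::ptr::read_volatile(byte) };
            assert_eq!(v, expected);
            reads += 1;
        }
    }
    reads
}

fn main() {
    let buf = vec![42u8; 64];
    // 1024 passes over 64 bytes ~ 65536 reads per task, as in the experiment
    let total = busy_work(&buf, 42, 1024);
    println!("performed {} volatile reads", total);
}
```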
The only significant advantage, and the whole point, is memory usage, as OP's benchmark showed: every thread requires at least an 8-16 KB kernel-mode stack and a 4 KB-2 MB user-mode stack, plus other memory allocations in both kernel and user space, whereas a future can typically be around 100 bytes.
CPU performance should be a similar order of magnitude, in fact blocking I/O might be faster in some cases.
I/O performance should of course be the same assuming the same number of I/O requests are in-flight (however, if the maximum number of threads is not enough to saturate I/O, then of course async will be massively faster).
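You can check the "around 100 bytes" claim for yourself; exact numbers depend on the compiler version and on what the future captures. A future is a state machine whose size is roughly its arguments plus whatever locals live across `.await` points (`tiny_task` here is just an illustrative toy):

```rust
// An async fn's returned future is an anonymous state machine; its size
// can be inspected like any other value.
async fn tiny_task(x: u32) -> u32 {
    x + 1
}

fn main() {
    let fut = tiny_task(41);
    println!("future size: {} bytes", std::mem::size_of_val(&fut));
    // Compare with a thread's minimum stack, which is kilobytes at best.
}
```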
Exactly. As noted above async is for waiting, whilst doing nothing, on many things in the space you have available. Not for processing a lot of compute work.
I am yet to be convinced that async is what one should be doing unless one is handling thousands of network connections and the like.
One clarification I didn't see anyone mention yet: asynchronicity in general is often motivated by responsiveness, especially when the program in question expects to receive many tasks of varying sizes and is expected to respond to small tasks quickly even while it has ongoing large tasks (GUIs and web servers are often like this). When the goal is to maximize responsiveness rather than throughput or total execution time, the more lightweight user-space mechanisms are often the best choice, simply because they let you break large tasks into far more tiny, interruptible chunks for the same overhead.
Unfortunately I'm not good enough at benchmarking to figure out how to meaningfully demonstrate this, but no one else mentioned it.
In contexts where throughput is the only thing that matters, AFAIK the rest of the thread hit the nail on the head:
It is a bit of a tangent, but I would like to add another important difference between async/coroutine-style concurrency and OS threads: OS threads are scheduled preemptively and with configurable policies (e.g. priorities, niceness, scheduling classes), whereas one has little control over the scheduling of coroutines in an async application. A single future that fails to yield can block a whole executor thread, so starvation and lock-ups become real issues again that need to be managed, e.g. with "blocking" thread pools. So IMHO, there is a cost to not letting the OS scheduler see these tasks as separate entities and hence being unable to do its job.
One clarification I didn't see anyone mention yet is that asynchronicity in general is often motivated by responsiveness.
I would expect preemptive threads to be preferable to async designs in cases like this: to keep async code responsive, the programmer must identify all potentially lengthy calculations and arrange to yield in a way that breaks them up. With preemptive threads, suspension is possible at any point, including library code.
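The "arrange to yield" discipline mentioned above can be sketched with std alone. Tokio and async-std ship an equivalent `yield_now()`; the hand-rolled version below shows the mechanism: the first poll returns `Pending` after waking itself, handing control back to the executor so other tasks can run between chunks of a long computation. The busy-polling `block_on` is only a toy driver for the example, not a real executor.

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// A yield point: Pending once (after self-waking), then Ready.
struct YieldNow {
    yielded: bool,
}

impl Future for YieldNow {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.yielded {
            Poll::Ready(())
        } else {
            self.yielded = true;
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

// A long computation broken into interruptible chunks by explicit yields.
async fn long_job(data: &[u64]) -> u64 {
    let mut sum = 0;
    for chunk in data.chunks(1024) {
        sum += chunk.iter().sum::<u64>();
        YieldNow { yielded: false }.await; // cooperative scheduling point
    }
    sum
}

// Minimal busy-polling executor, just enough to drive the example.
fn block_on<F: Future>(mut fut: F) -> F::Output {
    fn noop(_: *const ()) {}
    fn clone(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    let waker = unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) };
    let mut cx = Context::from_waker(&waker);
    // Safety: fut is shadowed here and never moved after being pinned.
    let mut fut = unsafe { Pin::new_unchecked(&mut fut) };
    loop {
        if let Poll::Ready(v) = fut.as_mut().poll(&mut cx) {
            return v;
        }
    }
}

fn main() {
    let data = vec![1u64; 4096];
    println!("sum = {}", block_on(long_job(&data)));
}
```

The point of the thread-based counterargument is that none of this machinery is needed there: the kernel can preempt at any instruction, including inside library code you never audited for yield points.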
I'm not sure what "responsive" is being referred to here.
Typically "responsive" is used to describe GUIs. Be they in Windows or in the browser.
User clicks a button, something should happen. The GUI should not freeze and become unresponsive.
That means the "something" has to take very little time. You don't need threads for that; you might want an "event loop" to handle all those user actions. If the "something" will take time, it's better to indicate to the user that "it's happening" and start a thread to actually do it, perhaps on a different core nowadays.
Traditional GUIs, like Windows and the browser, are async.
Turns out this "responsive" is also useful when dealing with thousands of HTTP connections and other communication tasks, when the overheads of threads start to make everything unresponsive.
Async is for waiting on lots of things and being "responsive" when they happen.
Threads are for doing long term work.
I don't think anybody here disagrees that there is a certain scale where the fixed overhead of an OS thread is so much higher than that of an async task that a coroutine-style system will continue to scale where a thread-based system will not.
However, a lot of people seem to fail to estimate (or better yet measure) where that scale begins for their particular use case. The advantages of thread-based systems like better isolation and fairness are often not even discussed.
It's a good question. And one that our OP here started out trying to measure.
I'm at a loss to know how or what to measure in order to make the call between async and threads for my applications. It's so dependent on what kind of work one is doing and how one expects it to be scheduled. Does one have to write the application first to find out one made the wrong choice?
As it stands, we have a ton of complexity in our operating systems to handle threads and make our programming simple. I suspect we should not adopt a ton of complexity into our programming in order to avoid threads, unless we actually need it. As the async chapter of the Rust book says.
In my personal opinion, yes. But I also think that it usually is a rather incremental process instead of a single choice, i.e. one measures the performance of an existing system and then tries local modifications to the parts that stood out.
For example, we recently had a problem with an embedded service that is memory constrained, so we decided to multiplex all work items onto a single thread. We then noticed that certain rare maintenance operations would take an unreasonably long time as the executor was busy servicing the much more numerous events from the main workload.
Hence we decided to have two threads: one servicing futures for the main workload, which was mostly busy, and another that would be mostly idle but would spring into action to quickly perform maintenance operations when required. Very little of the application changed except for initializing an additional executor and spawning the right tasks onto each one.
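A std-only sketch of that split, for readers without the codebase in front of them: the real service used two Tokio runtimes, but here plain threads and channels stand in for "one executor per workload" (all names, including the "rotate-logs" maintenance op, are hypothetical). Items on the busy queue can never delay the maintenance queue, because each queue has its own dedicated thread.

```rust
use std::sync::mpsc;
use std::thread;

fn run_split(items: u64) -> (u64, Vec<&'static str>) {
    let (main_tx, main_rx) = mpsc::channel::<u64>();
    let (maint_tx, maint_rx) = mpsc::channel::<&'static str>();

    // Worker 1: mostly busy servicing the high-volume main workload.
    let busy = thread::spawn(move || main_rx.iter().sum::<u64>());

    // Worker 2: mostly idle, springs to action for rare maintenance ops.
    let idle = thread::spawn(move || maint_rx.iter().collect::<Vec<_>>());

    for i in 0..items {
        main_tx.send(i).unwrap();
    }
    maint_tx.send("rotate-logs").unwrap(); // hypothetical maintenance op

    // Closing the senders lets both worker loops finish.
    drop(main_tx);
    drop(maint_tx);
    (busy.join().unwrap(), idle.join().unwrap())
}

fn main() {
    let (sum, ops) = run_split(10_000);
    println!("main workload sum = {}, maintenance ops = {:?}", sum, ops);
}
```

With Tokio, the analogous move is building two runtimes (e.g. via `tokio::runtime::Builder`) and spawning each class of task onto its own.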
You don't say how your application was built. But that statement implies you had already adopted the ton of complexity that is an async framework. So farming some part out to an actual thread or two was simple.
Yes, that particular application is based on Tokio. I would still say that using an async framework is only a piece of the puzzle, i.e. one usually tries to have a somewhat isolated functional core where a question like threading or async IO is mostly irrelevant. In this particular case, most of the code is in libraries that have no dependency on Tokio whatsoever and if we ever decided to use something different, the bet is that only the wiring in the top-level crate would need to change.