I was thinking that another aspect of async code performance which might be interesting to benchmark is coroutine creation vs thread creation. For servers with short-lived connections, that's an important figure of merit, and reuse of I/O threads, while technically feasible, is sufficiently tricky to implement correctly and efficiently to make async tasks look almost simple in comparison.
NPTL (Native POSIX Threads for Linux, the present thread implementation in glibc) actually implements thread stack re-use in the library. One of the reasons NPTL beat the competing N:M userspace pthread library IBM was developing was that NPTL ended up being faster at creating threads than the N:M library, even though lightweight thread creation was supposed to be the (more complex) N:M design's point of strength.
Maybe async is even faster still! But it's important to actually measure.
Meanwhile, I tried my own suggestions of increasing epoll concurrency and per-task CPU load just to see how it would affect async performance.
On my laptop's "Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz", which turns out to have a performance profile pretty close to the Xeon chip mentioned above...
- Splitting the bucket brigade into 8 chains is enough to make the async version perform as well as the forcefully serialized threaded version (with
taskset -c 0), splitting it in more chains does not change the picture anymore so I suspect this is just amortization of epoll overhead.
- Adding trivial CPU busywork (just a loop performing volatile reads on the buffer and making sure that it has the expected value) in each task shows that CPU work execution is quite a bit more efficient in the async version.
- Strangely enough, the gap between the two versions kept increasing indefinitely as I increased the CPU load until the benchmark became unbearably long-running (without touching the iteration count at least, it's still possible to go larger by tuning it down of course).
- For 65536 volatile reads + tests on each of the 512 tasks, I get 11ms/iteration in the async version (which is pretty much optimal if you consider that this CPU has a sequential turbo frequency of 3.6 GHz and the execution time without the busy loop was ~2ms/iteration) whereas the threaded version spends 16ms/iteration.
- I'm not sure why this is happening, this is certainly not what I expected. I thought the gap would widen, then narrow again as the context-switching overhead of threads would amortize. Maybe this is somehow linked to oversubscription of the kernel scheduler, or maybe I just didn't take the experiment far enough?
- Once enough CPU work is added, the tokio rt-threaded implementation actually starts parallelizing decently well, though not as well as the pure threaded version: with 3.7ms/iteration for the tokio version and 4.7ms/iteration for the threaded version, the gap between the two versions has narrowed a bit. I suspect that is because the tokio scheduler either 1/needs more optimization work or 2/should use more of my system threads (which it doesn't, most likely due to defiance against hyperthreading's lack of reliability, which as we've seen is justified on IO- or syscall-bound workloads).
The only significant advantage and the whole point is memory usage as OP's benchmark shown, since every thread requires at least an 8-16KB kernel-mode stack and a 4KB-2MB user-mode stack, plus other memory allocations in both kernel and user space, whereas a future can typically just be around 100 bytes.
CPU performance should be a similar order of magnitude, in fact blocking I/O might be faster in some cases.
I/O performance should of course be the same assuming the same number of I/O requests are in-flight (however, if the maximum number of threads is not enough to saturate I/O, then of course async will be massively faster).
Exactly. As noted above async is for waiting, whilst doing nothing, on many things in the space you have available. Not for processing a lot of compute work.
I am yet to be convinced that async is what one should be doing unless one is handling thousands of net connections and such like.
One clarification I didn't see anyone mention yet is that asynchronicity in general is often motivated by responsiveness. Especially when the program in question expects to get a lot of tasks of varying sizes, and is expected to respond to small tasks quickly even when it has ongoing large tasks (GUIs and web servers are often like this). When the performance goal is to maximize responsiveness over throughput / total execution time, the more lightweight, user-space mechanisms are often the best choices simply because they allow you to break up those large tasks into far more tiny, interruptable chunks for the same overhead.
Unfortunately I'm not good enough at benchmarking to figure out how to meaningfully demonstrate this, but no one else mentioned it.
In contexts where throughput is the only thing that matters, AFAIK the rest of the thread hit the nail on the head: