When/why is async good for performance?

I'm definitely not sure I know what I'm doing here, but I put together a pair of programs to compare the cost of a kernel context switch with a Rust async task switch, both driven by readiness of a kernel resource. Each program spawns a long chain of tasks (kernel threads in one, Tokio tasks in the other) connected by pipes (actually UnixStream socket pairs), drops a byte in one end, and measures how long it takes to come out the other end.
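For concreteness, each link in the async chain boils down to a loop like the following (a minimal sketch of the idea rather than the actual benchmark code; it assumes Tokio's UnixStream):

    use tokio::io::{AsyncReadExt, AsyncWriteExt};
    use tokio::net::UnixStream;

    // One link in the chain: wait for a byte from the previous task,
    // then forward it to the next.
    async fn stage(mut upstream: UnixStream, mut downstream: UnixStream) -> std::io::Result<()> {
        let mut buf = [0u8; 1];
        loop {
            upstream.read_exact(&mut buf).await?;
            downstream.write_all(&buf).await?;
        }
    }

The threaded version is the same loop with std::os::unix::net::UnixStream and blocking reads and writes.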

I'm seeing the async version about 30% faster than the thread-based version, and using about 1/20th the memory per task. The repository README.md has more details.

This is a microbenchmark, and I'd be delighted to learn about its limitations. But the explanations I've encountered for why async designs achieve good performance seem to rest on the idea that user-space context switches (from one future to another) are vastly more efficient than kernel context switches (from one thread to another). They are indeed faster (30% is nothing to sneeze at), but not overwhelmingly so in this particular setup.

Are there better ways to tease out exactly where async's performance advantage comes from?

6 Likes

Let me copy those results here for context:

     Running `/home/jimb/rust/async/context-switch/target/release/async-brigade`
10000 iterations, 500 tasks, mean 1.858ms per iteration, stddev 129.544µs
12.49user 8.39system 0:19.37elapsed 107%CPU (0avgtext+0avgdata 152832maxresident)k

     Running `/home/jimb/rust/async/context-switch/target/release/thread-brigade`
10000 iterations, 500 tasks, mean 2.763ms per iteration, stddev 537.584µs
10.16user 27.47system 0:28.26elapsed 133%CPU (0avgtext+0avgdata 133520maxresident)k
0inputs+5568outputs (13major+23830minor)pagefaults 0swaps

Here are some interesting things that I notice in them:

  • Overall, you are very much bound by syscall performance, with ~43% of the time spent in the kernel (system time) even in the async version. That's not surprising given how the benchmark works, but I'm not sure it's representative of what actual applications do.
    • Further, system time accounting is not an exact science, and a system-wide sampling profiler like perf, which can pinpoint time spent in the kernel more precisely by tracking where the instruction pointer is, suggests that the async version actually spends 72% of its time in the kernel.
    • If your system can run perf, you can check this yourself by passing -e cycles:ppp to perf record so that it tracks time spent in the kernel, then asking for a summary of time spent in each DSO involved (program, libc, kernel...) with the --sort=dso option to perf report (see the example commands after this list).
    • If you're bound by syscall performance, it makes sense that the async version is not terribly impressive, since it performs more syscalls than the threaded version (epoll syscalls in addition to the read and write syscalls that both versions perform).
  • The threaded version clearly uses multiple CPU cores (notice the 133% CPU in the time output), while the async version barely does, so in that sense the former has access to more hardware resources. For a fair comparison, you will want to measure on a fully loaded CPU; the simplest way is to run one copy of the benchmark per core.
  • The threaded version spends 3.3x as much time in the kernel (per time's accounting; see the caveat above). That's bad news, because it makes you more reliant on kernel performance, which, unlike your own application's, you have little leverage over. Production systems tend to run very old kernels that are not well-tuned for modern hardware.
  • Please publish your results with many tasks as well; even if the wall-clock timings don't change much, other measurements may.
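For reference, the perf check suggested above would look something like this (binary paths assumed to match the runs above; perf may need root):

> perf record -e cycles:ppp ./target/release/async-brigade
> perf report --sort=dso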
6 Likes

I just tried my own suggestion of filling up all the CPU hyper-threads with N copies of the benchmark, and interestingly enough, the two implementations become equally fast in this scenario:

> /usr/bin/time -v parallel ::: ./target/release/async-brigade ./target/release/async-brigade ./target/release/async-brigade ./target/release/async-brigade
10000 iterations, 500 tasks, mean 3.777ms per iteration, stddev 47.714µs
10000 iterations, 500 tasks, mean 3.802ms per iteration, stddev 380.449µs
10000 iterations, 500 tasks, mean 3.837ms per iteration, stddev 309.074µs
10000 iterations, 500 tasks, mean 3.854ms per iteration, stddev 316.673µs
        Command being timed: "parallel ::: ./target/release/async-brigade ./target/release/async-brigade ./target/release/async-brigade ./target/release/async-brigade"
        User time (seconds): 65.88
        System time (seconds): 87.49
        Percent of CPU this job got: 392%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:39.05
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 18364
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 11065
        Voluntary context switches: 148
        Involuntary context switches: 44543
        Swaps: 0
        File system inputs: 0
        File system outputs: 3496
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

> /usr/bin/time -v parallel ::: ./target/release/thread-brigade ./target/release/thread-brigade ./target/release/thread-brigade ./target/release/thread-brigade
10000 iterations, 500 tasks, mean 3.972ms per iteration, stddev 242.167µs
10000 iterations, 500 tasks, mean 3.976ms per iteration, stddev 227.694µs
10000 iterations, 500 tasks, mean 3.984ms per iteration, stddev 226.737µs
10000 iterations, 500 tasks, mean 4.006ms per iteration, stddev 314.698µs
        Command being timed: "parallel ::: ./target/release/thread-brigade ./target/release/thread-brigade ./target/release/thread-brigade ./target/release/thread-brigade"
        User time (seconds): 27.47
        System time (seconds): 130.81
        Percent of CPU this job got: 389%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:40.67
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 18372
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 15618
        Voluntary context switches: 20242601
        Involuntary context switches: 197075
        Swaps: 0
        File system inputs: 0
        File system outputs: 3512
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

Then again, hyperthreading is known to have weird performance characteristics, so maybe I should restrict myself to one process per physical CPU core?

> /usr/bin/time -v parallel ::: ./target/release/async-brigade ./target/release/async-brigade && /usr/bin/time -v parallel ::: ./target/release/thread-brigade ./target/release/thread-brigade
10000 iterations, 500 tasks, mean 2.337ms per iteration, stddev 116.532µs
10000 iterations, 500 tasks, mean 2.374ms per iteration, stddev 98.790µs
        Command being timed: "parallel ::: ./target/release/async-brigade ./target/release/async-brigade"
        User time (seconds): 21.77
        System time (seconds): 25.90
        Percent of CPU this job got: 197%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:24.10
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 18288
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 8294
        Voluntary context switches: 124
        Involuntary context switches: 12790
        Swaps: 0
        File system inputs: 0
        File system outputs: 2792
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
10000 iterations, 500 tasks, mean 3.283ms per iteration, stddev 186.371µs
10000 iterations, 500 tasks, mean 3.297ms per iteration, stddev 173.724µs
        Command being timed: "parallel ::: ./target/release/thread-brigade ./target/release/thread-brigade"
        User time (seconds): 16.28
        System time (seconds): 66.26
        Percent of CPU this job got: 246%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:33.53
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 18336
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 10564
        Voluntary context switches: 10121345
        Involuntary context switches: 84160
        Swaps: 0
        File system inputs: 0
        File system outputs: 3072
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

Here are sequential results on this system for reference; notice that the difference between async and threaded wasn't as large as you observed to begin with on this older i3 chip ("Intel(R) Core(TM) i3-3220 CPU @ 3.30GHz", to quote lscpu)...

> /usr/bin/time -v ./target/release/async-brigade && /usr/bin/time -v ./target/release/thread-brigade
10000 iterations, 500 tasks, mean 2.139ms per iteration, stddev 76.927µs
        Command being timed: "./target/release/async-brigade"
        User time (seconds): 10.56
        System time (seconds): 11.03
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:21.61
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 2272
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 165
        Voluntary context switches: 1
        Involuntary context switches: 5671
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
10000 iterations, 500 tasks, mean 2.453ms per iteration, stddev 116.208µs
        Command being timed: "./target/release/thread-brigade"
        User time (seconds): 6.98
        System time (seconds): 25.25
        Percent of CPU this job got: 129%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:24.89
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 17072
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 1304
        Voluntary context switches: 5060606
        Involuntary context switches: 3211
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

...so clearly, giving more CPU cores to the benchmark increases the async version's performance advantage over the threaded version on my machine, but not dramatically so, and hyperthreads are useless on this benchmark.

3 Likes

Perhaps it would be worth removing all that dependency on the kernel for passing data around, by building the chain in memory using channels?
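Something like this, to sketch the idea (hypothetical code assuming Tokio's mpsc channels in place of the UnixStream pairs, not tested against the real benchmark):

    use tokio::sync::mpsc;

    #[tokio::main]
    async fn main() {
        const TASKS: usize = 500;
        // head_tx feeds the first stage; rx always marks the current tail
        // of the chain, and each spawned task extends it by one link.
        let (head_tx, mut rx) = mpsc::channel::<u8>(1);
        for _ in 0..TASKS {
            let (tx, next_rx) = mpsc::channel::<u8>(1);
            let mut prev_rx = std::mem::replace(&mut rx, next_rx);
            tokio::spawn(async move {
                // Forward every byte from the previous stage to the next,
                // with no kernel involvement on the hot path.
                while let Some(byte) = prev_rx.recv().await {
                    if tx.send(byte).await.is_err() {
                        break;
                    }
                }
            });
        }
        // Drop a byte in one end and wait for it to come out the other.
        head_tx.send(1).await.unwrap();
        assert_eq!(rx.recv().await, Some(1));
    }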

What is the memory consumption when all this is going on?

My null hypothesis is that async architectures actually have no performance advantage: people who bother to write async code are also likely to be measuring their performance, and that's actually why their code is faster. I don't think that's true, but that places the burden on me to show why.

The difference in resident set sizes is remarkable, and I think that establishes one argument for the value of async architectures quite firmly: if each task doesn't need too much memory, but you're running so many tasks that you're still limited by per-task memory consumption, you're going to be able to do a lot more on a given piece of hardware with async.

The value for non-memory-bound applications is much less clear to me.

I haven't had luck finding literature on this. If anyone has pointers to good papers, please let me know.

4 Likes

It all starts with the "C10K" problem, as formulated by Dan Kegel in 1999: The C10K problem

It states the motivation in its first line:

It's time for web servers to handle ten thousand clients simultaneously, don't you think? After all, the web is a big place now.

Of course that was 1999; today we might expect to handle a million connections per server.

It was clear that the old Unix process "fork" model, as used by early Apache, was not going to do it. Processes are too heavyweight, and it turns out threads are pretty heavyweight as well.

What is needed is to maintain the state of many TCP/IP connections, which is quite small, while being able to wait on them without all the overhead of processes and threads, especially the memory overhead. That leads us to a single-threaded, event-driven programming model.

A famous early implementation of this was Ryan Dahl's creation of Node.js. It's worth watching the first ever Node.js presentation to get the full motivation:

Ryan Dahl: Original Node.js presentation:

Node.js has an event loop, as that is built into the very being of JavaScript in the browser. It handles socket connections, and all other external interfacing, with events that run functions as and when they happen, all done by setting callbacks on those events.

Since then, people decided that all that callback stuff was too complicated to write and wanted something that looked more like normal code. Hence the arrival of async/await in Python, Rust, and even C++, I believe.

Personally, so far I feel Node.js with its event loop and callbacks is the simplest implementation. It's just so natural and easy to use. Meanwhile, async/await everywhere else is a complex, kludgy mess that is almost impossible to reason about.

As far as I can tell, 99% of code does not need async/await, and threads are a perfectly fine and simpler approach. As the Rust book says in the introduction to its async chapter:

It's important to remember that traditional threaded applications can be quite effective, and that Rust's small memory footprint and predictability mean that you can get far without ever using async. The increased complexity of the asynchronous programming model isn't always worth it, and it's important to consider whether your application would be better served by using a simpler threaded model.

It's worth going to YouTube and searching for "rust lang async". There are a number of presentations there discussing the motivation for async. For example:

Steve Klabnik's "Rust's Journey to Async/Await":

13 Likes

I would suggest adding a controllable chunk of CPU busywork to your bucket-passing benchmark, in between the receive and the send, to see what happens. Maybe pass around a little chunk of data and turn the worker algorithm into "fetch data, do a few iterations of some algorithm, and send the output to the next peer", as sketched below.
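For instance, the threaded worker could become something like this (a sketch; WORK is a made-up tuning knob, and std::hint::black_box keeps the busywork from being optimized away):

    use std::io::{Read, Write};
    use std::os::unix::net::UnixStream;

    const WORK: u64 = 1_000; // tunable amount of CPU work per hop

    fn worker(mut upstream: UnixStream, mut downstream: UnixStream) -> std::io::Result<()> {
        let mut buf = [0u8; 1];
        loop {
            upstream.read_exact(&mut buf)?;
            // Controllable chunk of CPU busywork between receive and send.
            let mut acc = buf[0] as u64;
            for i in 0..WORK {
                acc = std::hint::black_box(acc.wrapping_mul(6364136223846793005).wrapping_add(i));
            }
            buf[0] = acc as u8;
            downstream.write_all(&buf)?;
        }
    }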

Intuitively, I would expect the heavily threaded version to possibly suffer from heavier CPU state thrashing (pipeline flushes, etc.) at the point where it resumes from the blocking read, then recover to a well-filled pipeline as execution goes on.

So the worst-case situation for the threaded version would be waking up, doing a little bit of CPU work, and falling back asleep.

But I never measured this effect quantitatively, so I don't know 1/ how bad it is and 2/ how much CPU work in between context switches it takes to amortize it.

Also, note that this design will allow the threaded version to parallelize better, so you'll really want to compare with a parallel run of the async version.

I think that would be an interesting experiment for you to carry out, and I would be very interested in hearing about the results.

Keep in mind that regular threads aren't all that portable: they definitely won't work on e.g. WebAssembly (which was a nasty surprise for me, anyway).
Since Futures and JS Promises can work together seamlessly, for people who like or need their code to be easily portable, using Futures for concurrency is starting to look like a superior alternative in general, and the fact that Futures are lower-overhead than regular threads only seals that fate.

3 Likes

That is an interesting point.

I have always insisted that all the projects I work on be as portable as possible. Typically that's code for embedded systems, where it's great to be able to develop, run, and test the code on Windows, Mac, and Linux as well as on the intended target. Besides keeping team members happy with their preferred environments, it also seems to shake out a lot of potential bugs. But the main point is to be proof against future hardware and operating-system changes.

Recently I started to ponder the idea of using WASM in some embedded systems. It makes a great sandbox, and it makes for easy application upgrades. It would be very portable, as you say, given that there is a WASM runtime for your target, of course.

Maybe I'll have to review my thinking on async...

1 Like

These are the results of simultaneous execution of both on my Mac:

/Users/jony/.cargo/bin/cargo run --color=always --bin async-brigade --release
Finished release [optimized] target(s) in 0.03s
Running target/release/async-brigade
10000 iterations, 500 tasks, mean 2.417ms per iteration, stddev 272.143µs

/Users/jony/.cargo/bin/cargo run --color=always --bin thread-brigade --release
Finished release [optimized] target(s) in 0.04s
Running target/release/thread-brigade
10000 iterations, 500 tasks, mean 2.143ms per iteration, stddev 205.473µs

It's somewhat what I'd expect, and I was a bit surprised by your results =)

Yes, thanks for posting this talk, I came here to do so :slight_smile:

@jimb it's not exactly short, but if you watch it, I'm happy to answer questions in this thread. What I would say right here is basically what I said in the talk.

1 Like

Thanks for the careful reply. I think my question wasn't clear. I'm familiar with the history of async programming (the C10K manifesto; Erlang; node.js; async in JS; Rust's journey from green threads to Futures 0.1/0.2 to async/await).

But almost all of that exposition has left me feeling unsatisfied and uncomfortable. It's asserted over and over that threads are too heavy, and yet (with one exception) every bit of overhead in threads has an analogue in async:

  • A thread context switch must swap register contents. But async code must return Poll::Pending from some stack of calls - restoring saved registers along the way, notice - and then poll the next future, which performs a bunch of calls to get to the point of re-trying the poll call that blocked it previously. (A toy illustration of this unwinding appears after this list.)

  • A thread gets suspended on some kernel queue, and then moved back to the run queue. But async code must perform a system call which fails with EAGAIN, return to the executor, call epoll to get the next event, and move the future back into a run queue. (I/O-driven tasks are the stated use case for async code, so I'm assuming it's system calls that are blocking.)

  • A thread context switch involves a transition into the kernel and back, switching address spaces. But async code must perform two system calls, as above, that a blocking thread does not, and each of those system calls is a transition into the kernel and back.
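To make the first bullet concrete, here is roughly the user-space half of an async "context switch": a hand-written future (a toy illustration, not what the compiler or Tokio actually generates) that stashes a waker and unwinds back to the executor with Poll::Pending:

    use std::future::Future;
    use std::pin::Pin;
    use std::sync::{Arc, Mutex};
    use std::task::{Context, Poll, Waker};

    // Shared readiness flag, standing in for "the socket became readable".
    #[derive(Default)]
    struct Shared {
        ready: bool,
        waker: Option<Waker>,
    }

    struct WaitReady(Arc<Mutex<Shared>>);

    impl Future for WaitReady {
        type Output = ();
        fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
            let mut shared = self.0.lock().unwrap();
            if shared.ready {
                Poll::Ready(())
            } else {
                // Record where to resume, then return through the whole
                // call stack to the executor, the analogue of parking a
                // thread on a kernel queue.
                shared.waker = Some(cx.waker().clone());
                Poll::Pending
            }
        }
    }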

I've actually had people get annoyed with me when I draw these comparisons, because they feel it's obvious that the kernel's versions are heavier. But if that were so, and not just marginally but decisively, then we should be able to construct a microbenchmark that shows it. Hence my post, which I think shows that the overhead of a task switch is comparable in both approaches.

Where async does have a clear advantage is in memory consumption. In C, C++, and Rust, we usually cope with the inability to segment or relocate our stacks by allocating so much space that we don't need to worry about it. But an asynchronous function has space for all the frames that will ever need to be suspended in the future it returns, sized perfectly by the compiler. It's true that the kernel only allocates physical pages of memory to a thread's stack once the program touches them, so the allocated size is not the used size. But the microbenchmark's resident set sizes take this into account, and indeed, the resident set size of the async version is an order of magnitude smaller.

A principled case for async programming must make clear what advantage is expected. People suggest that I/O-bound applications benefit, but I think that's untrue. If an application is I/O-bound but not memory-bound, I predict it will see no advantage from an async implementation. It is specifically memory-bound applications whose per-task memory use is small but which run many tasks, that benefit.

12 Likes

I have watched the video; it's great background. But I think it contains the misapprehension that I described: that I/O somehow benefits from async implementations. An application that is I/O-bound but not per-task memory-bound will not benefit from Rust's async features, at least as far as I understand the situation now.

I wish I could find a link now, but the whole threads-vs-async thing was summed up by somebody who basically said something like:

Threads are good when you have work to do. Async is good when you have a lot of waiting to do.

Basically, if you have compute work that can be parallelized, then threads enable it to be distributed over the many cores you have.

On the other hand, if you are waiting on a million HTTP connections, database connections, etc., then async is the way to do it.

So there we have it: Do you want to work or wait?

3 Likes

Just to make sure we are on the same page re. syscall overhead, note that typical users of asynchronous I/O (e.g. a web server with lots of concurrent connections) will perform fewer epoll syscalls than read and write syscalls, because each epoll syscall returns a list of ready connections, not the identifier of a single ready connection.

So even though, as @ZiCog points out, async is all about waiting for multiple things at the same time, the optimal scenario from a syscall overhead point of view is actually when you learn about multiple ready clients every time you "wait". Which, if I understand epoll's design, means that you don't actually wait, you just fetch a list of ready clients.
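As an illustration of that batching, here is a minimal sketch using the mio crate (which wraps epoll on Linux); it is not taken from the benchmark:

    use mio::net::UnixStream;
    use mio::{Events, Interest, Poll, Token};
    use std::io::Write;

    fn main() -> std::io::Result<()> {
        let mut poll = Poll::new()?;
        let mut events = Events::with_capacity(64);
        let (mut readers, mut writers) = (Vec::new(), Vec::new());
        for i in 0..8 {
            let (mut rd, wr) = UnixStream::pair()?;
            poll.registry()
                .register(&mut rd, Token(i), Interest::READABLE)?;
            readers.push(rd);
            writers.push(wr);
        }
        // Make several connections ready at once...
        for wr in &mut writers {
            wr.write_all(&[1])?; // one byte won't block on a fresh pair
        }
        // ...and a single epoll_wait reports all of them together.
        poll.poll(&mut events, None)?;
        println!("one poll syscall returned {} ready sockets", events.iter().count());
        Ok(())
    }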

To measure this effect in your benchmark, you could replace your single bucket brigade with N concurrent bucket brigades with N times fewer workers in each chain, and study how the overheads change.


If you wanted to optimize syscall overhead even more, you'd need to forgo the readiness-based asynchronous I/O design of Linux' epoll, and go for a completion-based design like Windows' IOCP or Linux' io_uring. Then you don't need to do a redundant read/write syscall upon getting an async I/O notification, as the notification tells you that the corresponding read or write has already been performed for you :wink:

However, completion-based designs also bring higher buffer-management complexity, since you need to keep application-side buffers alive as long as the I/O is occurring. And I suspect that naive use of them will also increase RAM usage, due to bufferbloat from all the data sitting somewhere waiting to be read/written. Avoiding this may require manually splitting I/O requests into chunks, which read/write syscalls would do automatically for you in readiness-based designs.

TL;DR as far as I can see, epoll-style readiness-based async I/O optimizes ergonomics for scenarios where RAM is the scarce resource and syscalls are considered cheap.


EDIT: Oh, and a few more minor things:

A thread context switch must swap register contents. But async code must return Poll::Pending from some stack of calls - restoring saved registers along the way, notice - and then poll the next future, which performs a bunch of calls to get to the point of re-trying the poll call that blocked it previously.

Note that in this case, the async runtime only saves and restores the subset of CPU registers that were actually clobbered by function calls (and that is when the function calls were not inlined altogether), unlike kernel context switches, where the kernel saves and restores state the application may never touch, like SIMD registers, because it doesn't actually know what the application is doing.

So as long as you don't have a deep chain of function calls with complex code and poor inlining, the async case should be able to achieve lower overhead here. This is analogous to how coroutines can be less RAM-hungry than threads because they leverage compiler knowledge of the application for space optimization.

In a similar way, user-mode scheduling queues like those of future executors should be able to achieve lower overhead than kernel scheduling queues because their scheduling logic is much simpler (e.g. almost no prioritization beyond FIFO fairness).
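For illustration, here is the core of a minimal FIFO executor in the style of the futures book (using the futures crate's ArcWake; a toy, not Tokio's actual scheduler):

    use futures::task::{waker_ref, ArcWake};
    use std::collections::VecDeque;
    use std::future::Future;
    use std::pin::Pin;
    use std::sync::{Arc, Mutex};
    use std::task::Context;

    type Queue = Arc<Mutex<VecDeque<Arc<Task>>>>;

    struct Task {
        future: Mutex<Pin<Box<dyn Future<Output = ()> + Send>>>,
        queue: Queue,
    }

    impl ArcWake for Task {
        fn wake_by_ref(arc_self: &Arc<Self>) {
            // "Scheduling" is just pushing the task onto a FIFO queue.
            arc_self.queue.lock().unwrap().push_back(arc_self.clone());
        }
    }

    fn run(queue: &Queue) {
        loop {
            // Take the next task, holding the queue lock only briefly.
            let next = queue.lock().unwrap().pop_front();
            let task = match next {
                Some(task) => task,
                None => return,
            };
            // Poll it; a Pending task re-enqueues itself via its waker
            // once it becomes ready again.
            let waker = waker_ref(&task);
            let mut cx = Context::from_waker(&waker);
            let _ = task.future.lock().unwrap().as_mut().poll(&mut cx);
        }
    }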

Now, I would not expect these effects to play that big of a role as long as the async benchmark is purely syscall-bound, hence my suggestions above to try 1/amortizing syscall overhead by leveraging epoll's notification batching ability and 2/adding a bit more CPU work on the application side, which I believe to be more realistic than the current "wake up and immediately fall back asleep" workload.

6 Likes

And now, for something completely effed-up...

I just ran the benchmarks on a system which is equivalent to the one that I used previously at the software level (same kernel version, same Rust version, etc), but has a very different CPU (Intel(R) Xeon(R) CPU E5-1620 v3 @ 3.50GHz) which has...

  • A slightly newer micro-arch (Haswell instead of Ivy Bridge)
  • A slightly higher clock rate (3.5 GHz instead of 3.3 GHz)
  • 2x as many cores (that's what I was curious about, actually)
  • 3x as much L3 cache
  • 2x as many memory channels and 2.6x as much bandwidth (the difference boils down to DDR4 vs DDR3 and a higher supported DRAM clock)

And the result was...

  1. A much bigger difference between the async version and the threaded version in sequential mode (with the former now being 2.6x faster and having 10x less timing spread)
  2. Much less of a difference between sequential and parallel runs, with both versions scaling near-perfectly until the physical core count is reached and gaining mostly no benefit from HT except that...
  3. The threaded version works significantly better in the oversubscribed HT regime on this machine.
hadrien@linux-2ak3:~/Bureau/context-switch/target/release> cargo +nightly build --release && /usr/bin/time -v ./async-brigade && /usr/bin/time -v ./thread-brigade && /usr/bin/time -v parallel ::: ./async-brigade ./async-brigade && /usr/bin/time -v parallel ::: ./thread-brigade ./thread-brigade && /usr/bin/time -v parallel ::: ./async-brigade ./async-brigade ./async-brigade ./async-brigade && /usr/bin/time -v parallel ::: ./thread-brigade ./thread-brigade ./thread-brigade ./thread-brigade && /usr/bin/time -v parallel ::: ./async-brigade ./async-brigade ./async-brigade ./async-brigade ./async-brigade ./async-brigade ./async-brigade ./async-brigade && /usr/bin/time -v parallel ::: ./thread-brigade ./thread-brigade ./thread-brigade ./thread-brigade ./thread-brigade ./thread-brigade ./thread-brigade ./thread-brigade
    Finished release [optimized] target(s) in 0.01s
10000 iterations, 500 tasks, mean 1.855ms per iteration, stddev 36.652µs
        Command being timed: "./async-brigade"
        User time (seconds): 10.59
        System time (seconds): 8.15
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:18.75
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 2352
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 168
        Voluntary context switches: 1
        Involuntary context switches: 4690
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
10000 iterations, 500 tasks, mean 4.866ms per iteration, stddev 301.105µs
        Command being timed: "./thread-brigade"
        User time (seconds): 15.25
        System time (seconds): 49.48
        Percent of CPU this job got: 131%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:49.29
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 13368
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 1363
        Voluntary context switches: 5060606
        Involuntary context switches: 1572
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
10000 iterations, 500 tasks, mean 1.847ms per iteration, stddev 46.360µs
10000 iterations, 500 tasks, mean 1.872ms per iteration, stddev 79.085µs
        Command being timed: "parallel ::: ./async-brigade ./async-brigade"
        User time (seconds): 21.71
        System time (seconds): 16.45
        Percent of CPU this job got: 190%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:20.03
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 23308
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 10
        Minor (reclaiming a frame) page faults: 31809
        Voluntary context switches: 369
        Involuntary context switches: 9703
        Swaps: 0
        File system inputs: 9184
        File system outputs: 2624
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
10000 iterations, 500 tasks, mean 4.847ms per iteration, stddev 247.975µs
10000 iterations, 500 tasks, mean 4.858ms per iteration, stddev 268.127µs
        Command being timed: "parallel ::: ./thread-brigade ./thread-brigade"
        User time (seconds): 26.70
        System time (seconds): 97.83
        Percent of CPU this job got: 252%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:49.30
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 18320
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 12037
        Voluntary context switches: 10121370
        Involuntary context switches: 3533
        Swaps: 0
        File system inputs: 0
        File system outputs: 3400
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
10000 iterations, 500 tasks, mean 1.930ms per iteration, stddev 86.004µs
10000 iterations, 500 tasks, mean 1.958ms per iteration, stddev 121.516µs
10000 iterations, 500 tasks, mean 1.964ms per iteration, stddev 197.596µs
10000 iterations, 500 tasks, mean 2.010ms per iteration, stddev 199.950µs
        Command being timed: "parallel ::: ./async-brigade ./async-brigade ./async-brigade ./async-brigade"
        User time (seconds): 43.46
        System time (seconds): 36.37
        Percent of CPU this job got: 388%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:20.56
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 18320
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 14000
        Voluntary context switches: 156
        Involuntary context switches: 21729
        Swaps: 0
        File system inputs: 0
        File system outputs: 2952
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
10000 iterations, 500 tasks, mean 4.647ms per iteration, stddev 162.274µs
10000 iterations, 500 tasks, mean 4.655ms per iteration, stddev 163.901µs
10000 iterations, 500 tasks, mean 4.661ms per iteration, stddev 168.954µs
10000 iterations, 500 tasks, mean 4.660ms per iteration, stddev 173.542µs
        Command being timed: "parallel ::: ./thread-brigade ./thread-brigade ./thread-brigade ./thread-brigade"
        User time (seconds): 39.11
        System time (seconds): 174.65
        Percent of CPU this job got: 451%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:47.34
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 18300
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 18716
        Voluntary context switches: 20242611
        Involuntary context switches: 5957
        Swaps: 0
        File system inputs: 0
        File system outputs: 3560
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
10000 iterations, 500 tasks, mean 3.476ms per iteration, stddev 50.389µs
10000 iterations, 500 tasks, mean 3.511ms per iteration, stddev 189.990µs
10000 iterations, 500 tasks, mean 3.526ms per iteration, stddev 324.798µs
10000 iterations, 500 tasks, mean 3.603ms per iteration, stddev 113.295µs
10000 iterations, 500 tasks, mean 3.604ms per iteration, stddev 572.686µs
10000 iterations, 500 tasks, mean 3.612ms per iteration, stddev 137.070µs
10000 iterations, 500 tasks, mean 3.620ms per iteration, stddev 292.860µs
10000 iterations, 500 tasks, mean 3.666ms per iteration, stddev 570.292µs
        Command being timed: "parallel ::: ./async-brigade ./async-brigade ./async-brigade ./async-brigade ./async-brigade ./async-brigade ./async-brigade ./async-brigade"
        User time (seconds): 130.36
        System time (seconds): 155.92
        Percent of CPU this job got: 770%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:37.18
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 18408
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 22630
        Voluntary context switches: 249
        Involuntary context switches: 89419
        Swaps: 0
        File system inputs: 0
        File system outputs: 4128
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
10000 iterations, 500 tasks, mean 3.713ms per iteration, stddev 229.816µs
10000 iterations, 500 tasks, mean 3.726ms per iteration, stddev 224.892µs
10000 iterations, 500 tasks, mean 3.738ms per iteration, stddev 219.007µs
10000 iterations, 500 tasks, mean 3.742ms per iteration, stddev 256.988µs
10000 iterations, 500 tasks, mean 3.748ms per iteration, stddev 257.293µs
10000 iterations, 500 tasks, mean 3.749ms per iteration, stddev 229.886µs
10000 iterations, 500 tasks, mean 3.751ms per iteration, stddev 225.626µs
10000 iterations, 500 tasks, mean 3.760ms per iteration, stddev 270.486µs
        Command being timed: "parallel ::: ./thread-brigade ./thread-brigade ./thread-brigade ./thread-brigade ./thread-brigade ./thread-brigade ./thread-brigade ./thread-brigade"
        User time (seconds): 57.31
        System time (seconds): 239.67
        Percent of CPU this job got: 775%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:38.28
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 18324
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 32009
        Voluntary context switches: 40483621
        Involuntary context switches: 53429
        Swaps: 0
        File system inputs: 0
        File system outputs: 3688
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

Since the serial execution timings of the async version don't change very much wrt my older i3, and only the threaded version is suffering a lot, this is probably not caused by differences in microarchitecture, cache capacity, or memory bus between these two systems (as I can't think of a reason why these differences would only affect the threaded version, let alone in a highly negative way).

Therefore, by elimination, core count seems like the only variable that varies enough to qualify as a possible explanation for this surprising behavior. This would also be consistent with the number of concurrent processes significantly affecting the results.

My insufficiently tested hypothesis is that in "serial" runs of the threaded version, the kernel might try to spread the threads across physical CPU cores for parallelism, which would be detrimental for this benchmark because we're just increasing inter-core communication for no actual parallel speedup benefit.

In the oversubscribed regime, however, the kernel would be content with how the CPU cores are filled up and would refrain from migrating threads away from the CPU core on which they were created, thereby avoiding this detrimental cross-core communication effect.

This hypothesis could be tested by tracing kernel scheduler task migration events, but my perf-fu is not strong enough for that.
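(Presumably something like

> perf stat -e 'sched:sched_migrate_task' ./thread-brigade

would give a first-order count of migrations, assuming the kernel exposes that scheduler tracepoint and perf has permission to read it, but I haven't verified this.)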

Another way to test this would be to run this benchmark on a system where useless parallelism is even more costly (e.g. multi-socket system, or one of those high-end AMD chips before Zen 2 which had multiple NUMA domains per chip).


EDIT: Aaand... I just found a third way that's easy for me to test: just force the threaded version to execute on a single CPU core with taskset, and suddenly it runs much faster, even faster than the async version:

hadrien@linux-2ak3:~/Bureau/context-switch/target/release> cargo +nightly build --release && /usr/bin/time -v ./async-brigade && taskset -c 0 /usr/bin/time -v ./thread-brigade
    Finished release [optimized + debuginfo] target(s) in 0.01s
10000 iterations, 500 tasks, mean 1.817ms per iteration, stddev 71.822µs
        Command being timed: "./async-brigade"
        User time (seconds): 10.29
        System time (seconds): 8.08
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:18.38
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 2284
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 163
        Voluntary context switches: 1
        Involuntary context switches: 4599
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
10000 iterations, 500 tasks, mean 1405.914µs per iteration, stddev 42.745µs
        Command being timed: "./thread-brigade"
        User time (seconds): 3.82
        System time (seconds): 10.42
        Percent of CPU this job got: 98%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:14.39
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 16456
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 1584
        Voluntary context switches: 5059200
        Involuntary context switches: 13563
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

Conclusion: the reason why large numbers of I/O threads perform so badly on this system, and will likely perform even worse on systems with higher core counts or more complicated core interconnect topologies, is that threads needlessly conflate concurrency and parallelism.

A purely I/O-oriented OS threading abstraction like Windows NT fibers (EDIT: that was assuming that Windows fibers were well-integrated enough in the OS to yield to the user-mode scheduler on blocking system calls; after further research, it turns out they aren't) should be able to perform much better here.

10 Likes

Relevant talk: User-level threads... with threads - Paul Turner. tl;dr: when your threads communicate with each other (with synchronization involving futex) and one thread wakes another, the kernel scheduler doesn't have enough information to keep both threads on a single core, resulting in excessive IPIs (inter-processor interrupts). Also relevant because it debunks the common assumption that syscalls are inherently expensive.

1 Like

Okay, I ran this under sudo perf record -e cycles:ppp (I think that's the same as cycles:P?), and then did sudo perf report --sort=dso.

The async-brigade spent 78% (17s) of its time in the kernel, and 22% (5s) in the executable.

The thread-brigade spent 95% (40s) of its time in the kernel, and 5% (2s) in the executable.

The remaining time was in system libraries, in each case.

1 Like

Yes, I think [edit] we are on the same page - I agree my three-to-one case is not typical. What I should say is that a blocking async call is two syscalls, compared to one in the threaded case. The epoll calls are hopefully returning multiple events and getting amortized across more than one async context switch.

1 Like

Yes, that's definitely a tradeoff of the readiness-based async I/O model: you need two syscalls instead of one for a blocking operation, one to realize that you need to wait and another to actually do the work once the wait is over.

Completion-based async I/O like io_uring should not be subject to this particular tradeoff, but io_uring itself does not seem tightly tuned for performance yet, and completion-based I/O is not yet well integrated into tokio et al. (although there are ongoing efforts to improve the situation), so I'm not sure the ecosystem is ready for an epoll vs io_uring perf-characteristics comparison.