And now, for something completely effed-up...
I just ran the benchmarks on a system which is equivalent, at the software level, to the one I used previously (same kernel version, same Rust version, etc.), but has a very different CPU (Intel(R) Xeon(R) CPU E5-1620 v3 @ 3.50GHz) which has...
- A slightly newer micro-arch (Haswell instead of Ivy Bridge)
- A slightly higher clock rate (3.5 GHz instead of 3.3 GHz)
- 2x as many cores (that's what I was curious about, actually)
- 3x as much L3 cache
- 2x as many memory channels and 2.6x as much bandwidth (the difference boils down to DDR4 vs DDR3 and a higher supported DRAM clock)
And the result was...
- A much bigger difference between the async version and the threaded version in sequential mode (with the former now being 2.6x faster and having 10x less timing spread)
- Much less of a difference between sequential and parallel runs, with both versions scaling near-perfectly until the physical core count is reached and gaining essentially no benefit from HT, except that...
- The threaded version works significantly better in the oversubscribed HT regime on this machine.
hadrien@linux-2ak3:~/Bureau/context-switch/target/release> cargo +nightly build --release && /usr/bin/time -v ./async-brigade && /usr/bin/time -v ./thread-brigade && /usr/bin/time -v parallel ::: ./async-brigade ./async-brigade && /usr/bin/time -v parallel ::: ./thread-brigade ./thread-brigade && /usr/bin/time -v parallel ::: ./async-brigade ./async-brigade ./async-brigade ./async-brigade && /usr/bin/time -v parallel ::: ./thread-brigade ./thread-brigade ./thread-brigade ./thread-brigade && /usr/bin/time -v parallel ::: ./async-brigade ./async-brigade ./async-brigade ./async-brigade ./async-brigade ./async-brigade ./async-brigade ./async-brigade && /usr/bin/time -v parallel ::: ./thread-brigade ./thread-brigade ./thread-brigade ./thread-brigade ./thread-brigade ./thread-brigade ./thread-brigade ./thread-brigade
Finished release [optimized] target(s) in 0.01s
10000 iterations, 500 tasks, mean 1.855ms per iteration, stddev 36.652µs
Command being timed: "./async-brigade"
User time (seconds): 10.59
System time (seconds): 8.15
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:18.75
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 2352
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 168
Voluntary context switches: 1
Involuntary context switches: 4690
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
10000 iterations, 500 tasks, mean 4.866ms per iteration, stddev 301.105µs
Command being timed: "./thread-brigade"
User time (seconds): 15.25
System time (seconds): 49.48
Percent of CPU this job got: 131%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:49.29
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 13368
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 1363
Voluntary context switches: 5060606
Involuntary context switches: 1572
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
10000 iterations, 500 tasks, mean 1.847ms per iteration, stddev 46.360µs
10000 iterations, 500 tasks, mean 1.872ms per iteration, stddev 79.085µs
Command being timed: "parallel ::: ./async-brigade ./async-brigade"
User time (seconds): 21.71
System time (seconds): 16.45
Percent of CPU this job got: 190%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:20.03
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 23308
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 10
Minor (reclaiming a frame) page faults: 31809
Voluntary context switches: 369
Involuntary context switches: 9703
Swaps: 0
File system inputs: 9184
File system outputs: 2624
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
10000 iterations, 500 tasks, mean 4.847ms per iteration, stddev 247.975µs
10000 iterations, 500 tasks, mean 4.858ms per iteration, stddev 268.127µs
Command being timed: "parallel ::: ./thread-brigade ./thread-brigade"
User time (seconds): 26.70
System time (seconds): 97.83
Percent of CPU this job got: 252%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:49.30
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 18320
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 12037
Voluntary context switches: 10121370
Involuntary context switches: 3533
Swaps: 0
File system inputs: 0
File system outputs: 3400
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
10000 iterations, 500 tasks, mean 1.930ms per iteration, stddev 86.004µs
10000 iterations, 500 tasks, mean 1.958ms per iteration, stddev 121.516µs
10000 iterations, 500 tasks, mean 1.964ms per iteration, stddev 197.596µs
10000 iterations, 500 tasks, mean 2.010ms per iteration, stddev 199.950µs
Command being timed: "parallel ::: ./async-brigade ./async-brigade ./async-brigade ./async-brigade"
User time (seconds): 43.46
System time (seconds): 36.37
Percent of CPU this job got: 388%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:20.56
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 18320
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 14000
Voluntary context switches: 156
Involuntary context switches: 21729
Swaps: 0
File system inputs: 0
File system outputs: 2952
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
10000 iterations, 500 tasks, mean 4.647ms per iteration, stddev 162.274µs
10000 iterations, 500 tasks, mean 4.655ms per iteration, stddev 163.901µs
10000 iterations, 500 tasks, mean 4.661ms per iteration, stddev 168.954µs
10000 iterations, 500 tasks, mean 4.660ms per iteration, stddev 173.542µs
Command being timed: "parallel ::: ./thread-brigade ./thread-brigade ./thread-brigade ./thread-brigade"
User time (seconds): 39.11
System time (seconds): 174.65
Percent of CPU this job got: 451%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:47.34
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 18300
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 18716
Voluntary context switches: 20242611
Involuntary context switches: 5957
Swaps: 0
File system inputs: 0
File system outputs: 3560
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
10000 iterations, 500 tasks, mean 3.476ms per iteration, stddev 50.389µs
10000 iterations, 500 tasks, mean 3.511ms per iteration, stddev 189.990µs
10000 iterations, 500 tasks, mean 3.526ms per iteration, stddev 324.798µs
10000 iterations, 500 tasks, mean 3.603ms per iteration, stddev 113.295µs
10000 iterations, 500 tasks, mean 3.604ms per iteration, stddev 572.686µs
10000 iterations, 500 tasks, mean 3.612ms per iteration, stddev 137.070µs
10000 iterations, 500 tasks, mean 3.620ms per iteration, stddev 292.860µs
10000 iterations, 500 tasks, mean 3.666ms per iteration, stddev 570.292µs
Command being timed: "parallel ::: ./async-brigade ./async-brigade ./async-brigade ./async-brigade ./async-brigade ./async-brigade ./async-brigade ./async-brigade"
User time (seconds): 130.36
System time (seconds): 155.92
Percent of CPU this job got: 770%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:37.18
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 18408
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 22630
Voluntary context switches: 249
Involuntary context switches: 89419
Swaps: 0
File system inputs: 0
File system outputs: 4128
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
10000 iterations, 500 tasks, mean 3.713ms per iteration, stddev 229.816µs
10000 iterations, 500 tasks, mean 3.726ms per iteration, stddev 224.892µs
10000 iterations, 500 tasks, mean 3.738ms per iteration, stddev 219.007µs
10000 iterations, 500 tasks, mean 3.742ms per iteration, stddev 256.988µs
10000 iterations, 500 tasks, mean 3.748ms per iteration, stddev 257.293µs
10000 iterations, 500 tasks, mean 3.749ms per iteration, stddev 229.886µs
10000 iterations, 500 tasks, mean 3.751ms per iteration, stddev 225.626µs
10000 iterations, 500 tasks, mean 3.760ms per iteration, stddev 270.486µs
Command being timed: "parallel ::: ./thread-brigade ./thread-brigade ./thread-brigade ./thread-brigade ./thread-brigade ./thread-brigade ./thread-brigade ./thread-brigade"
User time (seconds): 57.31
System time (seconds): 239.67
Percent of CPU this job got: 775%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:38.28
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 18324
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 32009
Voluntary context switches: 40483621
Involuntary context switches: 53429
Swaps: 0
File system inputs: 0
File system outputs: 3688
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Since the serial execution timings of the async version barely change compared to my older i3, while the threaded version suffers a lot, this is probably not caused by differences in microarchitecture, cache capacity, or memory bus between the two systems (I can't think of a reason why those differences would only affect the threaded version, let alone so negatively).
Therefore, by elimination, core count seems like the only variable which varies enough to qualify as a possible explanation for this surprising behavior. This would also be consistent with the number of concurrent processes significantly affecting the results.
My insufficiently tested hypothesis is that in "serial" runs of the threaded version, the kernel might try to spread the threads across physical CPU cores for parallelism, which would be detrimental to this benchmark because it only increases inter-core communication without providing any actual parallel speedup in exchange.
In the oversubscribed regime, however, the kernel would be content with how the CPU cores are filled up and would refrain from migrating threads away from the core on which they were created, thereby avoiding this detrimental cross-core communication effect.
This hypothesis could be tested by tracing kernel scheduler task migration events, but my perf-fu is not strong enough for that.
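For reference, here is a minimal sketch of what I think such a trace could look like, assuming the sched:sched_migrate_task tracepoint is available on this kernel and perf is allowed to use it:

sudo perf stat -e 'sched:sched_migrate_task,context-switches' ./thread-brigade
sudo perf record -e sched:sched_migrate_task ./thread-brigade && sudo perf script

A much higher migration count for the standalone run than for the oversubscribed runs would support the hypothesis.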
Another way to test this would be to run the benchmark on a system where useless parallelism is even more costly (e.g. a multi-socket system, or one of those high-end AMD chips from before Zen 2 which had multiple NUMA domains per chip).
EDIT: Aaand... I just found a third way that's easy for me to test: just force the threaded version to execute on a single CPU core with taskset, and suddenly it runs much faster, even faster than the async version:
hadrien@linux-2ak3:~/Bureau/context-switch/target/release> cargo +nightly build --release && /usr/bin/time -v ./async-brigade && taskset -c 0 /usr/bin/time -v ./thread-brigade
Finished release [optimized + debuginfo] target(s) in 0.01s
10000 iterations, 500 tasks, mean 1.817ms per iteration, stddev 71.822µs
Command being timed: "./async-brigade"
User time (seconds): 10.29
System time (seconds): 8.08
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:18.38
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 2284
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 163
Voluntary context switches: 1
Involuntary context switches: 4599
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
10000 iterations, 500 tasks, mean 1405.914µs per iteration, stddev 42.745µs
Command being timed: "./thread-brigade"
User time (seconds): 3.82
System time (seconds): 10.42
Percent of CPU this job got: 98%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:14.39
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 16456
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 1584
Voluntary context switches: 5059200
Involuntary context switches: 13563
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
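As a side note, the same pinning could presumably be done from inside thread-brigade itself rather than through taskset. A minimal, untested sketch assuming the libc crate on Linux (the pin_to_cpu0 helper is mine, purely illustrative, not part of the benchmark):

use std::{io, mem};

// Restrict the calling thread to CPU 0, like `taskset -c 0` does. Threads
// spawned afterwards inherit the affinity mask, so calling this at the top
// of main() should confine the whole brigade to a single core.
fn pin_to_cpu0() -> io::Result<()> {
    unsafe {
        let mut set: libc::cpu_set_t = mem::zeroed();
        libc::CPU_ZERO(&mut set);
        libc::CPU_SET(0, &mut set);
        // A pid of 0 means "the calling thread".
        if libc::sched_setaffinity(0, mem::size_of::<libc::cpu_set_t>(), &set) != 0 {
            return Err(io::Error::last_os_error());
        }
    }
    Ok(())
}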
Conclusion: the reason why large numbers of I/O threads perform so badly on this system, and will likely perform even worse on systems with higher core counts or more complicated core interconnect topologies, is that threads needlessly conflate concurrency and parallelism.
A purely I/O-oriented OS threading abstraction like Windows NT fibers (EDIT: that was assuming Windows fibers were well integrated enough into the OS to yield to the user-mode scheduler on blocking system calls; after further research, it turns out they aren't) should be able to perform much better here.