Surprise: std::sync::mpsc::channel much slower in optimized release build on macOS/Intel

I've made a trivial example program that simply generates consecutive numbers, sends them through a channel, and accumulates them in the receiver thread. My motivation was to get a feeling for the performance of different channel implementations (std::mpsc, crossbeam, ...). However, I got strange results on macOS/Intel, where the release build is significantly slower (5x) with std::mpsc than the debug build.

Here is the code and a README.md containing my benchmark results for different platforms:

With std::mpsc now being based on the crossbeam code, I would also have expected very similar runtime behaviour for both implementations, which, apart from Windows (!), does not seem to be the case.

Thanks for reading!

Nils

I can confirm that on Linux x86_64 I also see a performance decrease.

$ ./target/debug/foo
std 36.688066ms
crossbeam 27.071189ms

$ ./target/release/foo
std 111.013399ms
crossbeam 8.227681ms

After a quick test, I believe the culprit is using a bounded channel with a very low capacity. When I raised the capacity from 5 to 5000, I saw a ~3.5x speed improvement for the std channel in release compared to the debug build.
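A small harness makes the capacity effect easy to reproduce (a sketch under my own assumptions about element count and capacities; the absolute timings will vary a lot between machines):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

// Send `n` numbers through a bounded channel of the given capacity
// and return the checksum plus the elapsed wall-clock time.
fn bench(capacity: usize, n: u64) -> (u64, Duration) {
    let (tx, rx) = mpsc::sync_channel::<u64>(capacity);
    let start = Instant::now();
    let producer = thread::spawn(move || {
        for i in 0..n {
            tx.send(i).unwrap();
        }
        // `tx` dropped here: channel closes, receiver loop ends.
    });
    let sum: u64 = rx.iter().sum();
    producer.join().unwrap();
    (sum, start.elapsed())
}

fn main() {
    let n = 100_000u64;
    for cap in [5, 100, 5000] {
        let (sum, elapsed) = bench(cap, n);
        // Sanity check: nothing was lost in transit.
        assert_eq!(sum, n * (n - 1) / 2);
        println!("capacity {:>4}: {:?}", cap, elapsed);
    }
}
```

With a tiny capacity, the producer blocks on almost every `send`, so the run measures blocking and wakeup overhead more than channel throughput.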


So maybe the optimized "number generator" part is too fast in release builds and triggers some suboptimal behavior in this case? It seems to depend very much on the machine running the code.

But why doesn't crossbeam show this behavior? Have the implementations diverged again since that merge?

But you are right, increasing the bound does help a lot (even 100 gives a much better result). I should have tried this, too.

Thanks!

With a capacity of 1000 std::mpsc is faster than crossbeam on my machine.

It's probably a good idea to benchmark this again later with a realistic payload.

Thanks again for the feedback!


Are all CPUs busy? If they are, a whole new set of problems appears. When a channel unblocks and no CPU is idle, how soon does the receiving end get some CPU time? That depends on the operating system's scheduler and on how the code releases its lock.

Consider trying parking_lot's fair queues, for comparison. That's better at avoiding starvation under heavy load, at the cost of a little more CPU time at each unlock.


Hi John! Of the eight (virtual) cores, four are totally bored (Intel MacBook).

I'll give the parking_lot queues a try. Thanks for the hint.