I've made a trivial example program that simply generates consecutive numbers, sends them through a channel, and accumulates them in a receiver thread. My motivation was to get a feeling for the performance of different channel implementations (std::mpsc, crossbeam, ...). However, I got strange results on macOS (Intel), where the release build is significantly slower (5x) with std::mpsc than the debug build.
Here is the code and a README.md containing my benchmark results for different platforms:
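For readers without the repository at hand, the setup is roughly the following sketch (a hypothetical reconstruction, not the actual benchmark code; the function name, element count, and capacity of 5 are my own choices based on the description):

```rust
use std::sync::mpsc::sync_channel;
use std::thread;
use std::time::Instant;

/// Send 0..n through a bounded std::mpsc channel and return the receiver's sum.
fn run(capacity: usize, n: u64) -> u64 {
    let (tx, rx) = sync_channel::<u64>(capacity);
    let producer = thread::spawn(move || {
        for i in 0..n {
            tx.send(i).unwrap();
        }
        // tx drops here, closing the channel and ending rx.iter() below.
    });
    // Receiver thread (here: the main thread) accumulates everything.
    let sum: u64 = rx.iter().sum();
    producer.join().unwrap();
    sum
}

fn main() {
    let n = 1_000_000;
    let start = Instant::now();
    let sum = run(5, n); // small bound, as in the benchmark description
    println!("sum = {sum}, elapsed = {:?}", start.elapsed());
    assert_eq!(sum, n * (n - 1) / 2);
}
```

Swapping `sync_channel` for `crossbeam_channel::bounded` would give the crossbeam variant of the same measurement.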
With std::mpsc now being based on the crossbeam code, I would also have expected very similar runtime behaviour for both implementations, which, apart from Windows (!), does not seem to be the case.
After a quick test, I believe the culprit is using a bounded channel with a very low capacity. When I raised the capacity from 5 to 5000, I noticed a ~3.5x speed improvement when running the std channel in release compared to the debug build.
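The capacity effect is easy to reproduce with a small experiment that times the same producer/consumer pair at a few different bounds (a sketch; the element count and the capacities tried are arbitrary):

```rust
use std::sync::mpsc::sync_channel;
use std::thread;
use std::time::{Duration, Instant};

/// Time sending 0..n through a bounded channel and draining it.
fn time_with_capacity(capacity: usize, n: u64) -> Duration {
    let (tx, rx) = sync_channel::<u64>(capacity);
    let start = Instant::now();
    let producer = thread::spawn(move || {
        for i in 0..n {
            tx.send(i).unwrap();
        }
    });
    let _sum: u64 = rx.iter().sum();
    producer.join().unwrap();
    start.elapsed()
}

fn main() {
    let n = 1_000_000;
    for capacity in [5, 100, 5000] {
        // A tiny bound forces far more blocking and wakeups per element.
        println!("capacity {:>5}: {:?}", capacity, time_with_capacity(capacity, n));
    }
}
```

With a capacity of 5, the producer blocks almost every few sends, so the run is dominated by park/unpark round trips rather than by the channel's data path; a larger bound amortizes those wakeups.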
So maybe the optimized "number generator" part is too fast in release builds and triggers some suboptimal behavior in this case? It seems to depend very much on the machine running the code.
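One way to test that hypothesis is to artificially throttle the producer and see whether the release-build penalty disappears (a sketch; the spin count of 50 is an arbitrary assumption, tune it per machine):

```rust
use std::hint::black_box;
use std::sync::mpsc::sync_channel;
use std::thread;
use std::time::Instant;

/// Burn a few cycles so the optimized producer cannot outrun the receiver.
/// black_box keeps the loop from being optimized away in release builds.
fn busy_work(iters: u64) -> u64 {
    let mut x = 0u64;
    for i in 0..iters {
        x = black_box(x.wrapping_add(i));
    }
    x
}

fn main() {
    let n = 1_000_000u64;
    let (tx, rx) = sync_channel::<u64>(5);
    let start = Instant::now();
    let producer = thread::spawn(move || {
        for i in 0..n {
            busy_work(50); // throttle; remove to compare with the fast producer
            tx.send(i).unwrap();
        }
    });
    let sum: u64 = rx.iter().sum();
    producer.join().unwrap();
    println!("sum = {sum}, elapsed = {:?}", start.elapsed());
}
```

If the throttled release build no longer lags behind debug, that would point at the producer saturating the tiny buffer and forcing constant blocking rather than at the channel's per-element cost.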
But why doesn't crossbeam show this behavior? Have the implementations diverged again since that merge?
But you are right: increasing the bound helps a lot (even 100 gives a much better result). I should have tried this, too.
Are all CPUs busy? If they are, a whole new set of problems appears. When a channel unblocks and no CPU is idle, how soon does the reading end get some CPU time? That depends on the operating system's scheduler and on how the code releases the lock.
Consider trying parking_lot's fair queues for comparison. They are better at avoiding starvation under heavy load, at the cost of a little more CPU time at each unlock.