Using rayon's parallel iterators for ns-level tasks such as zeroing an array results in poor performance due to scheduling costs.
The usual suggestion is to use with_min_len to reduce the number of jobs, but I couldn't find any off-the-shelf guidance on what to pass as an argument. I understand that experimentation can help, but it is impractical to run experiments for every case, and a change of architecture or compiler might invalidate the results anyway.
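For concreteness, this is the shape of the with_min_len workaround being discussed (a minimal sketch; the 4096 is an arbitrary placeholder, which is exactly the problem):

```rust
use rayon::prelude::*;

fn zero(buf: &mut [u64]) {
    // with_min_len caps how finely rayon may split the range, but the
    // argument (4096 here) is a magic number with no obvious default.
    buf.par_iter_mut()
        .with_min_len(4096)
        .for_each(|x| *x = 0);
}
```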
It is best to measure, but if you can't, then as an approximation I suggest splitting the work so that the number of tasks is 2-4 times the number of available CPU cores.
Rayon can't run more tasks simultaneously than there are CPU cores, so splitting far beyond that number only adds overhead. But the workload and the work stealing may not balance perfectly, so it's better to subdivide a bit further, so that the last task left running (single threaded) can't take too long.
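A minimal sketch of that rule of thumb, using rayon's current_num_threads to count the worker threads (the factor 4 here is just the upper end of the 2x-4x range):

```rust
use rayon::prelude::*;

fn zero(buf: &mut [u64]) {
    // Aim for ~4 pieces per worker thread: enough slack for work
    // stealing to balance the load, not so many that splitting
    // overhead dominates.
    let pieces = rayon::current_num_threads() * 4;
    let min_len = (buf.len() / pieces).max(1);
    buf.par_iter_mut()
        .with_min_len(min_len)
        .for_each(|x| *x = 0);
}
```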
That is what I would do, too. Maybe a little trait depending on ExactSizeIterator and providing something like par_iter_auto could be a good idea.
The point is not to have something super refined (there's with_min_len for that), but rather a very easy-to-use, parameter-free alternative method that will work better than naked par_iter in 99% of cases.
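A rough sketch of what such an extension trait could look like. In rayon terms the natural bound is IndexedParallelIterator (its counterpart of ExactSizeIterator, since it knows its length); the names AutoMinLen and with_auto_min_len are invented for illustration:

```rust
use rayon::iter::MinLen;
use rayon::prelude::*;

// Hypothetical parameter-free wrapper applying the tasks-per-thread
// heuristic from above instead of asking the caller for a number.
trait AutoMinLen: IndexedParallelIterator {
    fn with_auto_min_len(self) -> MinLen<Self> {
        // Target roughly 4 tasks per worker thread.
        let tasks = rayon::current_num_threads() * 4;
        let min_len = (self.len() / tasks).max(1);
        self.with_min_len(min_len)
    }
}

impl<I: IndexedParallelIterator> AutoMinLen for I {}

fn zero(buf: &mut [u64]) {
    buf.par_iter_mut().with_auto_min_len().for_each(|x| *x = 0);
}
```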
BTW, I found very conflicting information while looking around for solutions. Some articles claim that rayon dynamically and automatically aggregates small pieces of work into sequential chunks, but I'm a bit perplexed about how this can happen. Other sources claim adamantly that each job is exactly the callback provided.
Note that besides rayon's scheduling behavior, another consideration for "ns-level tasks" is the actual code executed. The compiler can optimize an ordinary for loop or iterator for_each using techniques like unrolling and vectorization, but (I believe; I haven't confirmed this definitively) it can't do the same to rayon iteration, because the scheduling and splitting logic is always present.
So, you should also try having rayon give you chunks and iterating over those using a normal serial iterator.
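A sketch of that approach with par_chunks_mut: the chunk size is still computed with the rule of thumb from above, but the inner loop is a plain serial iterator the compiler is free to unroll and vectorize:

```rust
use rayon::prelude::*;

fn zero(buf: &mut [u64]) {
    let pieces = rayon::current_num_threads() * 4;
    let chunk = (buf.len() / pieces).max(1);
    // Parallelize over coarse chunks; each chunk is processed by an
    // ordinary serial iterator (for zeroing, c.fill(0) would also do).
    buf.par_chunks_mut(chunk)
        .for_each(|c| c.iter_mut().for_each(|x| *x = 0));
}
```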
After some experimentation, I can claim that with_min_len is essentially irrelevant for indexed iterators, unless you make it so big that it hurts parallelism.
I take back my previous claim that "using rayon's parallel iterators for ns-level tasks such as zeroing an array results in poor performance due to scheduling costs". That came from some testing done by a student, and I should have known better.