Such unbalanced sizes are exactly why rayon's splitting is "adaptive" by default, even though that may oversplit. It's generally effective, but not all that smart.
I think in your case, (0..points.len()).par_bridge()... might work well. The bridge will split into just enough jobs for the number of threads, then each will pick values from a shared Mutex<Iter>. That's a bit of a chokepoint, but that shouldn't matter if the rest of the work is significant enough. Your step 2 accumulators will be 1:1 with those bridge jobs.
Another approach could do this without the parallel iterators at all, something like:
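use std::sync::atomic::{AtomicUsize, Ordering};

// Shared cursor: each thread claims the next unprocessed index.
let index = AtomicUsize::new(0);
// Runs the closure once on every thread in the pool, returning a Vec of results.
let accumulators = rayon::broadcast(|_context| {
    let mut accumulator = init();
    loop {
        let i = index.fetch_add(1, Ordering::Relaxed);
        if i >= points.len() {
            break;
        }
        // do stuff with points[i]...
    }
    accumulator
});
// ... then merge the accumulators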
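
If you want to try the claim-an-index pattern in isolation, here is a rough std-only sketch that swaps rayon::broadcast for std::thread::scope and includes the merge step. The function name, the workers parameter, and the squaring workload are all placeholders of mine, not anything from your code.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Sums the squares of `points` with `workers` threads that claim indices
/// from a shared atomic counter. Squaring is a hypothetical workload.
fn sum_of_squares(points: &[u64], workers: usize) -> u64 {
    let index = AtomicUsize::new(0);
    let partials: Vec<u64> = thread::scope(|s| {
        let mut handles = Vec::with_capacity(workers);
        for _ in 0..workers {
            handles.push(s.spawn(|| {
                let mut accumulator = 0u64;
                loop {
                    // Claim the next index; Relaxed is enough because the
                    // counter only hands out unique numbers.
                    let i = index.fetch_add(1, Ordering::Relaxed);
                    if i >= points.len() {
                        break;
                    }
                    accumulator += points[i] * points[i];
                }
                accumulator
            }));
        }
        // Collect one accumulator per worker thread.
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });
    // Merge step: combine the per-thread accumulators.
    partials.into_iter().sum()
}
```

A thread that hits an expensive point just claims fewer indices overall, so the load balances without any splitting heuristics. The cost is one atomic increment per item.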