Parallelization advice

I have code whose pertinent structure is morally equivalent to something like:

fn doit(events: Vec<Event>) -> Vec<Data> {
    let mut accumulator = vec![0.0; A_FEW_THOUSAND];
    for event in events {
        // increment small fraction of `accumulator`'s elements
    }
    accumulator
}

The calculation performed in the loop is embarrassingly parallel, but the results are stored by mutating external state, so parallelizing it isn't trivial. I'm thinking of distributing the events between different processors, and then combining the resulting accumulators at the end. I was hoping to use rayon, but it's not clear how to express this idea in rayon.

Another approach would be to replace the loop with a map which converts each event into some sparse vector of modified elements, which would then be reduced into the dense result accumulator, but I suspect that allocating a sparse vector for each event will turn out to be significantly expensive.

Can you suggest a good approach to this problem? Any crates that might be well suited to it?

par_iter_mut from rayon should do it.

If your computation only touches one element in the accumulator you could implement this as events.into_par_iter().map(...).collect() using rayon.

Otherwise if your computation will touch multiple elements I'd use ParallelIterator::fold() instead of map(), where the identity parameter creates a new empty accumulator. Then you'll chain on a reduce() to combine each of the intermediate accumulators into your final result.

It should only create one accumulator per worker thread so memory usage shouldn't be a problem, especially for a modern PC. If it ends up being a problem, you could choose a smarter sparse accumulator like a BTreeMap<usize, T> instead of a plain Vec<T>.

1 Like

IIUC, this allows mutation of the elements being iterated.

In my case, the iteration is over the events, which should not be mutated. It's the accumulator which should be mutated, and we are not iterating over the accumulator.

So I don't think par_iter_mut applies to my case. Have I misunderstood something?

par_iter_mut() will iterate over the items and let you mutate them (&mut T). You can also use par_iter() to iterate over references (&T) or into_par_iter() to consume the events by value.

All three adapters will transform your Vec<Event> into something that implements ParallelIterator.

Looks like I misunderstood.

So it looks like you don't know what part of accumulator will be modified for each event.
Atomics maybe?

No, it touches many, though still a small fraction of the total. A typical number would be 40 out of 4000.

That seems to have worked reasonably well: ✕2.5 speedup at around 4 threads.

Any hints on how to profile this or otherwise understand how it might be optimized? rayon turns flamegraphs into quite a spectacular mess!