Parallelization advice

jacg · December 3, 2020, 4:33pm

I have code whose pertinent structure is morally equivalent to something like:

fn doit(events: Vec<Event>) -> Vec<Data> {
    let mut accumulator = vec![0.0; A_FEW_THOUSAND];
    for event in events {
        // increment small fraction of `accumulator`'s elements
    }
    accumulator
}

The calculation performed in the loop is embarrassingly parallel, but the results are stored by mutating external state, so parallelizing it isn't trivial. I'm thinking of distributing the events between different processors, and then combining the resulting accumulators at the end. I was hoping to use rayon, but it's not clear how to express this idea in rayon.

Another approach would be to replace the loop with a map which converts each event into some sparse vector of modified elements, which would then be reduced into the dense result accumulator, but I suspect that allocating a sparse vector for each event will turn out to be significantly expensive.

Can you suggest a good approach to this problem? Any crates that might be well suited to it?

s3bk · December 3, 2020, 4:36pm

par_iter_mut from rayon should do it.

Michael-F-Bryan · December 3, 2020, 4:47pm

If your computation only touches one element in the accumulator you could implement this as events.into_par_iter().map(...).collect() using rayon.

Otherwise if your computation will touch multiple elements I'd use ParallelIterator::fold() instead of map(), where the identity parameter creates a new empty accumulator. Then you'll chain on a reduce() to combine each of the intermediate accumulators into your final result.

It should only create one accumulator per worker thread so memory usage shouldn't be a problem, especially for a modern PC. If it ends up being a problem, you could choose a smarter sparse accumulator like a BTreeMap<usize, T> instead of a plain Vec<T>.

jacg · December 3, 2020, 5:35pm

IIUC, this allows mutation of the elements being iterated.

In my case, the iteration is over the events, which should not be mutated. It's the accumulator which should be mutated, and we are not iterating over the accumulator.

So I don't think par_iter_mut applies to my case. Have I misunderstood something?

Michael-F-Bryan · December 3, 2020, 5:39pm

par_iter_mut() will iterate over the items and let you mutate them (&mut T). You can also use par_iter() to iterate over references (&T) or into_par_iter() to consume the events by value.

All three adapters will transform your Vec<Event> into something that implements ParallelIterator.

s3bk · December 3, 2020, 6:13pm

Looks like I misunderstood.

So it looks like you don't know what part of accumulator will be modified for each event.
Atomics maybe?

jacg · December 3, 2020, 10:40pm

No, it touches many, though still a small fraction of the total. A typical number would be 40 out of 4000.

That seems to have worked reasonably well: ✕2.5 speedup at around 4 threads.

Any hints on how to profile this or otherwise understand how it might be optimized? rayon turns flamegraphs into quite a spectacular mess!

system · March 3, 2021, 10:41pm

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.

Topic		Replies	Views
Rayon sequential dependency between parallel calculations	6	1267	May 16, 2021
How to optimize this code pattern with rayon? help	3	418	January 16, 2024
How to parallelize this code? help	21	1423	December 28, 2020
Is there some way to convert rayon parallel iterator back to sequential iterator? help	3	1895	November 24, 2019
Rayon & Packed SIMD accumulator pattern help	2	548	March 17, 2021

Parallelization advice

Related topics