Mutex locking alternatives in hot data parallel loop

I am using rayon to achieve data parallelism in a very hot loop. However, after each thread performs its expensive work, the result of the computation needs to be assembled into a common data structure. My code is essentially:

use rayon::prelude::*;
use std::sync::{Arc, Mutex};

let assembly = Arc::new(Mutex::new(Assembly::new()));

data.par_iter().for_each(|datum| {
    let result = expensive_computation(datum);

    // Every thread has to take the same lock to record its result.
    let mut guard = assembly.lock().unwrap();
    guard.assemble_result(result);
});

When I watch CPU utilization while the program is running, I am not seeing all CPUs 100% busy as I expect. I now realize that the majority of threads are blocked waiting to acquire the lock. How can I speed up this expensive portion of my code? Would an async mutex or a lock-free data structure fix this problem?

You don't need the mutex if you don't have concurrent access.

let results: Vec<_> = data.par_iter()
    .map(expensive_computation)
    .collect();

let mut assembly = Assembly::new();

// Assembly happens sequentially afterwards, with no contention.
for res in results {
    assembly.assemble_result(res);
}

This is commonly called the map-reduce pattern.

Thanks. I was trying to avoid calling collect since it would allocate a large temporary data structure.

It is a space-time tradeoff. Also, you already hold a large structure in data, so the increase in memory consumption is linear at worst.

If you have a way to merge them, you could fold into parallel Assemblys and then reduce them down to one.
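
Something like this, assuming Assembly is Send and has some way to combine two instances (a hypothetical merge method in this sketch):

use rayon::prelude::*;

let assembly = data
    .par_iter()
    // Each rayon job folds its items into its own local Assembly...
    .fold(Assembly::new, |mut acc, datum| {
        acc.assemble_result(expensive_computation(datum));
        acc
    })
    // ...and the per-job Assemblys are then merged pairwise into one.
    .reduce(Assembly::new, |mut a, b| {
        a.merge(b); // hypothetical merge method, not in the original code
        a
    });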

Another option is to send the results to a channel, then have another thread read from the channel and assemble the results. Sending never blocks, and it only allocates significantly if the processing threads produce data faster than the assembly thread consumes them.
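
A rough sketch of that approach, using a standard mpsc channel and rayon's for_each_with so every job gets its own clone of the sender (Assembly, data, expensive_computation, and assemble_result come from the question; the rest is assumed, including that the result type and Assembly are Send):

use rayon::prelude::*;
use std::sync::mpsc;
use std::thread;

let (tx, rx) = mpsc::channel();

// Dedicated assembly thread: drains the channel until every sender is dropped.
let assembler = thread::spawn(move || {
    let mut assembly = Assembly::new();
    for result in rx {
        assembly.assemble_result(result);
    }
    assembly
});

// Each rayon job sends its results through its own Sender clone; nothing is locked.
data.par_iter().for_each_with(tx, |tx, datum| {
    tx.send(expensive_computation(datum)).unwrap();
});

// for_each_with consumed the original sender, so once it returns the channel
// closes and the assembler thread's loop ends.
let assembly = assembler.join().unwrap();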

But I agree with @cuviper that you should try folding first. Take a look at ParallelIterator's fold and reduce methods, and see if they can work for you. (Which one is best depends on the characteristics of the assemble_result operation, or how it can be adjusted to fit.) The advantage of this over other mechanisms is that it allows rayon to handle the scheduling of the assembly together with everything else it is doing, rather than blocking or using an independent thread.
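
For comparison, under the same hypothetical merge method as the earlier sketch, the reduce-leaning variant builds one Assembly per item and lets reduce do all of the combining; whether that or the fold version is cheaper depends on how expensive assemble_result and the merge are:

use rayon::prelude::*;

let assembly = data
    .par_iter()
    .map(|datum| {
        // One Assembly per item, each holding a single result.
        let mut a = Assembly::new();
        a.assemble_result(expensive_computation(datum));
        a
    })
    // Pairwise merging of all the single-result Assemblys.
    .reduce(Assembly::new, |mut a, b| {
        a.merge(b); // hypothetical merge method
        a
    });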

Thanks for the recommendation. I'm kicking myself that I didn't think to use fold or reduce earlier. I was able to shoehorn my problem into this paradigm.

The resulting solution still has some threads pausing within rayon as they wait for the shared assembly to be passed their way, but removing the mutex lock was a huge win. My CPUs are now running at full utilization in the expensive_computation part and at about 50% utilization in the assemble_result part.
