Is there a parallelism crate that can do static scheduling of array loops?

Hi,

I'm trying to figure out how to efficiently parallelize numerical codes (usually doing iterations over arrays).

I tried with Rayon but my first results were disappointing when compared to C/OpenMP.

For example, I converted this code

for(long iter=0;iter<iterations*2;iter++) {
    #pragma omp parallel for
    for(long i=1;i<size-1;i++)
        data2[i] = FAC1*data1[i]+FAC2*(data1[i-1]+data1[i+1]);
    float* data_tmp = data1;
    data1 = data2;
    data2 = data_tmp;
}

into Rust like this:

for _iter in 0..iterations * 2 {
    data_b[1..size - 1].par_iter_mut().enumerate().for_each(
        |(i, x): (usize, &mut f32)| {
            let k = i + 1;
            *x = unsafe {
                FAC1 * data_a.get_unchecked(k)
                    + FAC2 * (data_a.get_unchecked(k - 1) + data_a.get_unchecked(k + 1))
            }
        },
    );
    std::mem::swap(&mut data_a, &mut data_b);
}

Serial execution gives the same performance in both languages (C without OpenMP vs. Rust with iter_mut() instead of par_iter_mut()). See purple vs. orange curves.

But the code above, running with 8 threads, shows that Rayon is almost useless in this case. See green vs. yellow curves.

Checking what happens inside Rayon, I understand that it splits the array into chunks and schedules those chunks dynamically across the threads (work stealing).

So this should behave like OpenMP with dynamic scheduling rather than the default static scheduling, and indeed the performance is much closer to that. See light blue vs. yellow curves.

My interpretation is that two things kill the performance: scheduling the same array chunks onto different CPU cores between sweeps destroys cache reuse, and the scheduling itself has too much overhead.

Implementing the parallelization manually (messy unsafe code using crossbeam's scoped spawn for the threads and SyncUnsafeCell to allow parallel access to the arrays) finally gives me competitive results. See green, blue and red curves.
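As an aside, for this particular access pattern (one array only read, the other only written) a static split is also expressible safely, with std::thread::scope plus chunks_mut instead of SyncUnsafeCell. A sketch with placeholder coefficients; unlike my real code it re-spawns threads every sweep, which is exactly the overhead I avoid by keeping worker threads alive:

```rust
use std::thread;

const FAC1: f32 = 0.5; // placeholder coefficients for illustration
const FAC2: f32 = 0.25;

/// One sweep with a static split: the writable interior of `data_b` is cut
/// into one contiguous chunk per thread; `data_a` is shared read-only.
fn sweep_static(data_a: &[f32], data_b: &mut [f32], threads: usize) {
    let size = data_a.len();
    let interior = &mut data_b[1..size - 1];
    let chunk = (interior.len() + threads - 1) / threads;
    thread::scope(|s| {
        for (t, out) in interior.chunks_mut(chunk).enumerate() {
            let offset = 1 + t * chunk; // global index of out[0]
            s.spawn(move || {
                for (j, x) in out.iter_mut().enumerate() {
                    let k = offset + j;
                    *x = FAC1 * data_a[k] + FAC2 * (data_a[k - 1] + data_a[k + 1]);
                }
            });
        }
    });
}
```

With fixed chunk boundaries each core touches the same data every sweep, which is the cache behavior OpenMP's static scheduling gives for free.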

Another complication I found here was that std::sync::Barrier also has too much overhead, so to beat C I had to implement the barrier myself, spinning on a std::sync::atomic. See blue vs. red curve.

So, to be honest, I'm disappointed with Rayon as a replacement for OpenMP's parallel for. My conclusion would be that it can only sensibly be used if:

  • cache reuse between subsequent calls to par_iter().for_each() on the same arrays has no significant impact,
  • and the arrays are large enough that synchronization overhead is negligible relative to the computation (that overhead being much larger than with OpenMP).

So I was wondering whether there is a different (widely used) crate that achieves what I want in an easy way, without having to write unsafe code. I couldn't come up with anything, but surely something like this must exist?