Is there a parallelism crate that can do static scheduling of array loops?

Hi,

I'm trying to figure out how to efficiently parallelize numerical codes (usually doing iterations over arrays).

I tried with Rayon but my first results were disappointing when compared to C/OpenMP.

For example, I converted this code

for(long iter=0;iter<iterations*2;iter++) {
    #pragma omp parallel for
    for(long i=1;i<size-1;i++)
        data2[i] = FAC1*data1[i]+FAC2*(data1[i-1]+data1[i+1]);
    float* data_tmp = data1;
    data1 = data2;
    data2 = data_tmp;
}

into Rust like this:

for _iter in 0..iterations * 2 {
    data_b[1..size - 1].par_iter_mut().enumerate().for_each(
        |(i, x): (usize, &mut f32)| {
            let k = i + 1;
            *x = unsafe {
                FAC1 * data_a.get_unchecked(k)
                    + FAC2 * (data_a.get_unchecked(k - 1) + data_a.get_unchecked(k + 1))
            }
        },
    );
    std::mem::swap(&mut data_a, &mut data_b);
}

Serial execution gives the same performance in both languages (C without OpenMP vs. Rust with iter_mut() instead of par_iter_mut()). See purple vs. orange curves.

But the code above, running with 8 threads, shows that Rayon is almost useless in this case. See green vs. yellow curves.

Checking what happens inside Rayon, I understand that it splits the array into chunks and schedules those chunks dynamically across the threads (work stealing).

So this should behave like OpenMP with dynamic scheduling rather than the default static scheduling, and indeed the performance is much closer to that. See light blue vs. yellow curves.

My interpretation is that two things kill the performance: scheduling the same array chunks onto different CPU cores between sweeps destroys cache reuse, and the scheduling itself has too much overhead.

Implementing the parallelization manually (messy unsafe code using crossbeam's scoped spawn for the threads and SyncUnsafeCell to allow parallel access to the arrays) finally gives me competitive results. See green, blue and red curves.
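As an aside, for this particular access pattern (one array only read, the other only written) a static split is also expressible safely, with std::thread::scope plus chunks_mut instead of SyncUnsafeCell. A sketch with placeholder coefficients; unlike my real code it re-spawns threads every sweep, which is exactly the overhead I avoid by keeping worker threads alive:

```rust
use std::thread;

const FAC1: f32 = 0.5; // placeholder coefficients for illustration
const FAC2: f32 = 0.25;

/// One sweep with a static split: the writable interior of `data_b` is cut
/// into one contiguous chunk per thread; `data_a` is shared read-only.
fn sweep_static(data_a: &[f32], data_b: &mut [f32], threads: usize) {
    let size = data_a.len();
    let interior = &mut data_b[1..size - 1];
    let chunk = (interior.len() + threads - 1) / threads;
    thread::scope(|s| {
        for (t, out) in interior.chunks_mut(chunk).enumerate() {
            let offset = 1 + t * chunk; // global index of out[0]
            s.spawn(move || {
                for (j, x) in out.iter_mut().enumerate() {
                    let k = offset + j;
                    *x = FAC1 * data_a[k] + FAC2 * (data_a[k - 1] + data_a[k + 1]);
                }
            });
        }
    });
}
```

With fixed chunk boundaries each core touches the same data every sweep, which is the cache behavior OpenMP's static scheduling gives for free.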

Another complication I found here was that std::sync::Barrier also has too much overhead, so to beat C I had to implement the barrier myself, spinning on a std::sync::atomic. See blue vs. red curve.

So, to be honest, I'm disappointed with Rayon as a replacement for OpenMP's parallel for. My conclusion would be that it can only sensibly be used if:

  • cache reuse between subsequent calls to par_iter().for_each() on the same arrays has no significant impact,
  • and the arrays are large enough that synchronization overhead is negligible relative to the computation (that overhead being much larger than with OpenMP).

So I was wondering whether there is a different (widely used) crate that achieves what I want in an easy way, without having to write unsafe code. I couldn't come up with anything, but surely something like this must exist?