Rayon par_iter_mut slower than serial

Hello! I have a problem and I need to ask for help.

I have the following code:

fn floyd_parallel(graph: &mut Matrix) {
    (0..BLOCK_SIZE).for_each(|k| unsafe {
        let row_k = *graph.get_unchecked(k);
        let column_k = get_column(&graph, k);
        graph.par_iter_mut().enumerate().for_each(|(i, rows)| {
            let ik = *column_k.get_unchecked(i);

            rows.par_iter_mut()
                .zip(row_k.par_iter())
                .for_each(|(ij, kj)| {
                    let sum = ik + *kj;
                    if sum < *ij {
                        *ij = sum;
                    }
                });
        });
    });
}

My problem is that if I remove the last par_iter_mut, the program runs faster. But the incredible thing is that if I replace all the par_iter_mut calls with iter_mut, the code is much faster (almost twice as fast).

Why does this happen? What am I missing? The same logic works perfectly with OpenMP in C, but with rayon I'm having problems.

I can't parallelize the k iterations because each iteration depends on the previous one.

Thank you very much to everyone.

Having two nested for_each calls forces the inner iterator to completely finish before proceeding to the next element of the outer for_each.

You're probably paying the setup cost of rows.par_iter_mut() over and over again, and within each row there's too little work to do, so it doesn't even run long enough for other threads to pick up any work from the inner parallel iterator.

If graph.par_iter_mut() has more elements than you have CPU cores, that's all the parallelism you need.

Maybe instead of using two nested for_each calls, try flat_map and a single for_each?
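With serial std iterators, that flattening looks like the sketch below (with rayon the analogous chain would use par_iter_mut and flat_map_iter; the toy Vec<Vec<u32>> grid here is just for illustration, not the original Matrix type):

```rust
fn main() {
    let mut grid = vec![vec![1u32; 4]; 3];

    // Instead of one for_each over rows containing another for_each
    // over cells, flatten both levels into a single iterator of cells:
    grid.iter_mut()
        .flat_map(|row| row.iter_mut())
        .for_each(|cell| *cell += 1);

    assert!(grid.iter().flatten().all(|&c| c == 2));
    println!("{:?}", grid);
}
```

With a single flattened iterator, rayon can split the whole collection of cells into large contiguous chunks instead of paying the parallel-dispatch cost once per row.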


Another problem is false sharing. If two threads write to adjacent locations in memory (like two elements in the same row), that forces CPU to synchronize caches, and that's super slow.
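The coarse-grained approach suggested above — parallelize only the outer row loop and keep the inner j loop serial, so each unit of work is large and each thread owns whole rows (which also avoids false sharing within a row) — can be sketched without rayon using std::thread::scope. Everything here (the flat row-major dist layout, the relax_rows helper, the thread count) is a hypothetical illustration, not the original code; a real rayon version would just drop the inner par_iter_mut:

```rust
use std::thread;

/// One k-iteration of Floyd–Warshall: rows are split across threads,
/// the inner j loop stays serial so each task is big enough to be worth it.
fn relax_rows(dist: &mut [u32], n: usize, k: usize, threads: usize) {
    // Snapshot row k and column k; they don't change during iteration k.
    let row_k: Vec<u32> = dist[k * n..(k + 1) * n].to_vec();
    let col_k: Vec<u32> = (0..n).map(|i| dist[i * n + k]).collect();
    let rows_per = (n + threads - 1) / threads;
    thread::scope(|s| {
        for (chunk_idx, chunk) in dist.chunks_mut(rows_per * n).enumerate() {
            let (row_k, col_k) = (&row_k, &col_k);
            s.spawn(move || {
                // Each thread owns a disjoint block of whole rows.
                for (r, row) in chunk.chunks_mut(n).enumerate() {
                    let ik = col_k[chunk_idx * rows_per + r];
                    for (ij, &kj) in row.iter_mut().zip(row_k) {
                        let sum = ik.saturating_add(kj);
                        if sum < *ij {
                            *ij = sum;
                        }
                    }
                }
            });
        }
    });
}

fn floyd(dist: &mut [u32], n: usize) {
    // The k loop stays serial: iteration k depends on iteration k - 1.
    for k in 0..n {
        relax_rows(dist, n, k, 4);
    }
}

fn main() {
    const INF: u32 = u32::MAX / 2;
    let n = 4;
    // Edges: 0->1 (1), 1->2 (1), 2->3 (1), plus a direct 0->3 (10).
    let mut d = vec![INF; n * n];
    for i in 0..n {
        d[i * n + i] = 0;
    }
    d[0 * n + 1] = 1;
    d[1 * n + 2] = 1;
    d[2 * n + 3] = 1;
    d[0 * n + 3] = 10;
    floyd(&mut d, n);
    assert_eq!(d[0 * n + 3], 3); // 0->1->2->3 beats the direct edge
    println!("shortest 0->3 = {}", d[0 * n + 3]);
}
```

Spawning a scope per k is wasteful in a real implementation (rayon's persistent thread pool avoids that), but the sketch shows the shape: one layer of parallelism, chunks aligned to row boundaries.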

Hello, Kornel

Thank you. You are right, but I still have a big doubt about this:

These four sequential algorithms are equivalent, but I don't understand why the first one is the fastest. I understand that Rust encourages using iterators, so why is the first one the best?

Algorithm 1

(0..BLOCK_SIZE).for_each(|k| unsafe {
    let kj = *kj.get_unchecked(k);
    (0..BLOCK_SIZE).for_each(|i| {
        let ik = ik.get_unchecked(i).get_unchecked(k);
        let row = ij.get_unchecked_mut(i);
        (0..BLOCK_SIZE).for_each(|j| {
            let sum = ik + kj.get_unchecked(j);
            let ij = row.get_unchecked_mut(j);
            if sum < *ij {
                *ij = sum;
            }
        });
    });
});

Algorithm 2

kj.iter().enumerate().for_each(|(k, kj)| unsafe {
    ij.iter_mut().zip(ik.iter()).for_each(|(ij, ik)| {
        let ik = *ik.get_unchecked(k);
        ij.iter_mut().zip(kj.iter()).for_each(|(ij, kj)| {
            let sum = ik + *kj;
            if sum < *ij {
                *ij = sum;
            }
        });
    });
});

Algorithm 3

(0..BLOCK_SIZE).for_each(|k| unsafe {
    let row_k = *kj.get_unchecked(k);
    ij.iter_mut().enumerate().for_each(|(i, rows)| {
        rows.iter_mut()
            .zip(row_k.iter())
            .zip(ik.iter())
            .map(|((ij, kj), ik)| (*ik.get_unchecked(k) + *kj, ij))
            .filter(|(sum, ij)| sum < *ij)
            .for_each(|(sum, ij)| {
                *ij = sum;
            });
    });
});

Algorithm 4

fn floyd_serial(ij: &mut Matrix, ik: Matrix, kj: Matrix) {
    (0..BLOCK_SIZE).for_each(|k| unsafe {
        let row_k = *kj.get_unchecked(k);
        ij.iter_mut().enumerate().for_each(|(i, rows)| {
            rows.iter_mut()
                .enumerate()
                .map(|(j, ij)| {
                    (
                        *ik.get_unchecked(i).get_unchecked(k) + *row_k.get_unchecked(j),
                        ij,
                    )
                })
                .filter(|(sum, ij)| sum < *ij)
                .for_each(|(sum, ij)| {
                    *ij = sum;
                });
        });
    });
}

Do you have a runnable code sample that you can post?

Often these optimization questions end up being a combination of

  • the compiler not optimizing one version as well as another
  • the different versions aren't actually doing precisely the same thing, which leads to either
    • Extra operations in the slower versions
    • Better memory/cache efficiencies due to iteration order differences

One way to drill down on it is to look at the resulting assembly from the compilation. Even if you don't have a good understanding of assembly sometimes you can infer some things just by diffing the output of two different compilations.

All that said, if you do have some runnable code you can post, it would let people poke around it on their own instead of trying to diagnose the issue purely by looking at the code.

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.