Rayon par_iter_mut slower than serial

Hello! I have a problem and I need to ask for help.

I have the following code:

fn floyd_parallel(graph: &mut Matrix) {
    (0..BLOCK_SIZE).for_each(|k| unsafe {
        let row_k = *graph.get_unchecked(k);
        let column_k = get_column(&graph, k);
        graph.par_iter_mut().enumerate().for_each(|(i, rows)| {
            let ik = *column_k.get_unchecked(i);

            rows.par_iter_mut()
                .zip(row_k.par_iter())
                .for_each(|(ij, kj)| {
                    let sum = ik + *kj;
                    if sum < *ij {
                        *ij = sum;
                    }
                });
        });
    });
}

My problem is that if I remove the last par_iter_mut, the program runs faster. But the incredible thing is that if I replace all the par_iter_mut calls with iter_mut, the code is much faster (almost twice as fast).

Why does this happen? What am I missing? The same logic works perfectly with OpenMP in C, but with rayon I'm having problems.

I can't parallelize the k iterations because each iteration depends on the previous one.

Thank you very much to everyone.

Having two nested for_each calls forces the inner iterator to completely finish before proceeding to the next element of the outer for_each.

You're probably paying the setup cost of rows.par_iter_mut() over and over again, and within each row there's too little work to do, so it doesn't even run long enough for other threads to pick up any work from the inner parallel iterator.

If graph.par_iter_mut() has more elements than you have CPU cores, that's all the parallelism you need.

Maybe instead of using two nested for_each calls, try flat_map and a single for_each?
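With serial std iterators, that flattening looks like the sketch below (with rayon the analogous chain would use par_iter_mut and flat_map_iter; the toy Vec<Vec<u32>> grid here is just for illustration, not the original Matrix type):

```rust
fn main() {
    let mut grid = vec![vec![1u32; 4]; 3];

    // Instead of one for_each over rows containing another for_each
    // over cells, flatten both levels into a single iterator of cells:
    grid.iter_mut()
        .flat_map(|row| row.iter_mut())
        .for_each(|cell| *cell += 1);

    assert!(grid.iter().flatten().all(|&c| c == 2));
    println!("{:?}", grid);
}
```

With a single flattened iterator, rayon can split the whole collection of cells into large contiguous chunks instead of paying the parallel-dispatch cost once per row.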


Another problem is false sharing. If two threads write to adjacent locations in memory (like two elements in the same row), that forces CPU to synchronize caches, and that's super slow.
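The coarse-grained approach suggested above — parallelize only the outer row loop and keep the inner j loop serial, so each unit of work is large and each thread owns whole rows (which also avoids false sharing within a row) — can be sketched without rayon using std::thread::scope. Everything here (the flat row-major dist layout, the relax_rows helper, the thread count) is a hypothetical illustration, not the original code; a real rayon version would just drop the inner par_iter_mut:

```rust
use std::thread;

/// One k-iteration of Floyd–Warshall: rows are split across threads,
/// the inner j loop stays serial so each task is big enough to be worth it.
fn relax_rows(dist: &mut [u32], n: usize, k: usize, threads: usize) {
    // Snapshot row k and column k; they don't change during iteration k.
    let row_k: Vec<u32> = dist[k * n..(k + 1) * n].to_vec();
    let col_k: Vec<u32> = (0..n).map(|i| dist[i * n + k]).collect();
    let rows_per = (n + threads - 1) / threads;
    thread::scope(|s| {
        for (chunk_idx, chunk) in dist.chunks_mut(rows_per * n).enumerate() {
            let (row_k, col_k) = (&row_k, &col_k);
            s.spawn(move || {
                // Each thread owns a disjoint block of whole rows.
                for (r, row) in chunk.chunks_mut(n).enumerate() {
                    let ik = col_k[chunk_idx * rows_per + r];
                    for (ij, &kj) in row.iter_mut().zip(row_k) {
                        let sum = ik.saturating_add(kj);
                        if sum < *ij {
                            *ij = sum;
                        }
                    }
                }
            });
        }
    });
}

fn floyd(dist: &mut [u32], n: usize) {
    // The k loop stays serial: iteration k depends on iteration k - 1.
    for k in 0..n {
        relax_rows(dist, n, k, 4);
    }
}

fn main() {
    const INF: u32 = u32::MAX / 2;
    let n = 4;
    // Edges: 0->1 (1), 1->2 (1), 2->3 (1), plus a direct 0->3 (10).
    let mut d = vec![INF; n * n];
    for i in 0..n {
        d[i * n + i] = 0;
    }
    d[0 * n + 1] = 1;
    d[1 * n + 2] = 1;
    d[2 * n + 3] = 1;
    d[0 * n + 3] = 10;
    floyd(&mut d, n);
    assert_eq!(d[0 * n + 3], 3); // 0->1->2->3 beats the direct edge
    println!("shortest 0->3 = {}", d[0 * n + 3]);
}
```

Spawning a scope per k is wasteful in a real implementation (rayon's persistent thread pool avoids that), but the sketch shows the shape: one layer of parallelism, chunks aligned to row boundaries.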

Hello, Kornel

Thank you. You are right, but I still have a big doubt about this:

These four sequential algorithms are equivalent, but I don't understand why the first one is the fastest. I understand that Rust encourages using iterators, so why is the first one the best?

Algorithm 1

(0..BLOCK_SIZE).for_each(|k| unsafe {
    let kj = *kj.get_unchecked(k);
    (0..BLOCK_SIZE).for_each(|i| {
        let ik = ik.get_unchecked(i).get_unchecked(k);
        let row = ij.get_unchecked_mut(i);
        (0..BLOCK_SIZE).for_each(|j| {
            let sum = ik + kj.get_unchecked(j);
            let ij = row.get_unchecked_mut(j);
            if sum < *ij {
                *ij = sum;
            }
        });
    });
});

Algorithm 2

kj.iter().enumerate().for_each(|(k, kj)| unsafe {
    ij.iter_mut().zip(ik.iter()).for_each(|(ij, ik)| {
        let ik = *ik.get_unchecked(k);
        ij.iter_mut().zip(kj.iter()).for_each(|(ij, kj)| {
            let sum = ik + *kj;
            if sum < *ij {
                *ij = sum;
            }
        });
    });
});

Algorithm 3

(0..BLOCK_SIZE).for_each(|k| unsafe {
    let row_k = *kj.get_unchecked(k);
    ij.iter_mut().enumerate().for_each(|(i, rows)| {
        rows.iter_mut()
            .zip(row_k.iter())
            .zip(ik.iter())
            .map(|((ij, kj), ik)| (*ik.get_unchecked(k) + *kj, ij))
            .filter(|(sum, ij)| sum < *ij)
            .for_each(|(sum, ij)| {
                *ij = sum;
            });
    });
});

Algorithm 4

fn floyd_serial(ij: &mut Matrix, ik: Matrix, kj: Matrix) {
    (0..BLOCK_SIZE).for_each(|k| unsafe {
        let row_k = *kj.get_unchecked(k);
        ij.iter_mut().enumerate().for_each(|(i, rows)| {
            rows.iter_mut()
                .enumerate()
                .map(|(j, ij)| {
                    (
                        *ik.get_unchecked(i).get_unchecked(k) + *row_k.get_unchecked(j),
                        ij,
                    )
                })
                .filter(|(sum, ij)| sum < *ij)
                .for_each(|(sum, ij)| {
                    *ij = sum;
                });
        });
    });
}

Do you have a runnable code sample that you can post?

Often these optimization questions end up being a combination of

  • the compiler not optimizing one version as well as another
  • the different versions aren't actually doing precisely the same thing, which leads to either
    • Extra operations in the slower versions
    • Better memory/cache efficiencies due to iteration order differences

One way to drill down on it is to look at the resulting assembly from the compilation. Even if you don't have a good understanding of assembly sometimes you can infer some things just by diffing the output of two different compilations.

All that said, if you do have some runnable code you can post, it would let people poke around it on their own instead of trying to diagnose the issue purely by looking at the code.

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.