Hello! I have a problem and I need to ask for help.
I have the following code:
fn floyd_parallel(graph: &mut Matrix) {
    (0..BLOCK_SIZE).for_each(|k| unsafe {
        // Snapshot row k and column k for this iteration of k.
        let row_k = *graph.get_unchecked(k);
        let column_k = get_column(&graph, k);
        // Parallel over rows...
        graph.par_iter_mut().enumerate().for_each(|(i, rows)| {
            let ik = *column_k.get_unchecked(i);
            // ...and parallel again over the cells of each row.
            rows.par_iter_mut()
                .zip(row_k.par_iter())
                .for_each(|(ij, kj)| {
                    let sum = ik + *kj;
                    if sum < *ij {
                        *ij = sum;
                    }
                });
        });
    });
}
My problem is that if I remove the last par_iter_mut, the program runs faster. But the surprising thing is that if I replace all the par_iter_mut calls with iter_mut, the code is much faster (almost twice as fast).
Why does this happen? What am I missing? The same logic works perfectly with OpenMP in C, but with Rayon I'm having problems.
I can't parallelize the k loop because each iteration depends on the previous one.
Having two nested for_each forces the inner iterator to completely finish before proceeding to the next element in the outer for_each.
You're probably paying the setup cost of rows.par_iter_mut() over and over again, and within each row there's too little work to do, so it doesn't even run long enough for other threads to pick up any work from the inner parallel iterator.
If graph.par_iter_mut() has more elements than you have CPU cores, that's all you need.
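For example, something with this shape (only a sketch: I'm assuming Matrix is a Vec<Vec<u32>> and that cloning row k and column k, which is presumably what your get_column does, is acceptable) keeps all the parallelism in the outer loop and leaves the per-row update sequential:

use rayon::prelude::*;

type Matrix = Vec<Vec<u32>>;    // assumption: substitute your real Matrix type
const BLOCK_SIZE: usize = 1024; // assumption

fn floyd_outer_parallel(graph: &mut Matrix) {
    for k in 0..BLOCK_SIZE {
        // Snapshot row k and column k so every worker can read them freely.
        let row_k = graph[k].clone();
        let column_k: Vec<u32> = graph.iter().map(|row| row[k]).collect();
        // Parallelism only over whole rows: each thread owns its row exclusively,
        // so there is no per-row iterator setup and no sharing of a row between threads.
        graph.par_iter_mut().enumerate().for_each(|(i, row)| {
            let ik = column_k[i];
            for (ij, &kj) in row.iter_mut().zip(row_k.iter()) {
                let sum = ik + kj;
                if sum < *ij {
                    *ij = sum;
                }
            }
        });
    }
}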
Maybe instead of using two nested for_each, try flat_map and one for_each?
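Roughly like this (again just a sketch, reusing the row_k/column_k names from your snippet): flattening both loops gives rayon one big pool of cells to split, instead of re-splitting every row separately.

// One flattened parallel iterator over (i, j, &mut cell).
graph
    .par_iter_mut()
    .enumerate()
    .flat_map(|(i, row)| {
        row.par_iter_mut()
            .enumerate()
            .map(move |(j, ij)| (i, j, ij))
    })
    .for_each(|(i, j, ij)| {
        let sum = column_k[i] + row_k[j];
        if sum < *ij {
            *ij = sum;
        }
    });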
Another problem is false sharing. If two threads write to adjacent locations in memory (like two elements in the same row), that forces the CPU to synchronize caches, and that's super slow.
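If you do want to keep the inner parallel iterator anyway, with_min_len at least stops rayon from splitting a row into tiny, cache-line-sharing pieces (the 256 below is just a placeholder, not a tuned value):

rows.par_iter_mut()
    .zip(row_k.par_iter())
    .with_min_len(256) // never hand a thread fewer than 256 cells of this row
    .for_each(|(ij, kj)| {
        let sum = ik + *kj;
        if sum < *ij {
            *ij = sum;
        }
    });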
Thank you, you are right. But I still have a big question about this:
These sequential algorithms are equivalent, but I don't understand why the first one is the fastest. I understand that Rust encourages using iterators, so why is the first one the best?
Algorithm 1
(0..BLOCK_SIZE).for_each(|k| unsafe {
    let kj = *kj.get_unchecked(k);
    (0..BLOCK_SIZE).for_each(|i| {
        let ik = ik.get_unchecked(i).get_unchecked(k);
        let row = ij.get_unchecked_mut(i);
        (0..BLOCK_SIZE).for_each(|j| {
            let sum = ik + kj.get_unchecked(j);
            let ij = row.get_unchecked_mut(j);
            if sum < *ij {
                *ij = sum;
            }
        });
    });
});
Algorithm 2
kj.iter().enumerate().for_each(|(k, kj)| unsafe {
    ij.iter_mut().zip(ik.iter()).for_each(|(ij, ik)| {
        let ik = *ik.get_unchecked(k);
        ij.iter_mut().zip(kj.iter()).for_each(|(ij, kj)| {
            let sum = ik + *kj;
            if sum < *ij {
                *ij = sum;
            }
        });
    });
});
Do you have a runnable code sample that you can post?
Often these optimization questions end up being a combination of:
- the compiler not optimizing one version as well as another
- the different versions not actually doing precisely the same thing, which leads to either
  - extra operations in the slower versions, or
  - better memory/cache efficiency due to iteration-order differences
One way to drill down on it is to look at the resulting assembly from the compilation. Even if you don't have a good understanding of assembly, sometimes you can infer some things just by diffing the output of two different compilations.
All that said, if you do have some runnable code you can post, it would let people poke around on their own rather than trying to diagnose the issue purely by reading the code.
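In the meantime, even a tiny harness like this (just a sketch; for serious numbers a real benchmark crate is better) makes the comparison easier to share. black_box keeps the optimizer from throwing the work away:

use std::hint::black_box;
use std::time::Instant;

fn time_it<F: FnOnce()>(label: &str, f: F) {
    let start = Instant::now();
    f();
    println!("{label}: {:?}", start.elapsed());
}

fn main() {
    // Placeholder input with the same shape as a BLOCK_SIZE x BLOCK_SIZE matrix.
    let mut graph = vec![vec![1u32; 1024]; 1024];

    time_it("version A", || {
        // call the first version here, e.g. the plain-iterator one
        black_box(&mut graph);
    });
    time_it("version B", || {
        // call the second version here, e.g. the rayon one
        black_box(&mut graph);
    });
}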