what is the BLOCK_SIZE value you are using?
@elsuizo My BLOCK_SIZE is 128
Here is my code. Why is it so slow with ndarray???
extern crate ndarray;
use ndarray::prelude::*;

const BLOCK_SIZE: usize = 128;

type Matrix<T> = Array2<T>;

// Relax one BLOCK_SIZE x BLOCK_SIZE block `ij` against blocks `ik` and `kj`.
pub fn floyd_serial(ij: &mut Matrix<f32>, ik: Matrix<f32>, kj: Matrix<f32>, path: &mut Matrix<bool>, flag: bool) {
    for k in 0..BLOCK_SIZE {
        for i in 0..BLOCK_SIZE {
            for j in 0..BLOCK_SIZE {
                let sum = kj[[k, j]] + ik[[i, k]];
                let cell = &mut ij[[i, j]];
                if sum < *cell {
                    *cell = sum;
                    path[[i, j]] = flag;
                }
            }
        }
    }
}
Also, I changed the path matrix to boolean.
To test the code I really need a 4-dimensional array (a matrix of blocks) so that I can hand individual 2D blocks to the matrix function. In ndarray, if I use Array4 I cannot index with array4[[i, j]]; it forces me to index with array4[[i1, j1, i2, j2]]. Is there any way to make an Array2<Array2<f32>>? I can't figure out how to initialize it.
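Conceptually, this is the kind of thing I am after (just a sketch of the idea using Array2::from_elem to clone a zeroed block into every cell; I have not verified that this is the right approach, and the block-grid size here is hypothetical):

use ndarray::Array2;

const BLOCK_SIZE: usize = 128;
const N_BLOCKS: usize = 4; // hypothetical number of blocks per side

fn make_block_grid() -> Array2<Array2<f32>> {
    // An Array2 whose elements are themselves BLOCK_SIZE x BLOCK_SIZE blocks.
    Array2::from_elem(
        (N_BLOCKS, N_BLOCKS),
        Array2::<f32>::zeros((BLOCK_SIZE, BLOCK_SIZE)),
    )
}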
I know you asked about a parallel version in a different thread, but I figured I'd leave this here since this is where the cleanest serial version is. Here's a parallel version using rayon. For a BLOCK_SIZE of 128 it shows about a 2-3x improvement on my quad-core machines. For larger block sizes, the improvement quickly approaches 4x.
use rayon::prelude::*;

const BLOCK_SIZE: usize = 128;

pub type MatrixSlice<T> = [[T; BLOCK_SIZE]];
pub type Matrix = Box<MatrixSlice<f32>>;
pub type PMatrix = Box<MatrixSlice<i32>>;

pub fn floyd(
    ij: &mut MatrixSlice<f32>,
    ik: &MatrixSlice<f32>,
    kj: &MatrixSlice<f32>,
    path: &mut MatrixSlice<i32>,
    coord: i32,
) {
    // Parallelize over rows of `ij` (zipped with the matching rows of `ik` and `path`);
    // the k and j loops inside each row stay sequential.
    ij.into_par_iter()
        .zip(ik)
        .zip(path)
        .for_each(|((ij_i, ik_i), path_i)| {
            for k in 0..BLOCK_SIZE {
                let kj_k = kj[k];
                let ik_ik = ik_i[k];
                for j in 0..BLOCK_SIZE {
                    let sum = kj_k[j] + ik_ik;
                    let cell = &mut ij_i[j];
                    if sum < *cell {
                        *cell = sum;
                        path_i[j] = coord;
                    }
                }
            }
        })
}
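The snippet doesn't show how the boxed Matrix/PMatrix aliases get built; one way to do it (just a sketch, not my actual benchmark harness) is to collect the rows into a boxed slice:

fn new_matrix() -> Matrix {
    // BLOCK_SIZE rows of BLOCK_SIZE f32s, zero-initialized, on the heap
    vec![[0.0f32; BLOCK_SIZE]; BLOCK_SIZE].into_boxed_slice()
}

fn new_pmatrix() -> PMatrix {
    vec![[0i32; BLOCK_SIZE]; BLOCK_SIZE].into_boxed_slice()
}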
And for fun, here's an ndarray version that seems to perform even a little better, though it's a little harder to follow.
use ndarray::{Array2, Zip};

pub type Matrix = Array2<f32>;
pub type PMatrix = Array2<i32>;

pub fn floyd(ij: &mut Matrix, ik: &Matrix, kj: &Matrix, path: &mut PMatrix, coord: i32) {
    // The outer Zip runs in parallel over the rows of ij/ik/path.
    Zip::from(ij.genrows_mut())
        .and(ik.genrows())
        .and(path.genrows_mut())
        .par_apply(|mut ij_i, ik_i, mut path_i| {
            // Walk the rows of kj together with the elements of ik's row i.
            Zip::from(kj.genrows()).and(ik_i).apply(|kj_k, ik_ik| {
                Zip::from(&mut ij_i).and(kj_k).and(&mut path_i).apply(
                    |ij_ij, kj_kj, path_ij| {
                        let sum = kj_kj + ik_ik;
                        if sum < *ij_ij {
                            *ij_ij = sum;
                            *path_ij = coord;
                        }
                    },
                )
            })
        })
}
For anyone that's interested: I experimented with adding target-cpu=native to this, and on my fairly new laptop it made a huge difference for performance (~6x faster), because this problem can take advantage of SIMD. With native in particular, the parallel version was more sensitive to how the function is written; using zip-like iterators for everything helped ensure it reached the fastest speed. Here's the fastest version I came up with. With cpu=native it runs in 70µs on my machine, compared to 1200µs for the original version compiled without cpu=native. Computers are fun.
use rayon::prelude::*;

const BLOCK_SIZE: usize = 128;

pub type MatrixSlice<T> = [[T; BLOCK_SIZE]];

pub fn floyd(
    ij: &mut MatrixSlice<f32>,
    ik: &MatrixSlice<f32>,
    kj: &MatrixSlice<f32>,
    path: &mut MatrixSlice<i32>,
    coord: i32,
) {
    // Same row-level parallelism as before; the inner k and j loops are written
    // as zipped iterators, which helped reach full speed with target-cpu=native.
    ij.into_par_iter()
        .zip(ik)
        .zip(path)
        .for_each(|((ij_i, ik_i), path_i)| {
            kj.iter().zip(ik_i.iter()).for_each(|(kj_k, ik_ik)| {
                ij_i.iter_mut()
                    .zip(kj_k.iter())
                    .zip(path_i.iter_mut())
                    .for_each(|((ij_ij, kj_kj), path_ij)| {
                        let sum = kj_kj + ik_ik;
                        if sum < *ij_ij {
                            *ij_ij = sum;
                            *path_ij = coord;
                        }
                    })
            })
        })
}
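For anyone who wants to reproduce the numbers above: the flag is passed to rustc at build time, e.g. (assuming a release build)

RUSTFLAGS="-C target-cpu=native" cargo build --release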
Hello @drewkett,
This is my complete and fastest code.
I could not translate the last part of floyd() to ndarray.
If I used ndarray I could shrink the code a lot, because I could take column k directly with graph.column(k), for example.
The idea of this algorithm is to process the matrix in blocks to make it more efficient, so the parallelism happens inside the floyd() function.
Below is close to the best you can do. The trick with parallelizing code is to split the work into chunks small enough that all cores can be used efficiently at the same time, but no smaller, because parallelizing has synchronization overhead. In your example you've just put par_iter_mut everywhere. At every one of those locations, rayon has to split the iterator up and dispatch chunks of work to different threads. The trouble is that doing that inside an already-parallel chunk makes the inner parallelization fight the outer one for resources, which isn't what you want. Instead, you can basically just start one parallel iterator at the beginning of each chunk of work, like I've done; see the toy sketch below. All the par_iter_mut locations in my code are over iterators of length 128, which divide easily into segments of 32 for each of my four cores. Each core then runs its 32 items (which should all take roughly the same amount of time), and when all of them are done the program advances.
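To make that concrete, here's a toy sketch (not the Floyd code itself, just the pattern): only the outer iterator over rows is parallel, and the inner loop stays sequential, so rayon splits the work across cores exactly once.

use rayon::prelude::*;

// Toy example: scale every element of a 128x128 matrix.
// The outer row iterator is the only parallel one; the inner loop is a plain
// sequential iterator, so the 128 rows are divided across the cores once.
fn scale(rows: &mut [[f32; 128]], factor: f32) {
    rows.par_iter_mut().for_each(|row| {
        row.iter_mut().for_each(|x| *x *= factor);
    });
}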
I quickly tried ndarray as well. The code looked a lot cleaner, but it didn't perform quite as well for me. I think it has to do with the fact that your code works with fixed-size arrays, which makes it easier for the compiler to optimize, and possibly with your code benefiting from keeping some data on the stack. I didn't spend a lot of time trying to close the gap, though.
One other thing to point out when benchmarking this code: on my laptop I pretty quickly hit thermal limits on the parallel version, so I don't actually see anywhere near a 4x speedup due to CPU throttling. I would expect close to a 4x improvement with this code on a four-core server, though.
fn floyd_parallel(ij: &mut Matrix, ik: Matrix, kj: Matrix, path: &mut PMatrix, coord: i32) {
    ij.par_iter_mut()
        .zip(&mut path[..])
        .zip(&ik[..])
        .for_each(|((ij_i, path_i), ik_i)| {
            kj.iter().zip(ik_i.iter()).for_each(|(kj_k, ik_ik)| {
                ij_i.iter_mut()
                    .zip(&mut path_i[..])
                    .zip(&kj_k[..])
                    .for_each(|((cell, path_ij), kj_kj)| {
                        let sum = kj_kj + ik_ik;
                        if sum < *cell {
                            *cell = sum;
                            *path_ij = coord;
                        }
                    })
            })
        })
}
fn floyd(graph: &mut Box<BlockMatrix>, path: &mut Box<PBlockMatrix>) {
    (0..BLOCK_ELEMS).for_each(|k| {
        let coord = (k * BLOCK_ELEMS) as i32;
        let mut column_k = get_block_array(); // column k
        let mut row_k = get_block_array(); // row k
        // DEPENDENT PHASE
        let kk_mut = &mut graph[k][k];
        floyd_parallel(kk_mut, *kk_mut, *kk_mut, &mut path[k][k], coord);
        let kk = *kk_mut;
        // PARTIALLY DEPENDENT PHASE -- COLUMN K
        graph
            .par_iter_mut()
            .zip(&mut path[..])
            .zip(&mut column_k[..])
            .enumerate()
            .for_each(|(id, ((col, col_path), item_k))| {
                if id != k {
                    let col_k_view = col[k];
                    floyd_serial(&mut col[k], col_k_view, kk, &mut col_path[k], coord);
                    *item_k = col[k];
                }
            });
        // PARTIALLY DEPENDENT PHASE -- ROW K
        graph[k]
            .par_iter_mut()
            .zip(&mut path[k][..])
            .zip(&mut row_k[..])
            .enumerate()
            .for_each(|(id, ((row, row_path), item_k))| {
                if id != k {
                    floyd_serial(row, kk, *row, row_path, coord);
                    *item_k = *row;
                }
            });
        // INDEPENDENT PHASE
        graph
            .par_iter_mut()
            .zip(&mut path[..])
            .zip(&column_k[..])
            .enumerate()
            .for_each(|(id, ((rows, rows_path), ik))| {
                if id != k {
                    rows.iter_mut()
                        .zip(rows_path.iter_mut())
                        .zip(row_k.iter())
                        .enumerate()
                        .for_each(|(id, ((ij, ij_path), kj))| {
                            if id != k {
                                floyd_serial(ij, *ik, *kj, ij_path, coord);
                            }
                        });
                }
            });
    });
}
I agree with all your conclusions; the exact same thing happened to me.
It seems to me that the bottleneck is not the iterators but updating the matrices. If I do not update anything (that is, I comment out the body of the if sum < *cell), then with N=8192 and BLOCK_SIZE=128 the program takes ~150 ms. If I uncomment *cell = sum the program takes ~1.05 s, and if I also uncomment *path_ij = coord it takes ~1.7 s.
These two assignment lines are what slow the algorithm down.
I have the same algorithm in C using OpenMP SIMD and __builtin_expect (compiled with ICC), and that program takes ~1.15 s. There must be some limitation or problem with the assignments that I am not seeing.
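One thing I might try next (an untested sketch of a branchless formulation, not one of the versions I measured above) is to replace the if with selects, so the compiler has an easier time vectorizing the stores:

const BLOCK_SIZE: usize = 128;

// Untested sketch: a branchless version of the block update. Instead of the
// `if sum < *cell` branch, a comparison selects between the old and new values
// for both the distance and the path entry.
fn floyd_branchless(
    ij: &mut [[f32; BLOCK_SIZE]],
    ik: &[[f32; BLOCK_SIZE]],
    kj: &[[f32; BLOCK_SIZE]],
    path: &mut [[i32; BLOCK_SIZE]],
    coord: i32,
) {
    for k in 0..BLOCK_SIZE {
        for (i, (ij_i, path_i)) in ij.iter_mut().zip(path.iter_mut()).enumerate() {
            let ik_ik = ik[i][k];
            let kj_k = &kj[k];
            for j in 0..BLOCK_SIZE {
                let sum = kj_k[j] + ik_ik;
                let better = sum < ij_i[j];
                ij_i[j] = if better { sum } else { ij_i[j] };
                path_i[j] = if better { coord } else { path_i[j] };
            }
        }
    }
}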
Well, without the assignments the program doesn't really do much, so it makes sense that the time would go way down. If you never assign anything anywhere, the program could just return as soon as it starts, since it doesn't actually need to do any work to produce the desired outcome.
The fact that the ICC-compiled version performs better doesn't totally surprise me, in that I suspect it's doing a better job of pushing more of the work to SIMD. Have you checked that it produces the same results for both graph and path, though?
I suspect there might also be something with memory management that could be done a bit more efficiently, but that's more of a hunch than anything else.
Yes, I compared the binary outputs and the resulting matrix is the same.
The strange thing about the C program is that if I remove the path matrix, it is slower than Rust (yes, one C matrix vs. two Rust matrices). When I add the path array back in C, it ends up performing much better; it is very strange.
In turn, if I remove __builtin_expect from the C version it also gets much worse (this is because the if is usually false).
I just tried it: with __builtin_expect removed, the C version takes twice as long as the Rust one.