what is the BLOCK_SIZE value you are using?
@elsuizo My BLOCK_SIZE is 128
Here is my code. Why is it so slow with ndarray???
extern crate ndarray;
use ndarray::prelude::*;

const BLOCK_SIZE: usize = 128;

type Matrix<T> = Array2<T>;

// Relax one BLOCK_SIZE x BLOCK_SIZE block `ij` against blocks `ik` and `kj`.
pub fn floyd_serial(ij: &mut Matrix<f32>, ik: Matrix<f32>, kj: Matrix<f32>, path: &mut Matrix<bool>, flag: bool) {
    for k in 0..BLOCK_SIZE {
        for i in 0..BLOCK_SIZE {
            for j in 0..BLOCK_SIZE {
                let sum = kj[[k, j]] + ik[[i, k]];
                let cell = &mut ij[[i, j]];
                if sum < *cell {
                    *cell = sum;
                    path[[i, j]] = flag;
                }
            }
        }
    }
}
Also, I changed the path matrix to boolean.
To test the code I really need a 4-dimensional array (a matrix of blocks) so that I can hand individual 2D blocks to the matrix function. In ndarray, if I use Array4 I cannot index with array4[[i, j]]; it forces me to index with array4[[i1, j1, i2, j2]]. Is there any way to make an Array2<Array2<f32>>? I can't figure out how to initialize it.
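Conceptually, this is the kind of thing I am after (just a sketch of the idea using Array2::from_elem to clone a zeroed block into every cell; I have not verified that this is the right approach, and the block-grid size here is hypothetical):

use ndarray::Array2;

const BLOCK_SIZE: usize = 128;
const N_BLOCKS: usize = 4; // hypothetical number of blocks per side

fn make_block_grid() -> Array2<Array2<f32>> {
    // An Array2 whose elements are themselves BLOCK_SIZE x BLOCK_SIZE blocks.
    Array2::from_elem(
        (N_BLOCKS, N_BLOCKS),
        Array2::<f32>::zeros((BLOCK_SIZE, BLOCK_SIZE)),
    )
}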
I know you asked about a parallel version in a different thread, but I figured I'd leave this here since this is where the cleanest serial version is. Here's a parallel version using rayon. For a BLOCK_SIZE of 128 it shows about a 2-3x improvement on my quad-core machines. For larger block sizes, the improvement quickly approaches 4x.
use rayon::prelude::*;

const BLOCK_SIZE: usize = 128;

pub type MatrixSlice<T> = [[T; BLOCK_SIZE]];
pub type Matrix = Box<MatrixSlice<f32>>;
pub type PMatrix = Box<MatrixSlice<i32>>;

pub fn floyd(
    ij: &mut MatrixSlice<f32>,
    ik: &MatrixSlice<f32>,
    kj: &MatrixSlice<f32>,
    path: &mut MatrixSlice<i32>,
    coord: i32,
) {
    // Parallelize over rows of `ij` (zipped with the matching rows of `ik` and `path`);
    // the k and j loops inside each row stay sequential.
    ij.into_par_iter()
        .zip(ik)
        .zip(path)
        .for_each(|((ij_i, ik_i), path_i)| {
            for k in 0..BLOCK_SIZE {
                let kj_k = kj[k];
                let ik_ik = ik_i[k];
                for j in 0..BLOCK_SIZE {
                    let sum = kj_k[j] + ik_ik;
                    let cell = &mut ij_i[j];
                    if sum < *cell {
                        *cell = sum;
                        path_i[j] = coord;
                    }
                }
            }
        })
}
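The snippet doesn't show how the boxed Matrix/PMatrix aliases get built; one way to do it (just a sketch, not my actual benchmark harness) is to collect the rows into a boxed slice:

fn new_matrix() -> Matrix {
    // BLOCK_SIZE rows of BLOCK_SIZE f32s, zero-initialized, on the heap
    vec![[0.0f32; BLOCK_SIZE]; BLOCK_SIZE].into_boxed_slice()
}

fn new_pmatrix() -> PMatrix {
    vec![[0i32; BLOCK_SIZE]; BLOCK_SIZE].into_boxed_slice()
}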
And for fun, here's an ndarray version that seems to perform even a little better, though it's a little harder to follow.
use ndarray::{Array2, Zip};

pub type Matrix = Array2<f32>;
pub type PMatrix = Array2<i32>;

pub fn floyd(ij: &mut Matrix, ik: &Matrix, kj: &Matrix, path: &mut PMatrix, coord: i32) {
    // The outer Zip runs in parallel over the rows of ij/ik/path.
    Zip::from(ij.genrows_mut())
        .and(ik.genrows())
        .and(path.genrows_mut())
        .par_apply(|mut ij_i, ik_i, mut path_i| {
            // Walk the rows of kj together with the elements of ik's row i.
            Zip::from(kj.genrows()).and(ik_i).apply(|kj_k, ik_ik| {
                Zip::from(&mut ij_i).and(kj_k).and(&mut path_i).apply(
                    |ij_ij, kj_kj, path_ij| {
                        let sum = kj_kj + ik_ik;
                        if sum < *ij_ij {
                            *ij_ij = sum;
                            *path_ij = coord;
                        }
                    },
                )
            })
        })
}
For anyone that's interested: I experimented with adding target-cpu=native to this, and on my fairly new laptop it made a huge difference for performance (~6x faster), because this problem can take advantage of SIMD. With native in particular, the parallel version was more sensitive to how the function is written; using zip-like iterators for everything helped ensure it reached the fastest speed. Here's the fastest version I came up with. With cpu=native it runs in 70µs on my machine, compared to 1200µs for the original version compiled without cpu=native. Computers are fun.
use rayon::prelude::*;

const BLOCK_SIZE: usize = 128;

pub type MatrixSlice<T> = [[T; BLOCK_SIZE]];

pub fn floyd(
    ij: &mut MatrixSlice<f32>,
    ik: &MatrixSlice<f32>,
    kj: &MatrixSlice<f32>,
    path: &mut MatrixSlice<i32>,
    coord: i32,
) {
    // Same row-level parallelism as before; the inner k and j loops are written
    // as zipped iterators, which helped reach full speed with target-cpu=native.
    ij.into_par_iter()
        .zip(ik)
        .zip(path)
        .for_each(|((ij_i, ik_i), path_i)| {
            kj.iter().zip(ik_i.iter()).for_each(|(kj_k, ik_ik)| {
                ij_i.iter_mut()
                    .zip(kj_k.iter())
                    .zip(path_i.iter_mut())
                    .for_each(|((ij_ij, kj_kj), path_ij)| {
                        let sum = kj_kj + ik_ik;
                        if sum < *ij_ij {
                            *ij_ij = sum;
                            *path_ij = coord;
                        }
                    })
            })
        })
}
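For anyone who wants to reproduce the numbers above: the flag is passed to rustc at build time, e.g. (assuming a release build)

RUSTFLAGS="-C target-cpu=native" cargo build --release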
Hello @drewkett,
This is my complete and fastest code.
I could not translate the last part of floyd() to ndarray.
If I used ndarray I could shrink the code a lot, because I could take column k directly with graph.column(k), for example.
The idea of this algorithm is to process the matrix in blocks to make it more efficient, so the parallelism happens inside the floyd() function.
Below is close to the best you can do. The trick with parallelizing code is to split the work into chunks small enough that all cores can be used efficiently at the same time, but no smaller, because parallelizing has synchronization overhead. In your example you've just put par_iter_mut everywhere. At every one of those locations, rayon has to split the iterator up and dispatch chunks of work to different threads. The trouble is that doing that inside an already-parallel chunk makes the inner parallelization fight the outer one for resources, which isn't what you want. Instead, you can basically just start one parallel iterator at the beginning of each chunk of work, like I've done; see the toy sketch below. All the par_iter_mut locations in my code are over iterators of length 128, which divide easily into segments of 32 for each of my four cores. Each core then runs its 32 items (which should all take roughly the same amount of time), and when all of them are done the program advances.
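To make that concrete, here's a toy sketch (not the Floyd code itself, just the pattern): only the outer iterator over rows is parallel, and the inner loop stays sequential, so rayon splits the work across cores exactly once.

use rayon::prelude::*;

// Toy example: scale every element of a 128x128 matrix.
// The outer row iterator is the only parallel one; the inner loop is a plain
// sequential iterator, so the 128 rows are divided across the cores once.
fn scale(rows: &mut [[f32; 128]], factor: f32) {
    rows.par_iter_mut().for_each(|row| {
        row.iter_mut().for_each(|x| *x *= factor);
    });
}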
I quickly tried ndarray as well. The code looked a lot cleaner, but it didn't perform quite as well for me. I think it has to do with the fact that your code works with fixed-size arrays, which makes it easier for the compiler to optimize, and possibly with your code benefiting from keeping some data on the stack. I didn't spend a lot of time trying to close the gap, though.
One other thing to point out when benchmarking this code: on my laptop I pretty quickly hit thermal limits on the parallel version, so I don't actually see anywhere near a 4x speedup due to CPU throttling. I would expect close to a 4x improvement with this code on a four-core server, though.
fn floyd_parallel(ij: &mut Matrix, ik: Matrix, kj: Matrix, path: &mut PMatrix, coord: i32) {
    ij.par_iter_mut()
        .zip(&mut path[..])
        .zip(&ik[..])
        .for_each(|((ij_i, path_i), ik_i)| {
            kj.iter().zip(ik_i.iter()).for_each(|(kj_k, ik_ik)| {
                ij_i.iter_mut()
                    .zip(&mut path_i[..])
                    .zip(&kj_k[..])
                    .for_each(|((cell, path_ij), kj_kj)| {
                        let sum = kj_kj + ik_ik;
                        if sum < *cell {
                            *cell = sum;
                            *path_ij = coord;
                        }
                    })
            })
        })
}
fn floyd(graph: &mut Box<BlockMatrix>, path: &mut Box<PBlockMatrix>) {
    (0..BLOCK_ELEMS).for_each(|k| {
        let coord = (k * BLOCK_ELEMS) as i32;
        let mut column_k = get_block_array(); // column k
        let mut row_k = get_block_array(); // row k
        // DEPENDENT PHASE
        let kk_mut = &mut graph[k][k];
        floyd_parallel(kk_mut, *kk_mut, *kk_mut, &mut path[k][k], coord);
        let kk = *kk_mut;
        // PARTIALLY DEPENDENT PHASE -- COLUMN K
        graph
            .par_iter_mut()
            .zip(&mut path[..])
            .zip(&mut column_k[..])
            .enumerate()
            .for_each(|(id, ((col, col_path), item_k))| {
                if id != k {
                    let col_k_view = col[k];
                    floyd_serial(&mut col[k], col_k_view, kk, &mut col_path[k], coord);
                    *item_k = col[k];
                }
            });
        // PARTIALLY DEPENDENT PHASE -- ROW K
        graph[k]
            .par_iter_mut()
            .zip(&mut path[k][..])
            .zip(&mut row_k[..])
            .enumerate()
            .for_each(|(id, ((row, row_path), item_k))| {
                if id != k {
                    floyd_serial(row, kk, *row, row_path, coord);
                    *item_k = *row;
                }
            });
        // INDEPENDENT PHASE
        graph
            .par_iter_mut()
            .zip(&mut path[..])
            .zip(&column_k[..])
            .enumerate()
            .for_each(|(id, ((rows, rows_path), ik))| {
                if id != k {
                    rows.iter_mut()
                        .zip(rows_path.iter_mut())
                        .zip(row_k.iter())
                        .enumerate()
                        .for_each(|(id, ((ij, ij_path), kj))| {
                            if id != k {
                                floyd_serial(ij, *ik, *kj, ij_path, coord);
                            }
                        });
                }
            });
    });
}
I agree with all your conclusions; the exact same thing happened to me.
It seems to me that the bottleneck is not the iterators but updating the matrices. If I do not update anything (that is, I comment out the body of the if sum < *cell), then with N=8192 and BLOCK_SIZE=128 the program takes ~150 ms. If I uncomment *cell = sum the program takes ~1.05 s, and if I also uncomment *path_ij = coord it takes ~1.7 s.
These two assignment lines are what slow the algorithm down.
I have the same algorithm in C using OpenMP SIMD and __builtin_expect (compiled with ICC), and that program takes ~1.15 s. There must be some limitation or problem with the assignments that I am not seeing.
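One thing I might try next (an untested sketch of a branchless formulation, not one of the versions I measured above) is to replace the if with selects, so the compiler has an easier time vectorizing the stores:

const BLOCK_SIZE: usize = 128;

// Untested sketch: a branchless version of the block update. Instead of the
// `if sum < *cell` branch, a comparison selects between the old and new values
// for both the distance and the path entry.
fn floyd_branchless(
    ij: &mut [[f32; BLOCK_SIZE]],
    ik: &[[f32; BLOCK_SIZE]],
    kj: &[[f32; BLOCK_SIZE]],
    path: &mut [[i32; BLOCK_SIZE]],
    coord: i32,
) {
    for k in 0..BLOCK_SIZE {
        for (i, (ij_i, path_i)) in ij.iter_mut().zip(path.iter_mut()).enumerate() {
            let ik_ik = ik[i][k];
            let kj_k = &kj[k];
            for j in 0..BLOCK_SIZE {
                let sum = kj_k[j] + ik_ik;
                let better = sum < ij_i[j];
                ij_i[j] = if better { sum } else { ij_i[j] };
                path_i[j] = if better { coord } else { path_i[j] };
            }
        }
    }
}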
Well, without the assignments the program doesn't really do much, so it makes sense that the time would go way down. If you never assign anything anywhere, the program could just return as soon as it starts, since it doesn't actually need to do any work to produce the desired outcome.
The fact that the ICC-compiled version performs better doesn't totally surprise me, in that I suspect it's doing a better job of pushing more of the work to SIMD. Have you checked that it produces the same results for both graph and path, though?
I suspect there might also be something with memory management that could be done a bit more efficiently, but that's more of a hunch than anything else.
Yes, I compared the binary outputs and the resulting matrix is the same.
The strange thing about the C program is that if I remove the path matrix, it is slower than Rust (yes, one C matrix vs. two Rust matrices). When I add the path array back in C, it ends up performing much better; it is very strange.
In turn, if I remove __builtin_expect from the C version it also gets much worse (this is because the if is usually false).
I just tried it: with __builtin_expect removed, the C version takes twice as long as the Rust one.