Best way to create Matrix of Matrices

ManuelCostanzo · August 26, 2020, 8:15pm

Hello, I need to create a matrix of matrices ([[[[f32; size]; size]; size]; size]) to access like matrix[blockI][blockJ][cellI][cellJ]. What is the most efficient way in memory to do this?

Then I would run through them with iterators

alice · August 26, 2020, 8:28pm

Is size a compile-time constant?

ManuelCostanzo · August 26, 2020, 8:28pm

Hello @alice, yes, it is a compile-time constant

alice · August 26, 2020, 8:35pm

There's a crate for doing it without unsafe, but here's the quick way.

use std::alloc::{Layout, alloc_zeroed};

const size: usize = 100;
type MyArray = [[[[f32; size]; size]; size]; size];

fn create_matrix() -> Box<MyArray> {
    unsafe {
        let layout = Layout::new::<MyArray>();
        let ptr = alloc_zeroed(layout);
        Box::from_raw(ptr as *mut MyArray)
    }
}

Trying to do it directly with Box::new runs into trouble with stack overflows.

alice · August 26, 2020, 8:39pm

You can also investigate the ndarray crate.

ManuelCostanzo · August 27, 2020, 3:41am

Great, thank you !

For some reason, ndarray loses a lot of performance when inserting a data that using box

s3bk · August 27, 2020, 7:58am

Almost all matrix/array libraries use fixed size arrays.
Inserting should be avoided at all cost as it definitely requires copying memory and often reallocating.

Array4<f32> should do the trick.

ManuelCostanzo · August 27, 2020, 12:56pm

Hello @s3bk,

If i have this code, do you think that i can improve the performance by using ndarray ? Because i can't do it.

type Matrix = [[f32; BLOCK_SIZE]; BLOCK_SIZE];

fn floyd_serial(ij: &mut Matrix, ik: Matrix, kj: Matrix) {
    kj.iter().enumerate().for_each(|(k, kj)| unsafe {
        zip(ij.iter_mut(), ik.iter().map(|ik| ik.get_unchecked(k))).for_each(|(ij, ik_cell)| {
            zip(ij.iter_mut(), kj.iter())
                .map(|(cell, kj_cell)| (ik_cell + *kj_cell, cell))
                .for_each(|(sum, cell)| {
                    if sum < *cell {
                        *cell = sum;
                    }
                });
        });
    });
}

The problem with ndarray is the *cell = sum, takes too long.

Obviously this is an "Array2" but this is the most consuming part of my code.

s3bk · August 27, 2020, 1:18pm

One factor is that your matrix size is fixed at compile time, where as ndarray uses variable size.

But sure, it is possible to handwrite code for a specific purpose that is faster than a general purpose library.

EDIT: I guess you may want to take the 2nd and 3rd argument by reference?

ManuelCostanzo · August 27, 2020, 1:20pm

Great, I understand.

Can I ask you what is your analysis of the code I passed ? This would be the equivalent code without iterators, do you think you could do some optimization ?

fn floyd_serial(ij: &mut Matrix, ik: Matrix, kj: Matrix) {
(0..BLOCK_SIZE).for_each(|k| unsafe {
    let kj = *kj.get_unchecked(k);
    (0..BLOCK_SIZE).for_each(|i| {
        let ik = ik.get_unchecked(i).get_unchecked(k);
        let row = ij.get_unchecked_mut(i);
        (0..BLOCK_SIZE).for_each(|j| {
            let sum = ik + kj.get_unchecked(j);
            let ij = row.get_unchecked_mut(j);
            if sum < *ij {
                *ij = sum;
            }
        });
    });
});

}

ManuelCostanzo · August 27, 2020, 1:21pm

Well, i need to modify only the first argument. Second and third are other Matrix that i don't need to modify

s3bk · August 27, 2020, 1:32pm

I would probably write it like this:

fn floyd_serial(ij: &mut Matrix, ik: Matrix, kj: Matrix) {
    for k in 0 .. BLOCK_SIZE {
        for i in 0 .. BLOCK_SIZE {
            for j in 0 .. BLOCK_SIZE {
                let sum = ik[i][k] + kj[k][j]
                let cell = &mut ij[i][j];
                if sum > *cell {
                    *cell = sum;
                }
            }
        }
    }
}

The bounds are known and the compiler can see that all your indices are always in bounds.
So I don't expect any bounds checks in the compiled code.

s3bk · August 27, 2020, 1:34pm

looks very clean

ManuelCostanzo · August 27, 2020, 1:43pm

Wow yes, this code is very clean. So when is it better to use iterators ? Why is it not necessary in this case ?

s3bk · August 27, 2020, 2:31pm

The main advantage for iterators on slices is that it avoids bounds checking twice.
They add complexity tho… which can trip up the optimizer.

Here the optimizer can eliminate the bounds checks, so iterators have no advantage.

ManuelCostanzo · August 27, 2020, 3:33pm

Excelent ! You and Alice helped me a lot.

Well I think this would be the best version. I suppose that no library as ndarray or nalgebra do something different that improves the performance

ManuelCostanzo · August 27, 2020, 4:13pm

assuming that each cell of the matrix has 2 elements, if you build a struct with 2 elements the performance is terrible. Should I use 2 different matrices ?

#[derive(Copy, Clone)]
struct A {
  cell: f32,
  path: i32,
}

type Matrix = [[A; BLOCK_SIZE]; BLOCK_SIZE];

s3bk · August 27, 2020, 4:20pm

Yes, split it up. SIMD works best with uniform data in memory.

ManuelCostanzo · August 27, 2020, 9:34pm

I'm sorry to keep this up. Here is the code, it works perfect. With a sample dataset it takes me ~1.06 seconds.

fn floyd_serial(ij: &mut Matrix, ik: Matrix, kj: Matrix) {
    for k in 0..BLOCK_SIZE {
        for i in 0..BLOCK_SIZE {
            for j in 0..BLOCK_SIZE {
                let sum = kj[k][j] + ik[i][k];
                let cell = &mut ij[i][j];
                if sum < *cell {
                    *cell = sum;
                }
            }
        }
    }
}

This other code adds a matrix and adds 600 milliseconds more (takes 1.6 seconds). (PMatrix is the same as Matrix but with i32 instead of f32).

fn floyd_serial(ij: &mut Matrix, ik: Matrix, kj: Matrix, path: &mut PMatrix, coord: i32) {
for k in 0..BLOCK_SIZE {
    for i in 0..BLOCK_SIZE {
        for j in 0..BLOCK_SIZE {
            let sum = kj[k][j] + ik[i][k];
            let cell = &mut ij[i][j];
            if sum < *cell {
                *cell = sum;
                path[i][j] = coord;
            }
        }
    }
}

}

If in both algorithms I comment on the sentence that it assigns, obviously my algorithm does not take anything, then the expensive part of this is in the assignment (the last sentence).

Is there any way to improve the time in the assignment ? I am using jemallocator and it improves the time. I compile it in the following way: RUSTFLAGS='-C target-cpu=native' cargo run --release. I have already configured

[profile.dev]
opt-level = 3

s3bk · August 28, 2020, 7:47am

That looks like a reasonable slowdown to me.

Topic		Replies	Views
Performance issue with C-array like computation (2 times worst than naive java) help	48	6536	January 12, 2023
Why are cartesian iterators slower than nested fors?	19	1812	August 20, 2020
Looking for help understanding Rust's performance vs C++ community	28	8650	November 1, 2019
How to parallelize this code? help	21	1428	December 28, 2020
Very fast initialization of a Vec of Vecs help	40	9673	July 22, 2020

Best way to create Matrix of Matrices

Related topics