Best way to create Matrix of Matrices

Hello, I need to create a matrix of matrices ([[[[f32; size]; size]; size]; size]) to access like matrix[blockI][blockJ][cellI][cellJ]. What is the most efficient way in memory to do this?

Then I would run through them with iterators

Is size a compile-time constant?

Hello @alice, yes, it is a compile-time constant

There's a crate for doing it without unsafe, but here's the quick way.

use std::alloc::{Layout, alloc_zeroed};

const size: usize = 100;
type MyArray = [[[[f32; size]; size]; size]; size];

fn create_matrix() -> Box<MyArray> {
    unsafe {
        let layout = Layout::new::<MyArray>();
        let ptr = alloc_zeroed(layout);
        Box::from_raw(ptr as *mut MyArray)
    }
}

Trying to do it directly with Box::new runs into trouble with stack overflows.

You can also investigate the ndarray crate.

1 Like

Great, thank you !

For some reason, ndarray loses a lot of performance when inserting a data that using box

Almost all matrix/array libraries use fixed size arrays.
Inserting should be avoided at all cost as it definitely requires copying memory and often reallocating.

Array4<f32> should do the trick.

2 Likes

Hello @s3bk,

If i have this code, do you think that i can improve the performance by using ndarray ? Because i can't do it.

type Matrix = [[f32; BLOCK_SIZE]; BLOCK_SIZE];

fn floyd_serial(ij: &mut Matrix, ik: Matrix, kj: Matrix) {
    kj.iter().enumerate().for_each(|(k, kj)| unsafe {
        zip(ij.iter_mut(), ik.iter().map(|ik| ik.get_unchecked(k))).for_each(|(ij, ik_cell)| {
            zip(ij.iter_mut(), kj.iter())
                .map(|(cell, kj_cell)| (ik_cell + *kj_cell, cell))
                .for_each(|(sum, cell)| {
                    if sum < *cell {
                        *cell = sum;
                    }
                });
        });
    });
}

The problem with ndarray is the *cell = sum, takes too long.

Obviously this is an "Array2" but this is the most consuming part of my code.

One factor is that your matrix size is fixed at compile time, where as ndarray uses variable size.

But sure, it is possible to handwrite code for a specific purpose that is faster than a general purpose library.

EDIT: I guess you may want to take the 2nd and 3rd argument by reference?

Great, I understand.

Can I ask you what is your analysis of the code I passed ? This would be the equivalent code without iterators, do you think you could do some optimization ?

fn floyd_serial(ij: &mut Matrix, ik: Matrix, kj: Matrix) {
(0..BLOCK_SIZE).for_each(|k| unsafe {
    let kj = *kj.get_unchecked(k);
    (0..BLOCK_SIZE).for_each(|i| {
        let ik = ik.get_unchecked(i).get_unchecked(k);
        let row = ij.get_unchecked_mut(i);
        (0..BLOCK_SIZE).for_each(|j| {
            let sum = ik + kj.get_unchecked(j);
            let ij = row.get_unchecked_mut(j);
            if sum < *ij {
                *ij = sum;
            }
        });
    });
});

}

Well, i need to modify only the first argument. Second and third are other Matrix that i don't need to modify

I would probably write it like this:

fn floyd_serial(ij: &mut Matrix, ik: Matrix, kj: Matrix) {
    for k in 0 .. BLOCK_SIZE {
        for i in 0 .. BLOCK_SIZE {
            for j in 0 .. BLOCK_SIZE {
                let sum = ik[i][k] + kj[k][j]
                let cell = &mut ij[i][j];
                if sum > *cell {
                    *cell = sum;
                }
            }
        }
    }
}

The bounds are known and the compiler can see that all your indices are always in bounds.
So I don't expect any bounds checks in the compiled code.

looks very clean

Wow yes, this code is very clean. So when is it better to use iterators ? Why is it not necessary in this case ?

The main advantage for iterators on slices is that it avoids bounds checking twice.
They add complexity tho… which can trip up the optimizer.

Here the optimizer can eliminate the bounds checks, so iterators have no advantage.

Excelent ! You and Alice helped me a lot.

Well I think this would be the best version. I suppose that no library as ndarray or nalgebra do something different that improves the performance

assuming that each cell of the matrix has 2 elements, if you build a struct with 2 elements the performance is terrible. Should I use 2 different matrices ?

#[derive(Copy, Clone)]
struct A {
  cell: f32,
  path: i32,
}

type Matrix = [[A; BLOCK_SIZE]; BLOCK_SIZE];

Yes, split it up. SIMD works best with uniform data in memory.

I'm sorry to keep this up. Here is the code, it works perfect. With a sample dataset it takes me ~1.06 seconds.

fn floyd_serial(ij: &mut Matrix, ik: Matrix, kj: Matrix) {
    for k in 0..BLOCK_SIZE {
        for i in 0..BLOCK_SIZE {
            for j in 0..BLOCK_SIZE {
                let sum = kj[k][j] + ik[i][k];
                let cell = &mut ij[i][j];
                if sum < *cell {
                    *cell = sum;
                }
            }
        }
    }
}

This other code adds a matrix and adds 600 milliseconds more (takes 1.6 seconds). (PMatrix is the same as Matrix but with i32 instead of f32).

fn floyd_serial(ij: &mut Matrix, ik: Matrix, kj: Matrix, path: &mut PMatrix, coord: i32) {
for k in 0..BLOCK_SIZE {
    for i in 0..BLOCK_SIZE {
        for j in 0..BLOCK_SIZE {
            let sum = kj[k][j] + ik[i][k];
            let cell = &mut ij[i][j];
            if sum < *cell {
                *cell = sum;
                path[i][j] = coord;
            }
        }
    }
}

}

If in both algorithms I comment on the sentence that it assigns, obviously my algorithm does not take anything, then the expensive part of this is in the assignment (the last sentence).

Is there any way to improve the time in the assignment ? I am using jemallocator and it improves the time. I compile it in the following way: RUSTFLAGS='-C target-cpu=native' cargo run --release. I have already configured

[profile.dev]
opt-level = 3

That looks like a reasonable slowdown to me.