Efficient allocation of empty 2D matrix with very long rows

To allocate a 2D matrix with very long rows I have in the end created a function like:

use std::convert::TryInto;

const NC: usize = 2_000_000;
const NR: usize = 3;

fn foo1() -> Box<[[f64; NC]; NR]> {
    let mat: Box<[f64; NR * NC]> =
        vec![0.0; NR * NC]
        .into_boxed_slice()
        .try_into()
        .unwrap();

    unsafe { std::mem::transmute(mat) }
}

That currently compiles down to good enough asm:

foo1:
        push    rax
        mov     edi, 48000000
        mov     esi, 8
        call    qword ptr [rip + __rust_alloc_zeroed@GOTPCREL]
        test    rax, rax
        je      .LBB0_1
        pop     rcx
        ret
.LBB0_1:
        mov     edi, 48000000
        mov     esi, 8
        call    qword ptr [rip + alloc::alloc::handle_alloc_error@GOTPCREL]
        ud2

I've tried numerous other solutions, including the basic:

vec![[0_f64; NC]; NR]

box [[0_f64; NC]; NR]

And so on. But they cause stack overflow in debug builds, or they don't call __rust_alloc_zeroed (and they allocate and zero the buffer later, this costs some run time).

I think having well working dynamic arrays is important in a system language. So I hope those two problems (stack overflow for something that should be purely heap allocated, and calling only rust_alloc_zeroed for a zero filled 2D matrix) will be fixed.

In the meantime do you know if there's a less unsafe and/or simpler solution compared to foo1?

(I prefer to keep the compile-time knowledge of both the number of rows and columns because this helps the compiler elide more array bound tests later from the code.)

If I am not mistaken, don't these create matrices with long columns?
Also, when we do have long rows and short columns, as in the code here, I don't see any stack-overflow with cargo run. The thread::sleep() is there to give me time to check htop. And as expected, about 2.5GB is associated with this the process, which I can assume is almost completely on the heap

Fixed, thank you :slight_smile:

Also, when we do have long rows and short columns, as in the code [here], I don't see any stack-overflow with cargo run .

Right, the problem happens to me (rustc 1.58.0-nightly 2021-11-04 x86_64-pc-windows-gnu) with long rows (but the rows must be fixed-sized arrays, of course).

I suppose with an external crate it can be done. E.g.

const NC: usize = 2_000_000;
const NR: usize = 3;

pub fn foo1() -> Box<[[f64; NR]; NC]> {
    bytemuck::allocation::zeroed_box()
}

It could also be used to make the original solution safe

const NC: usize = 2_000_000;
const NR: usize = 3;

pub fn foo1() -> Box<[[f64; NR]; NC]> {
    let mat: Box<[f64; NR * NC]> =
        vec![0.0; NR * NC]
        .into_boxed_slice()
        .try_into()
        .unwrap();

    bytemuck::allocation::cast_box(mat)
}
5 Likes

Well, is that suprising? A very large fixed-size array should cause a stack-overflow. That's the case even in C.

1 Like

It's boxed immediately, so I'm guessing OP expects (not unreasonably) that the large stack allocation be omitted, and the zeroed data be placed directly in the heap allocation, without any intermediate copies.

3 Likes

Thank you, you're right steffahn. In the stdlib we now have unusual functions like slice::as_chunks_unchecked_mut, meant for similar usages, but they still aren't enough. Something like a safe array resizing and safe box (like that cast_box) resizing could be added :slight_smile:. They could be called array::as_chunks_mut and Box::as_chunks_mut, or similar.

Here's a still unsafe but type-annotated version containing soundness checks, which are optimized away completely:

use std::mem::{ size_of, size_of_val, align_of, align_of_val };

const NR: usize = 2_000_000;
const NC: usize = 3;

type Item = f64;
type Vector = Vec<Item>;
type Array = [[Item; NR]; NC];

pub fn get_zeros() -> Box<Array> {
    let vector: Vector = vec![0.0; NR * NC];
    let vector_size = size_of_val(&*vector);
    let vector_align = align_of_val(&*vector);
    let target_size = size_of::<Array>();
    let target_align = align_of::<Array>();

    assert_eq!(vector_size, target_size);
    assert_eq!(vector_align, target_align);

    let ptr = Box::into_raw(vector.into_boxed_slice());
    
    unsafe { Box::from_raw(ptr as *mut Array) }
}

In Rust 1.56 table, his compiles to the following (using -C opt-level=3 -C target-cpu=native):

example::get_zeros:
        push    rax
        mov     edi, 48000000
        mov     esi, 8
        call    qword ptr [rip + __rust_alloc_zeroed@GOTPCREL]
        test    rax, rax
        je      .LBB0_1
        pop     rcx
        ret
.LBB0_1:
        mov     edi, 48000000
        mov     esi, 8
        call    qword ptr [rip + alloc::alloc::handle_alloc_error@GOTPCREL]
        ud2
4 Likes

If you are on nightly and want to use new_uninit and allocator_api feature:

#![feature(new_uninit, allocator_api)]

const NR: usize = 2_000_000;
const NC: usize = 3;

unsafe trait ValidZero : Sized {
    fn zeroed_box_matrix<const ROWS: usize, const COLS: usize>() -> Box<[[Self; ROWS]; COLS]> {
        let data = Box::new_zeroed();
        unsafe {
            // Safety: zeroed is a valid Self
            data.assume_init()
        }
    }
}

// Safety: zero is a valid f64
unsafe impl ValidZero for f64 {}


pub fn foo1() -> Box<[[f64; NR]; NC]> {
    f64::zeroed_box_matrix()
}
1 Like

Bruecki your solution too leads to an efficient asm :slight_smile:
Now I think a simple and safe function like bytemuck cast_box should be in the stdlib.
(I've marked one answer of this thread as "solution", but I've appreciated all the answers and ideas).

I suppose https://github.com/rust-lang/project-safe-transmute/blob/master/rfcs/0000-safe-transmute.md is a related effort.

2 Likes

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.