To allocate a 2D matrix with very long rows, I ended up writing a function like this:
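use std::convert::TryInto;

const NC: usize = 2_000_000;
const NR: usize = 3;

fn foo1() -> Box<[[f64; NC]; NR]> {
    // One flat, zero-filled heap buffer of NR * NC elements.
    let mat: Box<[f64; NR * NC]> =
        vec![0.0; NR * NC]
            .into_boxed_slice()
            .try_into()
            .unwrap();
    // Reinterpret the flat array as NR rows of NC columns (same size and alignment).
    unsafe { std::mem::transmute(mat) }
}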
With optimizations enabled, this currently compiles down to good-enough asm:
foo1:
    push    rax
    mov     edi, 48000000
    mov     esi, 8
    call    qword ptr [rip + __rust_alloc_zeroed@GOTPCREL]
    test    rax, rax
    je      .LBB0_1
    pop     rcx
    ret
.LBB0_1:
    mov     edi, 48000000
    mov     esi, 8
    call    qword ptr [rip + alloc::alloc::handle_alloc_error@GOTPCREL]
    ud2
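For comparison, here is a sketch of a const-generic variant of foo1. The name `zeroed_matrix_cast` is hypothetical. It avoids `transmute` by casting the raw pointer from `Box::into_raw` and handing it back to `Box::from_raw`. It still needs one `unsafe` block, but the unsafety is limited to a pointer cast between layouts with identical size and alignment. It also never has to write `R * C` inside a type, which with generic `R` and `C` would need the unstable `generic_const_exprs` feature. That `vec![0.0; n]` lowers to a single `__rust_alloc_zeroed` call is an assumption based on the asm above, not something this sketch guarantees.

```rust
/// Hypothetical helper: a const-generic take on `foo1` that swaps `transmute`
/// for a raw-pointer cast.
fn zeroed_matrix_cast<const R: usize, const C: usize>() -> Box<[[f64; C]; R]> {
    // One flat, zero-filled heap buffer holding R * C elements.
    let flat: Box<[f64]> = vec![0.0_f64; R * C].into_boxed_slice();
    // Drop the slice-length metadata and reinterpret the same allocation.
    let raw = Box::into_raw(flat) as *mut [[f64; C]; R];
    // SAFETY: the allocation came from the global allocator, holds exactly
    // R * C initialized f64 values and is f64-aligned, so its layout matches
    // [[f64; C]; R] and Box::from_raw may take ownership of it.
    unsafe { Box::from_raw(raw) }
}
```

Usage would be `let m = zeroed_matrix_cast::<NR, NC>();`, which keeps both dimensions in the type just like foo1.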
I've tried numerous other solutions, including the basic ones:
vec![[0_f64; NC]; NR]
box [[0_f64; NC]; NR]
And so on. But they either cause a stack overflow in debug builds, or they don't call __rust_alloc_zeroed (instead they allocate the buffer and zero it afterwards, which costs some run time).
I think well-working dynamic arrays are important in a systems language, so I hope these two problems (a stack overflow for something that should be purely heap-allocated, and a zero-filled 2D matrix not being allocated with a single __rust_alloc_zeroed call) will be fixed.
In the meantime, do you know of a less unsafe and/or simpler solution than foo1?
(I prefer to keep compile-time knowledge of both the number of rows and the number of columns, because it helps the compiler elide more array bounds checks later in the code.)
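Another option, sketched here under the assumption that calling the allocator API directly is acceptable, is to request the zeroed block yourself with `std::alloc::alloc_zeroed` and wrap it in a `Box`. The helper name `zeroed_matrix_alloc` is hypothetical. This is not less unsafe than foo1, but it calls the zeroed allocation explicitly in every build profile, never builds a temporary array on the stack, and keeps both dimensions as compile-time constants.

```rust
use std::alloc::{alloc_zeroed, handle_alloc_error, Layout};
use std::ptr::NonNull;

/// Hypothetical helper: heap-allocates a zero-filled R x C matrix of f64 by
/// asking the global allocator for zeroed memory directly.
fn zeroed_matrix_alloc<const R: usize, const C: usize>() -> Box<[[f64; C]; R]> {
    let layout = Layout::new::<[[f64; C]; R]>();
    if layout.size() == 0 {
        // alloc_zeroed must not be called with a zero-sized layout.
        let ptr = NonNull::<[[f64; C]; R]>::dangling().as_ptr();
        // SAFETY: for a zero-sized type, a dangling but well-aligned pointer
        // is a valid Box, and dropping it deallocates nothing.
        return unsafe { Box::from_raw(ptr) };
    }
    // SAFETY: the layout has a non-zero size.
    let ptr = unsafe { alloc_zeroed(layout) } as *mut [[f64; C]; R];
    if ptr.is_null() {
        handle_alloc_error(layout);
    }
    // SAFETY: ptr was allocated by the global allocator with exactly this
    // type's layout, and the all-zero bit pattern is 0.0 for every f64.
    unsafe { Box::from_raw(ptr) }
}
```

The all-zero argument in the last `SAFETY` comment is why this helper is written for `f64` only; a generic version would be unsound for types where all-zero bytes are not a valid value.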
If I am not mistaken, don't these create matrices with long columns?
Also, when we do have long rows and short columns, as in the code here, I don't see any stack overflow with cargo run. The thread::sleep() is there to give me time to check htop. As expected, about 2.5 GB is associated with the process, which I assume is almost entirely on the heap.
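The program referred to above isn't reproduced here. As an assumption about why this shape is fine, here is a minimal sketch where the fixed-size inner array is short: `vec!` only needs the small element value on the stack and fills the heap buffer from it, so no large temporary is created. With a 2_000_000-element inner array, as in the original question, that same element would be 16 MB on the stack. The helper name `tall_matrix` is hypothetical.

```rust
/// Hypothetical helper: `n` rows of 3 columns each. Only the 24-byte element
/// `[0.0; 3]` is built on the stack; `vec!` fills all `n` slots of a single
/// heap buffer from it.
fn tall_matrix(n: usize) -> Vec<[f64; 3]> {
    vec![[0.0_f64; 3]; n]
}
```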
Right, the problem happens for me (rustc 1.58.0-nightly 2021-11-04, x86_64-pc-windows-gnu) with long rows (but the rows must be fixed-size arrays, of course).
It's boxed immediately, so I'm guessing OP expects (not unreasonably) that the large stack allocation be omitted, and the zeroed data be placed directly in the heap allocation, without any intermediate copies.
Thank you, steffahn, you're right. The stdlib now has unusual functions like slice::as_chunks_unchecked_mut, meant for similar uses, but they still aren't enough. Something like safe array resizing and safe Box resizing (like that cast_box) could be added. They could be called array::as_chunks_mut and Box::as_chunks_mut, or similar.
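As a rough illustration of the idea, the row view such a function would provide can already be approximated with safe, stable std APIs over a flat buffer: `chunks_exact_mut` yields slices of exactly `C` elements, and `TryFrom` turns each one into a fixed-size array reference. The name `rows_mut` is hypothetical and not a std API. Unlike a true reshaping of the Box, this only gives per-row views, not a single `&mut [[f64; C]; R]`.

```rust
use std::convert::TryFrom;

/// Hypothetical helper: views a flat slice as mutable rows of exactly C
/// elements. Any trailing remainder shorter than C is skipped, and C == 0
/// panics (a chunks_exact_mut requirement).
fn rows_mut<const C: usize>(flat: &mut [f64]) -> impl Iterator<Item = &mut [f64; C]> + '_ {
    flat.chunks_exact_mut(C)
        // Each chunk has length exactly C, so this conversion cannot fail.
        .map(|chunk| <&mut [f64; C]>::try_from(chunk).unwrap())
}
```

For example, `for row in rows_mut::<4>(&mut flat) { row[3] = 1.0; }` writes the last column of every row, with the column count known at compile time.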
Bruecki, your solution also produces efficient asm.
Now I think a simple, safe function like bytemuck's cast_box should be in the stdlib.
(I've marked one answer of this thread as "solution", but I've appreciated all the answers and ideas).