So, as mentioned in the title, I have run into a stack overflow when trying to instantiate an array of 1M f64 elements on my laptop, which has a Ryzen 4700U processor and 16 GB of RAM. I'm using the following Rust version:
Unless I'm a complete idiot, 1M 64-bit elements should take up roughly 7.63 MiB of memory, so I don't see why the Rust code below shouldn't compile into a functioning executable.
Here is sample code:
use std::time::Instant;
use std::iter::successors;

fn iterator_array_fill<F, T, const N: usize>(a: &mut [T; N], a0: &T, f: F)
where
    T: Copy,
    F: Fn(T) -> T,
{
    let mut i: usize = 0;
    for value in successors(Some(*a0), |x| Some(f(*x))).take(N) {
        a[i] = value;
        i += 1;
    }
}

fn main() {
    const N: usize = 1_000_000;
    let a0: f64 = 0.5;
    let before = Instant::now();
    let mut a = [0_f64; N];
    iterator_array_fill(&mut a, &a0, |x| x * (1.0 - x));
    let after = Instant::now();
    println!("{:?}", a[N - 1]);
    println!("time elapsed: {:?}", after.duration_since(before));
}
And the error I get when running the above code is:
thread 'main' has overflowed its stack
Does anyone have any idea what is happening here?
Even on machines with large amounts of memory, the stack region is often much more constrained. Allocating megabytes of memory on the stack - potentially multiple times, if you're passing arrays around by value - is one of the more reliable ways to exhaust it and crash your program.
There's nothing unreasonable about a megabytes-scale array, though, and Rust is more than able to handle that on much more modest hardware than yours. The trick is: don't put the array itself on the stack. A Vec is nothing more complicated than an array, but its buffer lives on the heap, separately from the stack frame.
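For instance, a minimal sketch of the heap-backed version (same logistic-map fill as the original post, just with a Vec instead of the stack array):

```rust
fn main() {
    const N: usize = 1_000_000;
    // The 8 MB buffer lives on the heap; only the Vec's pointer, length,
    // and capacity (three machine words) sit on the stack.
    let mut a = vec![0.0_f64; N];
    a[0] = 0.5;
    for n in 1..N {
        a[n] = a[n - 1] * (1.0 - a[n - 1]);
    }
    println!("{:?}", a[N - 1]);
}
```

Everything else about the program stays the same; indexing and iteration work identically on a Vec and an array.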
You may also be able to adjust the stack area's size or layout, using platform-specific tools (ulimit and friends on unix-alikes, for example).
std::thread::Builder::new()
    .stack_size(/* put the size you want here, or set the RUST_MIN_STACK environment variable */)
    .spawn(|| {
        // your code here
    })
    .unwrap()
    .join()
    .unwrap();
Note that RUST_MIN_STACK does not work for the main thread.
Thank you all for your help. I wasn't aware that the main thread was limited in how much stack memory was available.
Now, since @Neutron3529 was the only one who suggested a solution that didn't involve giving up on stack-allocated arrays, I've marked his reply as the solution. Here's an updated code example that incorporates it, for those who might want a complete example:
use std::time::Instant;
use std::mem::size_of;

fn classic_array_fill<F, T, const N: usize>(a: &mut [T; N], a0: &T, f: F)
where
    T: Copy,
    F: Fn(T) -> T,
{
    a[0] = *a0;
    for n in 1..N {
        a[n] = f(a[n - 1]);
    }
}

fn main() {
    const N: usize = 1_000_000;
    std::thread::Builder::new()
        .stack_size(size_of::<f64>() * N)
        .spawn(|| {
            let a0: f64 = 0.5;
            let before = Instant::now();
            let mut a = [0_f64; N];
            classic_array_fill(&mut a, &a0, |x| x * (1.0 - x));
            let after = Instant::now();
            println!("{:?}", a[N - 1]);
            println!("time elapsed: {:?}", after.duration_since(before));
        })
        .unwrap()
        .join()
        .unwrap();
}
Although I am really puzzled by the overly verbose .unwrap().join().unwrap() that terminates the thread. Is it really necessary?
That said, threads add a lot of overhead, so @scottmcm's solution is the more performant one in the present context. I've tested a few different implementations against each other with a very rudimentary benchmark: https://play.rust-lang.org/?version=stable&mode=release&edition=2021&gist=d22b54cdd23f618dfdcc6fd519063ba5
The numbers you can see in the playground seem to suggest that the "zero cost abstractions" promise does hold (no real difference between for loops and iterators), but I've noticed some systematic differences on my own laptop suggesting that the performance crown goes to the traditional for loop over a pre-allocated Vec (classic_list_fill from the playground example).
They're necessary if main returns (), at least. main can be defined to return a Result<T, E> type, instead, though:
fn main() -> std::io::Result<()> {
    const N: usize = 1_000_000;
    std::thread::Builder::new()
        .stack_size(size_of::<f64>() * N)
        .spawn(|| {
            let a0: f64 = 0.5;
            let before = Instant::now();
            let mut a = [0_f64; N];
            classic_array_fill(&mut a, &a0, |x| x * (1.0 - x));
            let after = Instant::now();
            println!("{:?}", a[N - 1]);
            println!("time elapsed: {:?}", after.duration_since(before));
        })?
        .join()
        .unwrap();
    Ok(())
}
Note that I kept the second unwrap.
Given this change, then main will return with an error immediately if spawning a thread returns Err(e) instead of Ok(someJoinHandle). This is the same behaviour, but implemented by short-circuiting the function body and returning a result, instead of by panicking on the spot. That composes better - if you're burying this deep in a helper function somewhere, then panicking there is likely to surprise someone way down the line, whereas returning a result value signals explicitly that the operation can fail (if there isn't enough memory or enough process handles left to start a thread, for example).
I've kept the second unwrap(), because the only reason join() will return an error value is if the thread being joined has panicked. Using unwrap() here propagates that panic into the current thread. You could replace it with ? as well, if you prefer turning that panic into a normal error, or you can even drop the panic on the floor if you prefer, but I tend to believe that having panics propagate into the joining thread is more consistent.
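A quick illustration of that (my own minimal example, not from the benchmark code): join() returns Ok with the thread's result when it finishes normally, and Err only when the thread panicked, so unwrap() there re-raises the panic in the joining thread:

```rust
fn main() {
    // A thread that completes normally: join() returns Ok(value).
    let ok = std::thread::spawn(|| 40 + 2).join();
    assert_eq!(ok.unwrap(), 42);

    // A thread that panics: join() returns Err carrying the panic payload.
    // (The panic message is still printed to stderr by the default hook.)
    let err: std::thread::Result<()> = std::thread::spawn(|| panic!("boom")).join();
    assert!(err.is_err());
}
```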
I'm glad! It can sometimes even be more performant, too, since it avoids indexing -- if LLVM manages to remove the bounds check then it's probably about the same (and for the array it probably would) but if for some reason it can't in a particular situation, the zip might end up faster.
Ah, it looks like LLVM doesn't manage to unroll the zip version for your specific case the way it does for the for loop: https://rust.godbolt.org/z/YzE6Yvbxb
zip is, unfortunately, a bit less reliable at being zero-cost because it has two exit conditions. I wonder if the problem is that successors isn't random-access, and thus it's not getting the specializations that often help it do better.
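For reference, the zip-based shape under discussion would look roughly like this (my reconstruction, not the exact playground code; note the two exit conditions, one per zipped iterator):

```rust
use std::iter::successors;

fn main() {
    const N: usize = 1_000;
    let mut a = [0.0_f64; N];
    let seq = successors(Some(0.5_f64), |x| Some(x * (1.0 - x)));
    // No manual index and no bounds checks: zip stops as soon as
    // either the slice iterator or the sequence runs out.
    for (slot, value) in a.iter_mut().zip(seq) {
        *slot = value;
    }
    println!("{:?}", a[N - 1]);
}
```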
It's not only allocating a large array that can cause a stack overflow; deeply recursive function calls can too. That's why I suggested spawning a thread: it achieves the goal while avoiding the "thread 'main' has overflowed its stack" error.
If you're only allocating a large array, maybe

let mut a = unsafe {
    // SAFETY: f64 is Copy and dropping an f64 is a no-op
    Box::<[f64; N]>::new_uninit().assume_init()
};
iterator_array_fill(&mut a, &a0, |x| x * (1.0 - x)); // fill in the values, so a ends up initialized

is better in your case, since it keeps the const generic type [f64; N] rather than erasing it to a slice [f64], and a Vec would cost extra length and capacity fields on top of the pointer.
I wouldn't bother with the uninitialized dance, here. vec![0.0; N].into_boxed_slice() will call alloc_zeroed, which will have minimal if any additional cost: https://rust.godbolt.org/z/Gd1j9M63T
Or, of course, you could just make the Vec handle the uninitialized memory stuff itself:
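For example, a sketch of that approach, collecting the iterator straight into a Vec so the allocation and initialization happen in one pass:

```rust
use std::iter::successors;

fn main() {
    const N: usize = 1_000_000;
    // collect() allocates and initializes the heap buffer itself; there's
    // no zeroed pre-pass and no unsafe uninitialized-memory handling.
    let a: Vec<f64> = successors(Some(0.5_f64), |x| Some(x * (1.0 - x)))
        .take(N)
        .collect();
    println!("{:?}", a[N - 1]);
}
```

One caveat: successors reports no useful size hint, so the Vec may reallocate as it grows; .take(N) doesn't tell collect() the final length up front.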
That's the bizarre thing here. The code I've put on the playground does include this as the last example, and there its performance is pretty much the same as filling a preexisting vector, but on my laptop this version takes about twice as long as the preexisting-vector one.
Example output from playground:
Hello, world!
classic fill array
result 9.999983113938067e-8
time elapsed inside thread: 54.439552ms
time elapsed including thread: 68.632508ms
classic create array
result 9.999983113938067e-8
time elapsed inside thread: 57.967211ms
time elapsed including thread: 72.558175ms
classic fill heap list
result 9.999983113938067e-8
time elapsed: 45.761678ms
classic create heap list
result 9.999983113938067e-8
time elapsed: 88.418508ms
iterator fill array
result 9.999983113938067e-8
time elapsed inside thread: 40.332828ms
time elapsed including thread: 51.346433ms
iterator fill heap list
result 9.999983113938067e-8
time elapsed: 35.609803ms
iterator create heap list
result 9.999983113938067e-8
time elapsed: 39.002958ms
And an example output from my laptop:
classic fill array
result 9.999983113938067e-8
time elapsed inside thread: 17.2647ms
time elapsed including thread: 59.4062ms
classic create array
result 9.999983113938067e-8
time elapsed inside thread: 20.1477ms
time elapsed including thread: 66.0325ms
classic fill heap list
result 9.999983113938067e-8
time elapsed: 17.9816ms
classic create heap list
result 9.999983113938067e-8
time elapsed: 68.4181ms
iterator fill array
result 9.999983113938067e-8
time elapsed inside thread: 16.3344ms
time elapsed including thread: 58.0216ms
iterator fill heap list
result 9.999983113938067e-8
time elapsed: 18.7497ms
iterator create heap list
result 9.999983113938067e-8
time elapsed: 42.4078ms
It's valid to use new_uninit().assume_init() for a Box<[MaybeUninit<T>; N]>, since no bytes of a MaybeUninit<T> have to be initialized. You are correct, though, that it would be disallowed for Box<[f64; N]>, since all bytes of an f64 must be initialized.
It depends how you write it. If you use assume_init_mut, then it might be fine, at least according to Miri on the playground. Probably because producing the &mut [f64] doesn't seem to assert a validity invariant for the values behind the pointer, and the code never reads the uninit values. It only assigns to them, which doesn't involve a read because f64 is Copy.
But even if it's valid, I do still think it's a bad way to write it. Passing a slice of uninit values to safe code without having the type-level MaybeUninit marker is just always sketchy.
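A sketch of what that type-level version could look like (my own illustration of the pattern, assuming a toolchain where Box::new_uninit is stable, i.e. Rust 1.82+; the final transmute is one way to do the conversion, not the only one):

```rust
use std::mem::MaybeUninit;

fn main() {
    const N: usize = 1_000_000;
    // Box<[MaybeUninit<f64>; N]>: assume_init() is sound here because no
    // bytes of a MaybeUninit<f64> have to be initialized.
    let mut a: Box<[MaybeUninit<f64>; N]> =
        unsafe { Box::new_uninit().assume_init() };

    let mut x = 0.5_f64;
    for slot in a.iter_mut() {
        slot.write(x); // write() never reads the uninitialized contents
        x = x * (1.0 - x);
    }

    // SAFETY: every element was just written, so the whole array is
    // initialized, and the two box types have identical layouts.
    let a: Box<[f64; N]> = unsafe { std::mem::transmute(a) };
    println!("{:?}", a[N - 1]);
}
```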
No, the source of alloc_zeroed shows that an additional write exists:
#[stable(feature = "global_alloc", since = "1.28.0")]
unsafe fn alloc_zeroed(&self, layout: Layout) -> *mut u8 {
    let size = layout.size();
    // SAFETY: the safety contract for `alloc` must be upheld by the caller.
    let ptr = unsafe { self.alloc(layout) };
    if !ptr.is_null() {
        // SAFETY: as allocation succeeded, the region from `ptr`
        // of size `size` is guaranteed to be valid for writes.
        unsafe { ptr::write_bytes(ptr, 0, size) };
    }
    ptr
}
This isn't necessarily about Rust, but I was always under the impression that the NumPy library for Python and the Julia language work by wrapping C arrays behind a high-level interface, and that those C arrays were stack allocated. Does anyone know if that's simply a misunderstanding on my part, or whether they use other tricks comparable to some of the solutions presented here, like spawning a thread with a bigger stack?
I'm asking because, if I understand correctly and stack-space limitations make working with arrays impractical from a performance standpoint, why wouldn't people who write mathematical libraries base their algorithms around iterators instead, which don't carry the overhead that things like threads do?
(And just to rule out any confusion: are iterators stack or heap allocated?)