What can I do to avoid "thread 'main' has overflowed its stack" when working with large arrays

Hello everyone!

So, as mentioned in the title, I have encountered a Stack Overflow!!! :partying_face: when trying to instantiate an array of 1M f64 elements on my laptop, equipped with a Ryzen 4700U processor and 16GB of RAM. I use the following Rust version:

> rustc --version --verbose     
rustc 1.61.0 (fe5b13d68 2022-05-18)
binary: rustc
commit-hash: fe5b13d681f25ee6474be29d748c65adcd91f69e
commit-date: 2022-05-18
host: x86_64-pc-windows-msvc
release: 1.61.0
LLVM version: 14.0.0

Unless I'm a complete idiot, 1M 64-bit elements should take up roughly 7.63 MB in memory, so I don't see why the Rust code below shouldn't compile into a functioning executable.

Here is a sample code:

use std::time::Instant;
use std::iter::successors;

fn iterator_array_fill<F, T, const N: usize>(a: &mut [T; N], a0: &T, f: F)
where
    T: Copy,
    F: Fn(T) -> T {
    let mut i: usize = 0;
    for value in successors(
        Some(*a0),
        |x| Some(f(*x))
    ).take(N) {
        a[i] = value;
        i += 1;
    }
}

fn main() {
    const N: usize = 1_000_000;
    let a0: f64 = 0.5;

    let before = Instant::now();
    let mut a = [0_f64; N];
    iterator_array_fill(&mut a, &a0, |x| x * (1.0 - x));
    let after = Instant::now();

    println!("{:?}", a[N-1]);
    println!("time elapsed: {:?}", after.duration_since(before));
}

And the error I get when running the above code is:

thread 'main' has overflowed its stack

Does anyone have any idea what is happening here? :sweat_smile:

Even on machines with large amounts of memory, the stack region is often much more constrained. Allocating megabytes of memory on the stack - potentially multiple times, if you're passing arrays around by value - is one of the more reliable ways to exhaust it and crash your program.

There's nothing unreasonable about a megabytes-scale array, though, and Rust is more than able to handle that on much more modest hardware than you have. The trick is, don't put the array itself on the stack. A Vec is nothing more complicated than an array, but its storage is allocated separately, on the heap.

You may also be able to adjust the stack area's size or layout, using platform-specific tools (ulimit and friends on unix-alikes, for example).
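For example, on Linux with a bash-like shell (a sketch; exact defaults vary by system):

$ ulimit -s        # print the current stack limit in KiB (commonly 8192, i.e. 8 MiB)
$ ulimit -s 16384  # raise it to 16 MiB for programs started from this shell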


Use a thread:

std::thread::Builder::new()
    .stack_size(/* put the size you want here, or set the environment variable RUST_MIN_STACK instead */)
    .spawn(|| {
        // your code here
    })
    .unwrap()
    .join()
    .unwrap();

Note that RUST_MIN_STACK does not work for the main thread:

fn overflow() {
    let v = [0u8; 16 * 1024 * 1024];
    dbg!(&v[0..2]);
}

fn main() {
    std::thread::Builder::new().spawn(overflow).unwrap().join().unwrap();
    overflow();
}
$ RUST_MIN_STACK=33554432 ./test
[test.rs:3] &v[0..2] = [
    0,
    0,
]

thread 'main' has overflowed its stack
fatal runtime error: stack overflow

Well, on Windows, for example,

The default stack reservation size used by the linker is 1 MB.
~ https://docs.microsoft.com/en-us/windows/win32/procthread/thread-stack-size

So it doesn't seem too surprising to me that you'd have trouble.

Once you're over, say, 1024 bytes, you're usually better off just allocating. And over 65,000 bytes or so, you should essentially always be allocating.

So change your array to a vector:

     let before = Instant::now();
-    let mut a = [0_f64; N];
+    let mut a = vec![0_f64; N];
     iterator_array_fill(&mut a, &a0, |x| x * (1.0 - x));

And update iterator_array_fill to be more general by taking a &mut [T] instead of a &mut [T; N], and you'll be good to go: https://play.rust-lang.org/?version=stable&mode=release&edition=2021&gist=04d9550c0d110eca2e82bdd5261626f2

(I also took the opportunity to replace your indexing with a zip, for fewer bounds checks and no need for the .take(N).)
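A sketch of what that generalization might look like, reconstructed from the description (the playground link has the exact version; the name iterator_slice_fill is illustrative):

fn iterator_slice_fill<F, T>(a: &mut [T], a0: &T, f: F)
where
    T: Copy,
    F: Fn(T) -> T,
{
    // Zipping against the slice means iteration stops when the slice is full,
    // so there's no index bookkeeping and no separate .take(N).
    for (slot, value) in a
        .iter_mut()
        .zip(std::iter::successors(Some(*a0), |x| Some(f(*x))))
    {
        *slot = value;
    }
}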


Thank you all for your help. I wasn't aware that the main thread was limited in how much stack memory was available.

Now, since @Neutron3529 was the only one who suggested a solution that didn't involve giving up on stack-allocated arrays, I've marked his reply as the solution. Here's an updated code example that includes his solution, for those who might want a complete one:

use std::time::Instant;
use std::mem::size_of;

fn classic_array_fill<F, T, const N: usize>(a: &mut [T; N], a0: &T, f: F)
where
    T: Copy,
    F: Fn(T) -> T {
    a[0] = *a0;
    for n in 1..N {
        a[n] = f(a[n-1]);
    }
}

fn main() {
    const N: usize = 1_000_000;

    std::thread::Builder::new()
        .stack_size(size_of::<f64>() * N)
        .spawn(||{
            // your code here
            let a0: f64 = 0.5;

            let before = Instant::now();
            let mut a = [0_f64; N];
            classic_array_fill(&mut a, &a0, |x| x * (1.0 - x));
            let after = Instant::now();

            println!("{:?}", a[N-1]);
            println!("time elapsed: {:?}", after.duration_since(before));
        }).unwrap().join().unwrap();
}

Although I am really puzzled by the overly verbose .unwrap().join().unwrap() that terminates the thread. Is it really necessary?

That said, threads add a lot of overhead, so @scottmcm's solution is the more performant one in the present context. I've tested a few different implementations against each other with a very rudimentary benchmark:
https://play.rust-lang.org/?version=stable&mode=release&edition=2021&gist=d22b54cdd23f618dfdcc6fd519063ba5
The numbers you can see in the playground seem to suggest that the "zero cost abstractions" promise does hold (no real difference between for loops and iterators), but I've noticed some systematic differences on my own laptop suggesting that the performance crown goes to the traditional for loop over a pre-allocated Vec (classic_list_fill in the playground example).


P.S.

I just love this example. It's so elegant. I never thought of using zip for this purpose.

They're necessary if main returns (), at least. main can be defined to return a Result<T, E> type instead, though:

fn main() -> std::io::Result<()> {
    const N: usize = 1_000_000;

    std::thread::Builder::new()
        .stack_size(size_of::<f64>() * N)
        .spawn(||{
            // your code here
            let a0: f64 = 0.5;

            let before = Instant::now();
            let mut a = [0_f64; N];
            classic_array_fill(&mut a, &a0, |x| x * (1.0 - x));
            let after = Instant::now();

            println!("{:?}", a[N-1]);
            println!("time elapsed: {:?}", after.duration_since(before));
        })?.join().unwrap();

    Ok(())
}

Note that I kept the second unwrap.

Given this change, main will return with an error immediately if spawning a thread returns Err(e) instead of Ok(someJoinHandle). This is the same behaviour, but implemented by short-circuiting the function body and returning a result, instead of by panicking on the spot. That composes better - if you're burying this deep in a helper function somewhere, then panicking there is likely to surprise someone way down the line, whereas returning a result value signals explicitly that the operation can fail (if there isn't enough memory or enough process handles left to start a thread, for example).

I've kept the second unwrap(), because the only reason join() will return an error value is if the thread being joined has panicked. Using unwrap() here propagates that panic into the current thread. You could replace it with ? as well, if you prefer turning that panic into a normal error, or you can even drop the panic on the floor if you prefer, but I tend to believe that having panics propagate into the joining thread is more consistent.
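To make that concrete, here's a sketch of explicitly inspecting join's result (the stack size and panic message are illustrative):

use std::thread;

fn main() {
    let handle = thread::Builder::new()
        .stack_size(8 * 1024 * 1024)
        .spawn(|| {
            panic!("boom");
        })
        .expect("failed to spawn thread");

    // join() returns Err only if the spawned thread panicked; the payload is a
    // Box<dyn Any + Send>, which for a panic!("...") message holds a &str.
    match handle.join() {
        Ok(()) => println!("thread finished normally"),
        Err(payload) => match payload.downcast_ref::<&str>() {
            Some(msg) => println!("thread panicked: {}", msg),
            None => println!("thread panicked with a non-string payload"),
        },
    }
}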

Btw, you can increase the main thread stack size by setting a size instead of falling back to the linker default. E.g., using a build script:

// build.rs
use std::env;

fn main() {
    if env::var("CARGO_CFG_TARGET_ENV").as_deref() == Some("msvc") {
        println!("cargo:rustc-link-arg=/stack:{}", 8 * 1024 * 1024);
    }
}

Nice! I didn't know cargo had such a feature. I need to remember this one.

I'm glad! It can sometimes even be more performant, too, since it avoids indexing -- if LLVM manages to remove the bounds check then it's probably about the same (and for the array it probably would) but if for some reason it can't in a particular situation, the zip might end up faster.

Unfortunately, it ends up being slightly slower on my laptop than indexing with a for loop. I don't know why that is.

Ah, it looks like LLVM doesn't manage to unroll the zip version for your specific case the way it does for the for loop: https://rust.godbolt.org/z/YzE6Yvbxb

zip is, unfortunately, a bit less reliable at being zero-cost because it has two exit conditions. I wonder if the problem is that successors isn't random-access, and thus it's not getting the specializations that often help it do better.

Sorry for not reading your code.

It's not only allocating a large array that can cause a stack overflow; recursive function calls can too. That's why I suggested spawning a thread: it avoids "thread 'main' has overflowed its stack" in both cases.
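For instance, a sketch of the recursive case (std::hint::black_box, stabilized in a later Rust version than the one in this thread, just stops the optimizer from turning the recursion into a loop; the depth is illustrative):

fn recurse(n: u64) -> u64 {
    if n == 0 {
        0
    } else {
        // The opaque black_box call keeps this from being optimized into a
        // loop, so each call really does consume a stack frame.
        std::hint::black_box(1 + recurse(n - 1))
    }
}

fn main() {
    // Deep enough recursion overflows the stack with no large locals at all.
    println!("{}", recurse(10_000_000));
}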

If you're only allocating a large array, maybe

let mut a = Box::<[f64; N]>::new_uninit();
unsafe {
    a.assume_init(); // SAFETY: f64 is Copy and dropping an f64 is a no-op
    iterator_array_fill(&mut a, &a0, |x| x * (1.0 - x)); // fill the values, so a is initialized
}

is better in your case, since it keeps the generics ([f64; N] rather than changing it to a slice [f64]), and it avoids the extra bookkeeping fields a Vec carries on top of the pointer.

I wouldn't bother with the uninitialized dance here. vec![0.0; N].into_boxed_slice() will call alloc_zeroed, which will have minimal if any additional cost: https://rust.godbolt.org/z/Gd1j9M63T
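In context, that might look like this (a sketch using the OP's recurrence):

const N: usize = 1_000_000;

fn main() {
    // One heap allocation, zero-initialized by the allocator.
    let mut a: Box<[f64]> = vec![0.0_f64; N].into_boxed_slice();

    a[0] = 0.5;
    for n in 1..N {
        a[n] = a[n - 1] * (1.0 - a[n - 1]);
    }

    println!("{:?}", a[N - 1]);
}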

Or, of course, you could just make the Vec handle the uninitialized memory stuff itself:

std::iter::successors(Some(*a0), |x| Some(f(*x))).take(N).collect()

rather than bothering to write things as a fill.
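As a complete function, that might look like the following (the name iterator_create_list is illustrative, echoing the playground's naming):

use std::iter::successors;

fn iterator_create_list<F, T>(a0: T, f: F, n: usize) -> Vec<T>
where
    T: Copy,
    F: Fn(T) -> T,
{
    // collect() grows the Vec as needed and writes each element exactly once,
    // so there's no zero-fill pass and no uninitialized reads in safe code.
    successors(Some(a0), |x| Some(f(*x))).take(n).collect()
}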


That's the bizarre thing here. The code I've put on the playground does include this as the last example, and the performance is pretty much the same between it and filling a preexisting vector, but on my laptop this line takes about twice as long as filling a preexisting vector.

Example output from playground:

Hello, world!

classic fill array
result 9.999983113938067e-8
time elapsed inside thread: 54.439552ms
time elapsed including thread: 68.632508ms

classic create array
result 9.999983113938067e-8
time elapsed inside thread: 57.967211ms
time elapsed including thread: 72.558175ms

classic fill heap list
result 9.999983113938067e-8
time elapsed: 45.761678ms

classic create heap list
result 9.999983113938067e-8
time elapsed: 88.418508ms

iterator fill array
result 9.999983113938067e-8
time elapsed inside thread: 40.332828ms
time elapsed including thread: 51.346433ms

iterator fill heap list
result 9.999983113938067e-8
time elapsed: 35.609803ms

iterator create heap list
result 9.999983113938067e-8
time elapsed: 39.002958ms

And an example output from my laptop:

classic fill array
result 9.999983113938067e-8
time elapsed inside thread: 17.2647ms
time elapsed including thread: 59.4062ms

classic create array
result 9.999983113938067e-8
time elapsed inside thread: 20.1477ms
time elapsed including thread: 66.0325ms

classic fill heap list
result 9.999983113938067e-8
time elapsed: 17.9816ms

classic create heap list
result 9.999983113938067e-8
time elapsed: 68.4181ms

iterator fill array
result 9.999983113938067e-8
time elapsed inside thread: 16.3344ms
time elapsed including thread: 58.0216ms

iterator fill heap list
result 9.999983113938067e-8
time elapsed: 18.7497ms

iterator create heap list
result 9.999983113938067e-8
time elapsed: 42.4078ms

Both were run in "release" mode.

new_uninit().assume_init() is always UB, isn't it? I don't think this is valid.


It's valid to use new_uninit().assume_init() for a Box<[MaybeUninit<T>; N]>, since no bytes of a MaybeUninit<T> have to be initialized. You are correct, though, that it would be disallowed for Box<[f64; N]>, since all bytes of an f64 must be initialized.
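A sketch of that MaybeUninit-preserving route (Box::new_uninit was nightly-only at the time of this thread; the final conversion is a raw-pointer cast because there's no safe one for this):

use std::mem::MaybeUninit;

const N: usize = 1_000_000;

fn main() {
    // Sound: every byte pattern is a valid [MaybeUninit<f64>; N], since
    // MaybeUninit imposes no initialized-bytes requirement.
    let mut a: Box<[MaybeUninit<f64>; N]> = unsafe { Box::new_uninit().assume_init() };

    // MaybeUninit::write initializes each slot without reading the old value.
    let mut x = 0.5_f64;
    for slot in a.iter_mut() {
        slot.write(x);
        x = x * (1.0 - x);
    }

    // All N elements are initialized now, so reinterpreting the box is sound.
    let a: Box<[f64; N]> = unsafe { Box::from_raw(Box::into_raw(a).cast()) };
    println!("{:?}", a[N - 1]);
}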

It depends how you write it. If you use assume_init_mut, then it might be fine, at least according to the playground's Miri. Probably because producing the &mut [f64] doesn't itself require the values behind the reference to satisfy f64's validity invariant, and the code never reads the uninit values. It only assigns to them, which doesn't involve a read because f64 is Copy.

But even if it's valid, I do still think it's a bad way to write it. Passing a slice of uninit values to safe code without having the type-level MaybeUninit marker is just always sketchy.
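For concreteness, the pattern under discussion looks roughly like this (shown only to pin down what Miri was accepting; as said above, it's sketchy, and it relies on the same nightly Box APIs as earlier in the thread):

let a0: f64 = 0.5;
let mut a = Box::<[f64; N]>::new_uninit();
// The questionable step: handing a &mut [f64; N] over uninitialized memory
// to safe code. The fill below only ever writes, never reads.
let slice: &mut [f64; N] = unsafe { a.assume_init_mut() };
iterator_array_fill(slice, &a0, |x| x * (1.0 - x));
let a: Box<[f64; N]> = unsafe { a.assume_init() };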


No, the source of alloc_zeroed shows that an additional write exists:

    #[stable(feature = "global_alloc", since = "1.28.0")]
    unsafe fn alloc_zeroed(&self, layout: Layout) -> *mut u8 {
        let size = layout.size();
        // SAFETY: the safety contract for `alloc` must be upheld by the caller.
        let ptr = unsafe { self.alloc(layout) };
        if !ptr.is_null() {
            // SAFETY: as allocation succeeded, the region from `ptr`
            // of size `size` is guaranteed to be valid for writes.
            unsafe { ptr::write_bytes(ptr, 0, size) };
        }
        ptr
    }

Last question(s):

This isn't necessarily about Rust, but I was always under the impression that the NumPy library for Python, or the Julia language, worked by wrapping C arrays behind a high-level interface, and that those C arrays were stack-allocated. Does anyone know if this is simply a misunderstanding on my part, or do they use other tricks comparable to some of the solutions presented here, like spawning a thread with a bigger stack?

I'm asking because, if I understand correctly and stack space limitations make working with arrays impractical from a performance standpoint, why wouldn't people who write mathematical libraries base their algorithms around iterators instead, which don't involve the performance penalty that things like threads do?

(And just to rule out any confusion: are iterators stack- or heap-allocated?)