Rust does not eagerly free stack space

Hello, I am writing embedded and one single problem continuously eats hours of my time.

Imagine some function needs to make a big (relatively) allocation on the stack. It allocates on the stack, uses the allocated value and no longer needs it. Value has no drop and it is scoped in { } inside the function, so there is no need to keep it around until the end of the function.

But this is exactly what Rust does. I inspect assembly and see that Rust always generates one sp bump at the start for the worst case and will not deallocate anything back. The problems begin when I call next functions - they run on a stack with less space and fail. I constantly need to go around and toke chunks of code into inline(never) functions just to salvage the situation a little. Is it by design? This is horrible. I would even accept if Rust would generage memmove to defragment the stack to reduce stack usage if it is required.

Maybe I am missing something? Some compilation flag that will magically solve all problems?

Here is a simple test, on std. It would be great if blocks behave as inline(never) function by default. It would save tons of RAM and time.

fn main() {
    let mut pointers = [0usize; 100];
    test1_1(&mut pointers);

    println!("Test 1 Results:");
    print_abs_and_diff(&mut pointers[0..4]);

    test2_1(&mut pointers);
    println!("\nTest 2 Results:");
    print_abs_and_diff(&mut pointers[0..4]);
}

fn print_abs_and_diff(pointers: &mut [usize]) {
    println!("Absolute stack pointers:");
    for (i, &ptr) in pointers.iter().enumerate() {
        println!("Depth {}: {:#x}", i, ptr);
    }
    println!("\nDifferences between consecutive stack pointers:");
    for i in 1..pointers.len() {
        let diff = pointers[i - 1] as isize - pointers[i] as isize;
        println!("Depth {} to {}: {}", i - 1, i, diff);
    }
}

///////////////////////////////////

#[inline(never)]
fn test1_1(p: &mut [usize]) {
    p[0] = get_stack_pointer();
    test1_2(p);
}

#[inline(never)]
fn test1_2(p: &mut [usize]) {
    {
        p[1] = get_stack_pointer();
        let mut big = [0u8; 3 * 1000 * 1000];
        core::hint::black_box(&mut big);
    }
    p[2] = get_stack_pointer();
    test1_3(p);
}

#[inline(never)]
fn test1_3(p: &mut [usize]) {
    p[3] = get_stack_pointer();
}

///////////////////////////////

#[inline(never)]
fn test2_1(p: &mut [usize]) {
    p[0] = get_stack_pointer();
    test2_2(p);
}

#[inline(never)]
fn test2_2(p: &mut [usize]) {
    #[inline(never)]
    fn big(p: &mut [usize]) {
        p[1] = get_stack_pointer();
        let mut big = [0u8; 3 * 1000 * 1000];
        core::hint::black_box(&mut big);
    }
    p[2] = get_stack_pointer();
    test2_3(p);
}

#[inline(never)]
fn test2_3(p: &mut [usize]) {
    p[3] = get_stack_pointer();
}

///////////////////////////////////

#[inline(always)]
fn get_stack_pointer() -> usize {
    let sp: usize;
    unsafe {
        std::arch::asm!("mov {}, rsp", out(reg) sp);
    }
    sp
}

(Profile is Oz)

By the way, in debug builds each match arm may gen a separate allocation on the stack instead of the worst case :sweat_smile:

2 Likes

Yes. LLVM allocates stack frame for a whole function and I am not aware about any options which would allow you to change it, so manual #[inline(never)] is the only option for optimizing such cases.

2 Likes

It would be interesting if rustc or llvm could add a transformation pass to turn those (llvm-ir) allocas into (libc) allocas. We could theoretically reclaim some stack usage at the cost of runtime performance.

1 Like

I wouldn't call it by design so much as a limitation of the current implementation. Here's a related issue.

7 Likes

I don't think this is that relevant here. The issue you linked is about rust using less stack space in functions. But i understood the question to be about requiring different amount of stack at different points in the function execution.

I think that even if rust could for example put two large arrays in the same function at the same stack address (what the issue you linked is about), this still wouldn't solve OPs issue. They want to put another functions stack frame into the space where the big array previously was.

2 Likes

Libc doesn't provide alloca. It is a compiler builtin which Clang lowers to the alloca LLVM IR instruction just like for regular variables on the stack. The only difference is that it may have a dynamic size and/or isn't located in the entry block. LLVM will statically allocate space for all alloca in the entry block, while dynamically allocating space for all other alloca every time they get executed. It doesn't reuse and once hit permanently (until returning from the function) grows the stack.

I meant the alloca provided by alloca.h on some Unix-y systems. Or do you mean that this is implemented as a compiler builtin, and is therefore lowered to the alloca LLVM IR instruction regardless?

EDIT: Maybe the transformation pass could look something more like alloca(n) --> alloca(black_box(n)) then?

That's probably not a problem on 64 bit systems as unused stack pages can just be unmapped by OS, but when each address is physical it is a huge problem.

Yeah. The alloca definition in alloca.h of glibc is a macro expanding to __builtin_alloca, which gets lowered to the alloca instruction by clang and equivalent by gcc. alloca can only be a compiler builtin as it messes with the stack pointer, but when returning from a function you have to restore the stack pointer to the original value.

That doesn't help with freeing memory at all. Dynamic alloca never gets deallocated by LLVM either.

It doesn't help the case where the function is getting inlined? I thought that was @Ddystopia's problem here, that LLVM can see that an (implied) dealloc of a compiletime-known-size alloca has no observable effects and so postpones that dealloc beyond function scope.

Maybe LLVM's alloca is not as powerful as GCC's here? GCC/glibc promises that their alloca allocates bytes on the stack of the calling function, and not in a higher-level caller.