How to diagnose a `stack overflow` issue's cause?


#1

As the title says, sometimes the user gets a stack overflow error like the following:

thread 'main' has overflowed its stack
fatal runtime error: stack overflow
Aborted

And, especially when the call “hierarchy” is deeply nested, he has no clue where the issue is “coming from”.


Therefore my questions are:

(A) Is there a way to get a hint of the last few functions called, so that the user can start a targeted debugging of the problem? (Currently he has no hint of the probable cause…)

(B) Is there a way for the user code to “catch” this kind of error, similar to how the user can intercept panics?


Hopefully there is an answer for (A), at least in debug mode.

For (B) one could – if it makes sense for the particular use-case – start a separate thread and catch if it aborts from the main thread. (For example my use-case is a Scheme interpreter, thus this technique, although cumbersome, could be used to provide more useful error messages.)
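For an interpreter, a related workaround is to count the evaluation depth explicitly and return an ordinary error before the real stack runs out. The following is only a minimal sketch under assumptions: `Expr`, `EvalError`, `eval`, `nested` and the `MAX_DEPTH` value are hypothetical names made up for illustration, not part of any existing crate, and the limit would need tuning against the actual stack size.

```rust
/// Hypothetical, tiny expression type standing in for a Scheme AST.
enum Expr {
    Num(i64),
    Add(Box<Expr>, Box<Expr>),
}

/// Hypothetical error reported instead of letting the process abort.
#[derive(Debug, PartialEq)]
enum EvalError {
    TooDeep { limit: usize },
}

/// Assumed limit; a real interpreter would tune this to its stack size.
const MAX_DEPTH: usize = 100;

/// Evaluates `expr`, refusing to recurse deeper than `MAX_DEPTH`.
fn eval(expr: &Expr, depth: usize) -> Result<i64, EvalError> {
    if depth > MAX_DEPTH {
        return Err(EvalError::TooDeep { limit: MAX_DEPTH });
    }
    match expr {
        Expr::Num(n) => Ok(*n),
        Expr::Add(a, b) => Ok(eval(a, depth + 1)? + eval(b, depth + 1)?),
    }
}

/// Builds `1 + (1 + (... + 0))` with `n` nested additions.
fn nested(n: usize) -> Expr {
    let mut e = Expr::Num(0);
    for _ in 0..n {
        e = Expr::Add(Box::new(Expr::Num(1)), Box::new(e));
    }
    e
}

fn main() {
    // Shallow expression: evaluates normally.
    println!("{:?}", eval(&nested(10), 0));
    // Deep expression: reported as an error the caller can handle.
    println!("{:?}", eval(&nested(500), 0));
}
```

This only covers recursion that goes through `eval`; a deep recursion elsewhere in the program would still overflow the stack.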

Thanks for the feedback,
Ciprian.


#2

A stack overflow aborts the process: it doesn’t unwind and is not catchable. You’d need the compiler to insert stack probes (sometimes referred to as stack banging) and initiate unwinding if the probe fails; this would necessarily incur a perf penalty.

So the best bet for debugging, for the moment, is to open the core in a debugger (e.g. gdb) and inspect the callstack there. From a user’s perspective, this isn’t all that easy necessarily, and they may not even know how to debug via the core file.


#3

You’d need the compiler to insert stack probes (sometimes referred to as stack banging) and initiate unwinding if the probe fails; this would necessarily incur a perf penalty.

Could you point me in the direction of how to enable such a feature in Rust? (Or is it not currently available?)


#4

Yes, it’s not available AFAIK - I was theorizing about what would need to occur :slight_smile:. This is the approach that some of the managed runtimes (e.g. JVM) take.

Right now Rust uses guard pages to trap the stack overflow (which is a segfault at the kernel level) so it can at least tell you that it’s a stack overflow and not some other segfault.


#5

I understand. So basically there is currently no easily accessible way to directly diagnose stack overflow aborts.


However approaching the issue from another angle: is there in Rust a simple way to trace – printing on stderr or similar – all the entries (and possibly exits) from user functions?

Perhaps a procedural macro crate that would allow one to “annotate” all functions of interest? Or even better an external tool similar to kcov that would enable this?
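To illustrate what such tracing could look like when done by hand, here is a minimal sketch using a guard value that logs on creation and on drop. `TraceGuard`, `DEPTH` and `factorial` are hypothetical names invented for this example; this is not an existing crate's API, and each function of interest has to be annotated manually.

```rust
use std::cell::Cell;

thread_local! {
    /// Current nesting depth of traced calls on this thread.
    static DEPTH: Cell<usize> = Cell::new(0);
}

/// Logs function entry when created and function exit when dropped.
struct TraceGuard {
    name: &'static str,
}

impl TraceGuard {
    fn new(name: &'static str) -> Self {
        // Read the current depth, then increment it for nested calls.
        let depth = DEPTH.with(|d| {
            let v = d.get();
            d.set(v + 1);
            v
        });
        eprintln!("{}-> {}", "  ".repeat(depth), name);
        TraceGuard { name }
    }
}

impl Drop for TraceGuard {
    fn drop(&mut self) {
        // Decrement first so the exit line aligns with the entry line.
        let depth = DEPTH.with(|d| {
            let v = d.get() - 1;
            d.set(v);
            v
        });
        eprintln!("{}<- {}", "  ".repeat(depth), self.name);
    }
}

/// Example of a traced recursive function.
fn factorial(n: u64) -> u64 {
    let _guard = TraceGuard::new("factorial");
    if n <= 1 {
        1
    } else {
        n * factorial(n - 1)
    }
}

fn main() {
    println!("{}", factorial(5));
}
```

Since a stack overflow aborts without unwinding, the exit lines would not be printed in that case, but the entry lines already written to stderr would show the last traced functions that were entered.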


#6

RUST_BACKTRACE=1


#7

This doesn’t do anything for stack overflows because, as mentioned, they abort.


#8

The trace macro might help you zero in on what’s happening. I agree it’s sub-optimal that Rust can’t tell you on a debug build what the last function called was. It would be great to at least try something like that out behind a feature.


#9

It seems that the compiler now can insert stack probes:

Is there an outstanding ticket tracking this issue? It is not uncommon to overflow the stack while coding.


#10

That’s for x86 only and it’s meant to circumvent the issue of an allocation overrunning the guard page. I think you’d need to (a) have reliable stack probes on all (tier 1, at least) platforms and (b) wire up unwinding to trigger when the probe fails.

There’s https://github.com/rust-lang-nursery/rust-wasm/issues/141, which is fairly recent, but for wasm (and is closed now). I don’t know of any existing tracking issue for unwindable stack overflows. Maybe someone else does though …


#11

I couldn’t find one, so I’ve raised one here: https://github.com/rust-lang/rust/issues/51405