Preventing abort on secondary threads


#1

I have an architecture with 2 threads, the main thread and a worker thread that has a larger-than-normal stack.
It has happened once or twice that the worker thread has overflowed its stack, at which point the entire program aborts.
This abort behavior is very undesirable. What I want instead is some kind of error signal that I can use in the main thread to restart the worker thread and its contents.

I thought I could accomplish that with std::panic::catch_unwind(), but it turns out that it only catches unwinding panics and is useless for preventing aborts.
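To illustrate the distinction, here is a minimal sketch of what catch_unwind() does handle: an ordinary unwinding panic. The names parse_or_panic and try_parse are made up for the example. A stack overflow never reaches this path, because the runtime aborts the process instead of unwinding.

```rust
use std::panic;

// Hypothetical stand-in for work that fails with an ordinary panic.
fn parse_or_panic(input: &str) -> usize {
    if input.is_empty() {
        panic!("empty input");
    }
    input.len()
}

// Converts an unwinding panic into an Err value.
// This does NOT catch aborts such as a stack overflow.
fn try_parse(input: &str) -> Result<usize, String> {
    panic::catch_unwind(|| parse_or_panic(input))
        .map_err(|_| "parser panicked".to_string())
}

fn main() {
    println!("{:?}", try_parse("abc"));
    println!("{:?}", try_parse(""));
}
```

This also assumes the default panic = "unwind" setting; with panic = "abort" even ordinary panics are not caught.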

Does Rust have a reliable way to turn aborts into an error signal? At this point I don’t even care about things like an error cause; the combination of not bluntly aborting on the one hand, plus being able to restart the busted thread on the other, would allow me to at least implement a reliable recovery strategy.

BTW: As for why I have a single worker thread, that’s because it’s not possible to alter the stack size of the main thread in Rust, whereas it is possible for threads I spawn myself. A single-threaded solution would have had my preference; the extra thread is a necessary but rather heavy-handed and blunt workaround.
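For readers unfamiliar with the setup, a worker with an enlarged stack can be sketched roughly as follows using std::thread::Builder. The 64 MiB figure, the function names, and the depth_sum workload are assumptions for illustration, not my actual code.

```rust
use std::thread;

// Hypothetical stack size for the worker thread.
const WORKER_STACK_BYTES: usize = 64 * 1024 * 1024;

// Hypothetical deeply recursive workload standing in for the parser.
fn depth_sum(n: u64) -> u64 {
    if n == 0 { 0 } else { n + depth_sum(n - 1) }
}

// Runs the workload on a thread with a larger-than-default stack.
// join() reports a panic in the worker as Err, but a stack overflow
// still aborts the whole process before join() can return.
fn run_on_big_stack(n: u64) -> Result<u64, String> {
    let handle = thread::Builder::new()
        .name("worker".into())
        .stack_size(WORKER_STACK_BYTES)
        .spawn(move || depth_sum(n))
        .map_err(|e| format!("spawn failed: {e}"))?;
    handle.join().map_err(|_| "worker panicked".to_string())
}

fn main() {
    println!("{:?}", run_on_big_stack(100_000));
}
```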


#2

It’s a bad idea to attempt to handle a stack overflow (SO). It’s an asynchronous exception that can happen virtually anywhere (not literally, but close enough in practical terms). Unwinding can be futile because handlers may hit a secondary SO while attempting to handle the first one (I know you’re not looking to do that, but we’re talking in general terms). Invariants can get messed up, since code doesn’t expect unwinding to happen at virtually any point in execution. So while it may be possible to withstand it in some really narrow set of circumstances, it’s generally not a good idea.

Why are you overflowing the stack? Is it possible to fix the code?

I’d probably look at IPC and move the SO prone code into a subprocess instead of thread. You’ll get more overall resilience that way. It’ll be more upfront work to set it up but if this is something fundamental to your application then it might be worth it.


#3

The reason for the possibility of stack overflow is the combination of recursive descent parsing + a deeply nested grammar, which yields deeply nested parse trees. Processing these trees translates to high peak stack usage. The grammar was not defined by me, and is so deeply nested in order to handle things like operator precedence and some other ambiguities. Thus, when an expression (in that language) being parsed is too deeply nested, the code will overflow and AFAIK there is nothing I can do to prevent that. All I can potentially do is recover from it, which is less than ideal as state can still be lost, but it’s infinitely better than not handling the SO at all and just letting it crash in production.

I have already taken steps to minimize the occurrence of SO. However, the possibility will never go away completely and needs to be handled properly. Here that means “detect in the main thread that the SO happened in the worker thread, then restart the worker thread”, which was my original question.

Aside from that, I strongly disagree that it is a bad idea to handle this error. It is the sole weak spot in the application that can and more importantly already has caused issues in production.

Multi-process is not an option for the same reason I have 2 threads: Rustc effectively forced me to build a 2-thread architecture by not allowing me to set the stack size of the main thread. If it did, 1 thread in 1 process would perhaps suffice, depending on how setting the main thread’s stack size behaves.

As it stands, however, moving to multi-proc would cause many headaches and solve absolutely nothing, since in the “slave” proc I would have this exact same issue again, and there is no signal I can send to the effect of “Hey there main proc, I just SO’d”. I’m not even sure that’s possible, as by definition the slave proc is already in a messed-up state when it overflows its stack. If I could execute arbitrary code at that point I could just emit a simple Result<_, StackOverflowErr> or something like that without involving more processes at all. Or perhaps even just continue parsing.


#4

How about adding a recursion counter? Then you can return an error yourself when the depth reaches some limit you set, before the stack actually overflows.
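A rough sketch of the idea, using a toy grammar of nested parentheses rather than your real one. Parser, ParseError, MAX_DEPTH and nesting_depth are all made-up names, and the limit of 128 is arbitrary; you would tune it to your stack size.

```rust
#[derive(Debug, PartialEq)]
enum ParseError {
    TooDeep,
    Unbalanced,
}

// Hypothetical limit; pick it so the real stack cannot overflow first.
const MAX_DEPTH: usize = 128;

struct Parser<'a> {
    input: &'a [u8],
    pos: usize,
    depth: usize,
}

impl<'a> Parser<'a> {
    fn new(input: &'a str) -> Self {
        Parser { input: input.as_bytes(), pos: 0, depth: 0 }
    }

    // group := '(' group* ')'. Returns the nesting depth of the group.
    // The counter is checked on entry and restored on every exit path.
    fn group(&mut self) -> Result<usize, ParseError> {
        if self.depth >= MAX_DEPTH {
            return Err(ParseError::TooDeep);
        }
        self.depth += 1;
        let result = self.group_body();
        self.depth -= 1;
        result
    }

    fn group_body(&mut self) -> Result<usize, ParseError> {
        if self.input.get(self.pos) != Some(&b'(') {
            return Err(ParseError::Unbalanced);
        }
        self.pos += 1;
        let mut deepest_child = 0;
        while self.input.get(self.pos) == Some(&b'(') {
            deepest_child = deepest_child.max(self.group()?);
        }
        if self.input.get(self.pos) != Some(&b')') {
            return Err(ParseError::Unbalanced);
        }
        self.pos += 1;
        Ok(deepest_child + 1)
    }
}

// Parses exactly one group and rejects trailing input.
fn nesting_depth(input: &str) -> Result<usize, ParseError> {
    let mut parser = Parser::new(input);
    let depth = parser.group()?;
    if parser.pos != parser.input.len() {
        return Err(ParseError::Unbalanced);
    }
    Ok(depth)
}

fn main() {
    println!("{:?}", nesting_depth("(()())"));
    let deep = "(".repeat(1000) + &")".repeat(1000);
    println!("{:?}", nesting_depth(&deep));
}
```

The point is that overly deep input becomes an ordinary Err that the caller can report, instead of an abort.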

See also https://stackoverflow.com/questions/7291273/is-it-possible-to-terminate-only-the-one-thread-on-receiving-a-sigsegv for why trying to handle the SO probably isn’t a good idea.


#5

You would install a SIGCHLD signal handler in the parent process. If the child proc dies, including via an abort, your signal handler would be notified. You can then get more info on the child proc’s fate by calling wait on it.
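As a simplified sketch of the "wait on it" part: instead of a SIGCHLD handler, the code below just blocks on the child with std::process::Command::status and inspects how it ended. It is Unix-only, and ChildOutcome, classify and run_child are names invented for the example.

```rust
use std::os::unix::process::ExitStatusExt;
use std::process::{Command, ExitStatus};

#[derive(Debug, PartialEq)]
enum ChildOutcome {
    Finished,
    Failed(i32),
    KilledBySignal(i32),
}

// Distinguishes a normal exit from death by signal (for example an abort).
fn classify(status: ExitStatus) -> ChildOutcome {
    if let Some(signal) = status.signal() {
        ChildOutcome::KilledBySignal(signal)
    } else if status.success() {
        ChildOutcome::Finished
    } else {
        ChildOutcome::Failed(status.code().unwrap_or(-1))
    }
}

// Runs the child to completion and reports its fate to the parent.
fn run_child(cmd: &mut Command) -> std::io::Result<ChildOutcome> {
    Ok(classify(cmd.status()?))
}

fn main() {
    // "true" is only a placeholder for the real worker executable.
    match run_child(&mut Command::new("true")) {
        Ok(outcome) => println!("child ended: {:?}", outcome),
        Err(err) => eprintln!("could not run child: {err}"),
    }
}
```

The parent could loop on run_child and start a fresh child whenever it sees KilledBySignal.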


#6

Thanks for the suggestions! I’ll have to consider each of them before acting on any of them.


#7

The SIGCHLD solution is fairly robust if you know you’ll only ever run on *nix platforms (I think Windows does things differently). The “right” answer to these kinds of problems is to prevent them from happening in the first place… Although that’s not exactly helpful or feasible. It sounds like the root issue is the deeply recursive nature of your problem though, and there may be ways to mitigate that.

I know you can typically convert a recursive algorithm into an iterative one, would it be possible to do that in your case? Otherwise you may be able to use a different parsing algorithm altogether. The lalrpop crate may be a good starting place, although rewriting a parser can be a pretty time-consuming process if you don’t have a thorough test suite on hand.
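To show what I mean by converting recursion to iteration, here is a small sketch on a toy tree, not on a parser. The call stack is replaced by an explicit Vec on the heap, so depth is limited by memory rather than by the thread's stack size. Node, depth_iterative and chain are made up for the example.

```rust
struct Node {
    children: Vec<Node>,
}

// Computes the tree depth with an explicit work stack instead of recursion.
fn depth_iterative(root: &Node) -> usize {
    let mut max_depth = 0;
    let mut stack = vec![(root, 1usize)];
    while let Some((node, depth)) = stack.pop() {
        max_depth = max_depth.max(depth);
        for child in &node.children {
            stack.push((child, depth + 1));
        }
    }
    max_depth
}

// Builds a chain of `len` nodes, each the only child of the previous one.
// Note: the compiler-generated Drop for Node is still recursive, so an
// extremely long chain would need a manual, iterative Drop as well.
fn chain(len: usize) -> Node {
    let mut node = Node { children: vec![] };
    for _ in 1..len {
        node = Node { children: vec![node] };
    }
    node
}

fn main() {
    println!("{}", depth_iterative(&chain(1000)));
}
```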


#8

I know you can typically convert a recursive algorithm into an iterative one, would it be possible to do that in your case? Otherwise you may be able to use a different parsing algorithm altogether. The lalrpop crate may be a good starting place, although rewriting a parser can be a pretty time-consuming process if you don’t have a thorough test suite on hand

No, it’s likely not possible to do this iteratively, because many functions call each other recursively at times that are not predictable in advance, at least not in general. On top of that I use scannerless parsing, as keeping the lexer step separate from parsing proper introduces other parsing issues. And in any case, the herculean effort that would be required even if it were possible means this is not a realistic option.

It’s unfortunately also not practically doable to use either LL or LR, since my solution has the ability to parse any Context-Free Grammar (and the code could perhaps even be adjusted to do context-sensitive parsing) not just some semi-arbitrary subset, and the reduction in functionality would not be acceptable.

The only real options available here are: SGLR (which is theoretically equivalent to my solution, but has its own problems and is thus not preferable, not to mention the work involved in transitioning); having OSes finally break with their stack size limitations (a stack of 1GB should be possible if the programmer requests it; the current limitations are rather arbitrary and in the vein of “640kb should be enough for everybody”); or bounding the stack usage manually.
I’m currently leaning towards the last option, as I don’t have the resources to alter the core of entire kernels.