Handling unintended panics in Tokio tasks gracefully

Hello!

I've been looking into using Tokio for a toy project and I'm trying to understand how to handle unintended panics in a Tokio task gracefully.

I understand that my code should never panic where it could use normal error handling instead, but I would still like to handle panics gracefully. They could come from my own code by accident, or from a crate that panics in cases I didn't properly account for.

As far as I understand, Tokio will by default let the thread crash, print the error to stderr and carry on. I noticed some work was being done to offer more control (tokio-rs/tokio#495, tokio-rs/tokio#700).

Ideally, I want to be able to see which tasks failed by panicking, log this somewhere, possibly restart the task that failed or maybe start another task that has the same purpose as the task that just failed. Is the right way to do this to use catch_unwind on the Future? Is there some solution involving channels that's the right way to do this? Am I trying to do too much and would end up fighting with the Tokio runtime - should I use another solution/crate? Any suggestions, advice or thoughts would be helpful, thanks!

You can call catch_unwind on your future before giving it to Tokio. This will turn it into a fallible future that results in an error when the inner future panics.
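To make the idea concrete, here is a minimal std-only sketch of what such a wrapper does. With the futures crate you would normally reach for FutureExt::catch_unwind (typically after wrapping the future in AssertUnwindSafe) rather than writing this yourself. The names PanicGuard, flaky and block_on are made up for illustration, and block_on is a toy busy-polling executor standing in for the Tokio runtime.

```rust
use std::any::Any;
use std::future::Future;
use std::panic::{self, AssertUnwindSafe};
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical wrapper: resolves to `Err(payload)` if the inner future
/// panics while it is being polled, and to `Ok(value)` otherwise.
struct PanicGuard<F> {
    inner: Pin<Box<F>>,
}

impl<F: Future> PanicGuard<F> {
    fn new(fut: F) -> Self {
        PanicGuard { inner: Box::pin(fut) }
    }
}

impl<F: Future> Future for PanicGuard<F> {
    type Output = Result<F::Output, Box<dyn Any + Send>>;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        let inner = self.inner.as_mut();
        // Catch a panic raised by this particular poll of the inner future.
        match panic::catch_unwind(AssertUnwindSafe(|| inner.poll(cx))) {
            Ok(Poll::Pending) => Poll::Pending,
            Ok(Poll::Ready(value)) => Poll::Ready(Ok(value)),
            Err(payload) => Poll::Ready(Err(payload)),
        }
    }
}

/// Waker that does nothing; enough for futures that never really suspend.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Toy executor (assumption: stands in for the real runtime). It polls in a
/// loop, so it is only suitable for futures that complete promptly.
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    loop {
        if let Poll::Ready(value) = fut.as_mut().poll(&mut cx) {
            return value;
        }
    }
}

/// Example task that panics on demand.
async fn flaky(fail: bool) -> u32 {
    if fail {
        panic!("boom");
    }
    42
}

fn main() {
    let ok = block_on(PanicGuard::new(flaky(false)));
    let failed = block_on(PanicGuard::new(flaky(true)));
    println!("normal task succeeded: {}", ok.is_ok());
    println!("panicking task reported as error: {}", failed.is_err());
}
```

The default panic hook still prints the panic message to stderr; the wrapper only changes what the caller observes. AssertUnwindSafe is a promise that nothing is left in a broken state after the panic, so it is worth checking that the task does not share half-updated data with the rest of the program.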

It depends on what you mean by "gracefully". I usually do one of two things:

  • Declare that panics are serious and not tolerated. I compile with panic = "abort".
  • Accept that panics can happen sometimes. In that case I handle panics in "critical" places (e.g. a singleton background thread that, if it dies, leaves the application in a half-dead state), where I abort manually. For the rest, I divert the panics to the log with log-panics. Then they appear in the logs (unlike when printed to stderr) and can be picked up by some kind of monitoring system.
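As a rough illustration of the second option, the sketch below installs a panic hook with std only. As far as I know, log-panics does something similar and forwards the message to the log crate (via log_panics::init()). Here PANIC_LOG and install_panic_logger are hypothetical names, and the in-memory Vec is just a stand-in for a real logging backend.

```rust
use std::panic;
use std::sync::Mutex;
use std::thread;

// Stand-in for a real logging backend (hypothetical; a real setup would
// hand the message to a logger instead of an in-memory list).
static PANIC_LOG: Mutex<Vec<String>> = Mutex::new(Vec::new());

/// Replaces the default hook (which prints to stderr) with one that
/// records the panic message and source location.
fn install_panic_logger() {
    panic::set_hook(Box::new(|info| {
        // Panic payloads are usually `&str` or `String`.
        let msg = if let Some(s) = info.payload().downcast_ref::<&str>() {
            (*s).to_string()
        } else if let Some(s) = info.payload().downcast_ref::<String>() {
            s.clone()
        } else {
            "non-string panic payload".to_string()
        };
        let location = info
            .location()
            .map(|l| format!("{}:{}", l.file(), l.line()))
            .unwrap_or_else(|| "unknown location".to_string());
        PANIC_LOG
            .lock()
            .unwrap()
            .push(format!("panic at {location}: {msg}"));
    }));
}

fn main() {
    install_panic_logger();

    // A background worker dies; the hook runs on the panicking thread.
    let result = thread::spawn(|| {
        panic!("worker failed");
    })
    .join();
    assert!(result.is_err());

    for line in PANIC_LOG.lock().unwrap().iter() {
        println!("{line}");
    }
}
```

The hook runs for every panic in the process, including ones that are later caught with catch_unwind, so it works well alongside the other approaches in this thread.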

I've also made a wrapper in a hyper-based service to turn panics into 500s, using catch_unwind.
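The general shape of such a wrapper, stripped of anything hyper-specific, might look like the sketch below. Response, router and handle_guarded are invented for illustration and are not hyper types; a real service would wrap its request-handling future rather than a plain function.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Simplified stand-in for an HTTP response (not a hyper type).
#[derive(Debug, PartialEq)]
struct Response {
    status: u16,
    body: String,
}

/// Runs `handler` and converts a panic into a generic 500 response,
/// so one bad request does not take the whole service down.
fn handle_guarded<H>(handler: H, path: &str) -> Response
where
    H: Fn(&str) -> Response,
{
    match panic::catch_unwind(AssertUnwindSafe(|| handler(path))) {
        Ok(response) => response,
        Err(_) => Response {
            status: 500,
            body: "Internal Server Error".to_string(),
        },
    }
}

/// Hypothetical handler with a bug: unknown routes panic.
fn router(path: &str) -> Response {
    match path {
        "/ok" => Response {
            status: 200,
            body: "hello".to_string(),
        },
        _ => panic!("unhandled route: {path}"),
    }
}

fn main() {
    println!("{:?}", handle_guarded(router, "/ok"));
    println!("{:?}", handle_guarded(router, "/missing"));
}
```

Note that this only works when the binary is built with the default unwinding panic strategy; with panic = "abort" there is nothing to catch.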

Thanks for the responses!

@alice, thanks for confirming that the approach of catch_unwind on the Future passed to Tokio is a reasonable one.

@vorner, this is about the second case. I don't want to abort, because I suspect some tasks will hit panics that are recoverable with respect to what the system needs from them. I think the approach I'm considering is similar to yours: have some kind of "supervisor" that can deal with other tasks panicking, and abort the program if the supervisor itself panics for some reason. log-panics looks useful, thanks! Do you have any examples of the setup you've described that I can take a look at?

Sorry, no, the code's owned by the company I work for and is not public.
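For anyone landing here later, the "supervisor" idea discussed above can be sketched roughly as follows. This is not the setup from the non-public code, just an illustration using std threads as a stand-in for Tokio tasks. In current Tokio versions, awaiting the JoinHandle returned by tokio::spawn gives a similar signal: an Err(JoinError) whose is_panic() returns true. The function supervise and its parameters are made up for this example.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Runs `job` on a fresh thread and restarts it after a panic, up to
/// `max_restarts` times. Returns the number of restarts that were needed,
/// or an error message if the job kept panicking.
fn supervise<F>(name: &str, max_restarts: usize, job: F) -> Result<usize, String>
where
    F: Fn() + Send + Sync + 'static,
{
    let job = Arc::new(job);
    let mut restarts = 0;
    loop {
        let run = Arc::clone(&job);
        // `join` returns Err if the worker thread panicked.
        match thread::spawn(move || (*run)()).join() {
            Ok(()) => return Ok(restarts),
            Err(_) if restarts < max_restarts => {
                restarts += 1;
                eprintln!("task {name} panicked; restart {restarts}/{max_restarts}");
            }
            Err(_) => return Err(format!("task {name} kept panicking, giving up")),
        }
    }
}

fn main() {
    // A job that fails on its first two runs and then succeeds.
    let attempts = Arc::new(AtomicUsize::new(0));
    let counter = Arc::clone(&attempts);
    let outcome = supervise("flaky", 3, move || {
        if counter.fetch_add(1, Ordering::SeqCst) < 2 {
            panic!("transient failure");
        }
    });
    println!("outcome: {outcome:?}, runs: {}", attempts.load(Ordering::SeqCst));
}
```

If the supervisor itself is the critical piece, its own top-level code could be wrapped in catch_unwind and call std::process::abort() on failure, which matches the "abort manually in critical places" approach described earlier.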