Detect if any of several threads finishes or panics early?

I have a handful of threads doing things for a program. All are critical, and none are expected to exit on their own. What's the most correct/safe/ergonomic way to detect if one of them has finished early, and propagate a panic if one of them has panicked?

Things I've considered:

  • Looping and try_join-ing the join handles in sequence. Would probably work, but the loop strikes me as not the correct answer because of all the polling.
  • Thread scopes. Initially promising on the panic angle at least, but then I realized the scope only panics when ALL the contained threads have joined.
  • Async with Tokio. Make each thread a blocking Task, then do a select! or use a JoinSet. Best result I've found yet. Mirrors what Nextest does. But apparently, blocking tasks aren't meant to be equivalent to threads, and should only be used for short-running things.

Currently leaning to the Async with Tokio approach, but I have my doubts. Anyone willing to chip in their two cents?

Not sure, but I think you need to make sure your threads don't panic: use catch_unwind to catch all panics in each thread, then communicate to all threads that a panic has occurred and have them cooperatively stop executing.

You could set flags to immediately abort the program upon a panic, which will exit all threads. Check if a helpful error message is printed though.
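
A related option that keeps the default panic message (and the backtrace, when RUST_BACKTRACE is set) is a panic hook that runs the existing hook and then does something of your choosing, such as exiting. This is a different technique from the panic = 'abort' setting, and only a rough sketch: `install_panic_action` and its `after` parameter are names I made up, not a std API.

```rust
use std::panic;

/// Installs a panic hook that first runs the previously installed hook
/// (by default that prints the panic message, plus a backtrace when
/// RUST_BACKTRACE is set) and then calls `after`.
///
/// `install_panic_action` is a hypothetical helper name, not part of std.
pub fn install_panic_action<F>(after: F)
where
    F: Fn() + Send + Sync + 'static,
{
    // Keep whatever hook was active so its output is not lost.
    let previous = panic::take_hook();
    panic::set_hook(Box::new(move |info| {
        previous(info);
        after();
    }));
}
```

Calling `install_panic_action(|| std::process::exit(1))` at the top of `main` should take the whole process down on the first panic in any thread. Note that the hook runs before unwinding starts, so with an exiting action the panicking thread's destructors do not run.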

You can use async without specifically using Tokio’s blocking thread pool. You only need a oneshot channel.

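// `oneshot()` stands for any async oneshot channel constructor,
// e.g. tokio::sync::oneshot::channel().
let (tx, rx) = oneshot();
std::thread::spawn(move || {
    // send() only fails if rx was dropped; nothing to do in that case.
    let _ = tx.send(do_actual_work());
});
// then put rx in your `select!`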

This will wake up the receiver when the thread panics because tx will be dropped, but if you want to also pass the panic payload to propagate, that is just a matter of adding catch_unwind() to this. (That’s essentially the same as how the standard library puts panics in thread JoinHandles, even!)

You can also do a similar thing without any async and just a single MPSC channel, by sending a value when the thread completes or panics.

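use std::panic::catch_unwind;

let (tx, rx) = std::sync::mpsc::channel();
/* for each thread */ {
    let tx = tx.clone();
    std::thread::spawn(move || {
        // Sends Ok(value) on normal return, Err(panic payload) on panic.
        let _ = tx.send(catch_unwind(|| do_actual_work()));
    });
}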
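
To make the MPSC variant concrete, here is a self-contained sketch that spawns a set of jobs and blocks until the first one returns or panics. The helper name `wait_for_first_exit`, the `Exit` alias, and the boxed-closure job type are my own choices for illustration, and it assumes the default unwinding panic strategy (with panic = 'abort' there is nothing to catch).

```rust
use std::any::Any;
use std::panic::{self, AssertUnwindSafe};
use std::sync::mpsc;
use std::thread;

/// What a worker reports when it stops: `Ok(())` if it returned,
/// `Err(payload)` if it panicked.
pub type Exit = Result<(), Box<dyn Any + Send + 'static>>;

/// Spawns every job on its own thread and blocks until the first one stops.
/// Returns the index of that job together with how it stopped.
///
/// `wait_for_first_exit` is a hypothetical helper name.
pub fn wait_for_first_exit(jobs: Vec<Box<dyn FnOnce() + Send + 'static>>) -> (usize, Exit) {
    let (tx, rx) = mpsc::channel();
    for (index, job) in jobs.into_iter().enumerate() {
        let tx = tx.clone();
        thread::spawn(move || {
            // AssertUnwindSafe: we only report the outcome and never touch
            // state the job may have left half-updated.
            let outcome = panic::catch_unwind(AssertUnwindSafe(job));
            // The receiver may already be gone; nothing useful to do then.
            let _ = tx.send((index, outcome));
        });
    }
    // Drop the original sender so only the workers' clones remain.
    drop(tx);
    // Blocks without polling; errors only if there were no jobs at all.
    rx.recv().expect("no jobs were given")
}
```

In `main` you would call this once and, on `Err(payload)`, hand the payload to `std::panic::resume_unwind(payload)` to propagate the original panic. The other threads keep running until the process exits.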

I assume you're talking about the panic = 'abort' option. Did look at that, but was thrown off by the loss of stack unwinding in the panicking thread.
Behavior I want from that option is:

  1. stop execution of all threads
  2. unwind panicking thread and print stack trace
  3. exit program

Default behavior skips steps 1 and 3, while abort behavior skips step 2.

Tokio takes care of most of that, and if I'm fine with exiting the program to close the not-yet-stopped threads (which I am), then it covers it all.

Does it skip backtraces though? I recall that it doesn't (which was a reason why they began building panic = 'immediate-abort').

At least, for nested panics all the backtraces are printed; see Rust Playground.

I don't remember why I thought that. I may be misinformed.

Looking at @kpreid's solution, I think I mostly overestimated the boilerplate for this one.

This solution still won't magically make the threads stop if do_actual_work never finishes.

The OP mentioned:

All are critical, and none are expected to exit on their own

So you'd still need to make the do_actual_work threads block on (or poll) some signal that is triggered when one of the other threads sends a message. If do_actual_work reads from a channel, you can enqueue a termination message.
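
A rough sketch of the termination-message idea, assuming the worker already takes its input from a std mpsc channel. The `Msg` enum, `spawn_worker`, and the summing "work" are all made up for illustration.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};

/// Messages a worker understands. `Shutdown` is the termination message.
pub enum Msg {
    Work(u32),
    Shutdown,
}

/// Spawns a worker that sums the numbers it receives until it is told to
/// stop (or until every sender is dropped). Returns the sending half of
/// its queue and a handle that yields the sum.
///
/// The summing is a stand-in for real work; all names here are hypothetical.
pub fn spawn_worker() -> (Sender<Msg>, JoinHandle<u32>) {
    let (tx, rx) = mpsc::channel();
    let handle = thread::spawn(move || {
        let mut total = 0;
        // `recv()` blocks, so the worker does not spin while idle.
        while let Ok(msg) = rx.recv() {
            match msg {
                Msg::Work(n) => total += n,
                Msg::Shutdown => break,
            }
        }
        // Cleanup or persisting state would go here before returning.
        total
    });
    (tx, handle)
}
```

When one thread reports a failure, the supervising thread can send `Msg::Shutdown` to every remaining worker and then join them, which gives each one a chance to clean up before the program exits.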

If the main thread exits, the whole program stops too, so it can just wait for the first item in the channel.

When the main thread of a Rust program terminates, the entire program shuts down, even if other threads are still running. However, this module provides convenient facilities for automatically waiting for the termination of a thread (i.e., join).

(std::thread - Rust)

That may be helpful. The OP didn't specify that they were going to exit the program, as far as I can tell. It may also not be desirable if the worker threads need to run some cleanup or persist some data.

Fits my use case, at least