Exception safety in work-stealing thread pool

For the past couple days I've been working on jobsteal, which is a work-stealing fork-join threadpool. It works like a charm if the jobs you submit don't panic, but I've been having some trouble getting it to work in an exception-safe manner.

I spoke with @Gankra briefly on IRC, and he suggested I make a post here.

The key difference between jobsteal's model and that of @Kimundi's scoped_threadpool and @aturon's crossbeam is that join handles and the scope object busy-wait: while they wait for jobs to finish, they run other jobs. This is done to avoid wasting time.

The code up at the repo is far from complete, but it works perfectly provided jobs do not panic.

There are a few questions I'd like to discuss here:

  • Where should job data be stored? Should it be shared via Arc, or stored in thread-local pools?
  • How should worker threads handle panics? Should they silently fail, or be wrapped in a catch_panic/recover call that reboots them on panic?
  • Can you safely implement a 'scoped' feature in a system where jobs can both panic and be spread over an arbitrary number of threads?
3 Likes

Hi. Interesting project! I will look at it closer when I get time. I wrote a work-stealing fork-join threadpool in the beginning of this year and I just wanted to link it here in case you wanted some more inspiration. It does not solve the problem with panics and it also has some other shortcomings, I wrote it before I knew very much Rust and before it had a stable release :smile:

Repo: https://github.com/faern/forkjoin
Benchmarking repo: https://github.com/faern/forkjoin-benchmarking
Thesis: http://publications.lib.chalmers.se/records/fulltext/219016/219016.pdf

If you consider the pool a global resource, you probably want to catch the panic and keep the thread running to handle other jobs... ideally, you have a join() method which lets the caller handle a panic (like JoinHandle in std::thread - Rust).

I don't think there's any problem here? Obviously, if the scoped() callback panics, you need to wait for all pending jobs to terminate, but that doesn't seem complicated.

Thanks for the link. I remember hearing something about that a long time ago but couldn't seem to find it when I searched. I'll be sure to check it out

The main issue I was having with implementing it in that manner were double-panics. In general, you can have a Scope object which, in its destructor, sends a message to all threads to finish all their jobs. That destructor will run, even if the thread where the Scope resides runs a panicky job. Since we want the one-thread case to work, we need that same thread to keep running jobs, any of which could panic again and bring the whole thread down.

Part of the issue stems from the design decision to be able to use the main thread as a job worker. It may be more prudent to necessitate at least one worker thread, while giving the main thread a queue only for other threads to steal from. If jobs can arbitrarily panic, it just comes down to chance whether that job runs on a restartable worker thread or brings the whole job system down from the main thread. Especially so, since an RNG is used to select queues to steal from.

I realize some of my panic-safety woes were a direct result of allocating jobs in thread-local Boxes which would be dropped when panicking, leaving the pointers in the other threads' job queues dangling. That was just naive design in the prototype on my part, and I've been implementing a global, Arc'd queue and cutting down on unsafe code in another branch.

It's important to me that users be able to "release" a job to run at an arbitrary time in its lifetime or join it manually. What I consider secondary is that users be able to discover when a thread panics. That necessarily requires a couple things:

  1. When submitting a job, you create a Sender/Receiver pair for sending job status to the JoinHandle. The message sent would be very small, something like
    enum JobExitStatus { Success, Panic(Box<Any>) }
    would do the trick. I don't know how expensive Sender and Receiver are, though, and an eventual goal for this library is to be largely lock-free and very performant. It's definitely more expensive than a single atomic store.
  2. Catching panics at the job level. I haven't fully comprehended the RFC to amend catch_panic yet, but until that lands (and if this is to work on stable in any capacity), you have to box up a closure calling the already-boxed job function with the worker as a parameter and then transmute its lifetime to 'static and nest calling it in catch_panic. If/When the recover changes land, my understanding is that it would preclude closures wrapping mutable references or RefCells from being called, which would only make implementing a scoping mechanism more difficult.

A question the above raises is whether this is an appropriate use of catch_panic at all. The doc page seems to imply that this is, in general, to be used to wrap possibly-panicking functions across FFI boundaries.

If you want to be fancy, you can probably stuff a JobExitStatus into an atomic pointer...

Seems appropriate to me: you're trying to expose semantics roughly equivalent to thread::spawn, which uses catch_panic in a similar way.

Also, slightly off-topic but if you're looking for inspiration for inspiration for a work-stealing thread pool, Servo has one: https://github.com/servo/servo/blob/master/components/util/workqueue.rs . (IIRC it isn't panic-safe.)

1 Like