A crazy idea: QSBR-friendly thread pool design

Hi everyone! I would like to call on the other multi-threading nerds of u.r.l.o, as I just had a crazy idea which sounds too good to be true, and I would like others to cross-check it. If I'm not wrong, it would have the potential to dramatically simplify the development of lock-free algorithms in Rust, one step beyond what Crossbeam has already achieved, by making QSBR (Quiescent State Based Reclamation) both safe and practical in Rust.

As for those who fall less into the multi-threading nerd category, don't worry: I will explain everything so that they too can check whether my understanding is correct :slight_smile:

Lock-free algorithms and memory reclamation

Lock-free thread synchronization protocols are gaining popularity in scenarios where data is mostly read and infrequently written, as well as in some scenarios where shared data structures must scale as well as possible to high write rates while avoiding the inherent limitations of locks (well, up to the fundamental limit that modern CPU caching mechanisms are generally hostile to write-heavy workloads, anyway).

At the heart of most lock-free algorithms lies the following observation: the only reason why locking is needed in multi-threaded programs is that threads must never observe half-written data (a scenario known as a "data race"). If, instead of modifying data in place, a writer thread can produce a modified version in private storage and atomically "swap in" a pointer to the modified version in shared storage, then no lock is required for reading or writing (see the sketch after this list). This eliminates many of the historical disadvantages of mutexes, at the price of encountering new problems specific to lock-free designs:

  • Increased memory usage (multiple versions of the data may coexist in RAM)
  • More indirection in the data structure, reducing CPU cache efficiency (as atomic writes of large data blocks require pointer or index indirection on current CPU generations)
  • Extra writer work (if in-place modification is more efficient than writing a fresh copy).
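
To make the core trick concrete, here is a minimal sketch of the "swap in a pointer to a new version" idea using the standard library's AtomicPtr. Everything here (Config, CURRENT, publish) is a made-up name for illustration, and the old version is deliberately leaked, which is precisely the reclamation problem discussed below:

use std::sync::atomic::{AtomicPtr, Ordering};

// Hypothetical shared data.
struct Config { threshold: u32 }

// Shared slot holding the current version of the data.
static CURRENT: AtomicPtr<Config> = AtomicPtr::new(std::ptr::null_mut());

fn publish(new_version: Config) {
    // Build the new version in private storage...
    let new_ptr = Box::into_raw(Box::new(new_version));
    // ...then atomically swap it into the shared slot: readers observe either
    // the old version or the new one, never a half-written mix.
    let old_ptr = CURRENT.swap(new_ptr, Ordering::AcqRel);
    // When is it safe to free old_ptr? Some reader may still be using it.
    // Leaked here for simplicity; this is the memory reclamation problem.
    let _ = old_ptr;
}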

There are, however, two hard problems that one needs to face when writing general-purpose lock-free algorithms which deal with multiple readers or multiple writers:

  • The first one is the ABA problem, which essentially revolves around the difficulty of detecting concurrent modifications in the joint presence of multiple writers and reuse of memory allocations (how does a writer know if another writer got there first?).
  • The second one is the memory reclamation problem, which revolves around the difficulty of detecting when a piece of data can be discarded in the presence of multiple readers (i.e. when all readers which were reading from it at the time when a new version was swapped in are done with it).

I think I might just have gotten a very interesting idea pertaining to the latter problem.

From memory reclamation to QSBR

The difficulty of devising a good lock-free memory reclamation scheme lies in the fact that said scheme must not re-introduce the reader-side scalability problems that we set out to eliminate by getting rid of mutexes.

This is, for example, the reason why good old reference counting does not perform very well at this task: in a concurrent environment, reference counting means incrementing an atomic counter whenever a reader accesses a data structure node and decrementing said counter (or incrementing another) whenever the reader is done with the node. This ends up having prohibitive synchronization overhead on data structures with lots of nodes, such as linked lists. As a result, lock-free reference counting is only efficient in specific scenarios, and not a generally applicable algorithm.
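
To illustrate where that overhead comes from, here is a deliberately naive (and not actually lock-free) traversal of an Arc-based linked list: every hop clones and later drops an Arc, i.e. performs two atomic read-modify-write operations per node, which concurrent readers then contend on:

use std::sync::Arc;

struct Node<T> {
    value: T,
    next: Option<Arc<Node<T>>>,
}

// Every step clones (atomic increment) and later drops (atomic decrement) an
// Arc, so concurrent readers keep bouncing the nodes' counters around.
fn find<T: PartialEq>(mut cursor: Option<Arc<Node<T>>>, needle: &T) -> Option<Arc<Node<T>>> {
    while let Some(node) = cursor {
        if node.value == *needle {
            return Some(node);
        }
        cursor = node.next.clone();
    }
    None
}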

As revealed by most performance evaluations, the best-performing lock-free memory reclamation scheme is generally, by a good margin, Quiescent State Based Reclamation (QSBR), which operates as follows:

  • The writer knows how many readers are around
  • Readers periodically signal that they are not reading any lock-free data structures ("quiescent states")
  • Whenever a new version of data is swapped in, the writer puts the old version of the data in a garbage list and starts a "grace period", waiting for all readers to be done with the old data.
  • Once all readers have gone through a quiescent state, the old data is guaranteed not to be accessed anymore and can be discarded.

The fundamental reason why QSBR works so well is that readers can choose when they communicate with other threads. They don't need to do so on every access to the lock-free data structure. Communication can be spread as far apart in time as needed for good synchronization performance, at the cost of increased RAM usage since old data blocks will take more time to be discarded.
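
Here is a minimal, hypothetical skeleton of what the bookkeeping behind QSBR could look like (single writer, fixed set of readers; all names are made up):

use std::sync::atomic::{AtomicU64, Ordering};

// One counter per reader thread, bumped at each quiescent state.
struct Qsbr {
    quiescent_counts: Vec<AtomicU64>,
}

// One entry in the writer's garbage list: an old version of the data plus a
// snapshot of the reader counters taken when it was retired.
struct Retired<T> {
    old_version: Box<T>,
    snapshot: Vec<u64>,
}

impl Qsbr {
    // Called by reader `id` whenever it is not touching any lock-free data.
    fn quiescent_state(&self, id: usize) {
        self.quiescent_counts[id].fetch_add(1, Ordering::Release);
    }

    // The old version may be dropped once every reader has gone through at
    // least one quiescent state since it was retired.
    fn grace_period_elapsed<T>(&self, retired: &Retired<T>) -> bool {
        self.quiescent_counts
            .iter()
            .zip(&retired.snapshot)
            .all(|(now, then)| now.load(Ordering::Acquire) > *then)
    }
}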

Unfortunately, QSBR is hard to use in practice because real-world code does not have clearly defined quiescent states, and signalling them manually exposes the application to two risks:

  • Memory unsafety, which occurs if a quiescent state is incorrectly signalled when data is being accessed inside of lock-free data structures
  • Memory leaks, which occur if readers forget to signal quiescent states.

For this reason, safety-minded people usually need to fall back on worse-performing schemes which force threads to communicate with each other on each lock-free transaction, such as Epoch-Based Reclamation (as used by crossbeam)... or lock-free reference counting.

Enter my crazy idea, which is to build a thread pool with QSBR in mind and leverage Rust's borrow checker to guarantee that the use of QSBR is memory-safe.

A thread pool interlude

Thread pools are almost always the right way to implement thread-based parallelism in throughput-oriented programs. Among other things, well-designed thread pools provide a solution to thread count explosion, CPU oversubscription, load imbalance, and excessive thread start/stop overhead in fork-join programs.

A thread pool works by spawning a finite number of OS threads (usually equal to the number of CPU cores/threads on the host), and having each of these act like a primitive batch OS scheduler, fetching jobs from a queue and running them from start to finish in a loop.

This means that thread pool jobs must follow a couple of extra rules, compared to OS threads, in order to be efficient (essentially, avoid any action that would cause the underlying OS thread to block). But that need not concern us here. The important part is that whenever we use a thread pool, we guarantee by design that no new job will be started by the host OS thread until the current job is finished.

Therefore, thread pools have two desirable properties as far as QSBR is concerned:

  • We know how many threads are present in the thread pool.
  • With a small tweak in the thread pool implementation, we can tell when each OS thread in the pool is done with its current work and ready to move to the next job.

Now, add to that a way to prevent a thread from weaseling out a pointer to the lock-free data structures in global state, and we get a trivial way to implement memory-safe QSBR!
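
In other words, the only tweak needed compared to a classic worker loop is one extra call between two jobs. A hypothetical sketch, reusing the Qsbr type from the earlier skeleton:

use std::sync::mpsc::Receiver;

fn worker_loop(worker_id: usize, jobs: Receiver<Box<dyn FnOnce() + Send>>, qsbr: &Qsbr) {
    for job in jobs {
        job(); // run the job from start to finish, without preemption
        // Between two jobs, this thread cannot hold any reference into the
        // lock-free data structures, so reporting a quiescent state is safe.
        qsbr.quiescent_state(worker_id);
    }
}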

Towards QSBR-friendly thread pools?

Wrapping it up, here is my idea in a nutshell (a rough API sketch follows the list):

  • Jobs in a thread pool get as input a reference to an external "JobContext" object, which conceptually acts as if it were spawned by the underlying OS thread at the start of a job and destroyed at the end.
  • To access a QSBR-based lock-free data structure, jobs must provide a reference to the active JobContext. Any output reference has its lifetime tied to that of the JobContext, and is guaranteed by the borrow checker not to outlive it. This means that...
    1. Thread pool jobs cannot weasel out references to lock-free data structure internals in global variables (or, in general, in state which outlives them).
    2. Code outside of the thread pool cannot access the lock-free data structure.
  • In between two jobs, the thread pool signals the QSBR implementation that a quiescent state occurred for the corresponding OS thread.
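
Here is the rough API sketch announced above; all names (JobContext, LockFreeCell, read) are made up for illustration:

use std::marker::PhantomData;

// Conceptually created by the worker thread when a job starts and dropped
// when the job ends.
struct JobContext {
    worker_id: usize,
}

struct LockFreeCell<T> {
    // Would hold an AtomicPtr<T> plus a handle to the QSBR state in practice.
    _marker: PhantomData<T>,
}

impl<T> LockFreeCell<T> {
    // The returned reference borrows the JobContext, so the borrow checker
    // prevents it from being stashed anywhere that outlives the current job.
    fn read<'job>(&'job self, _ctx: &'job JobContext) -> &'job T {
        unimplemented!()
    }
}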

Can you think of any circumstance where this would not work, or be a bad idea in some way? The main one which I can think of myself is that the longer-lived thread pool jobs are, the more memory can accumulate in the QSBR garbage list instead of being freed, causing RAM usage explosion.

This could be addressed by adding a quiescent state signalling method to the JobContext, which takes an &mut reference to said context (ensuring that all references to lock-free data have been dropped) and signals the quiescent state proactively before the end of the thread pool job.
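
Extending the hypothetical JobContext sketched above, such a method could look like this; taking &mut self is what forces all outstanding borrows of lock-free data to be gone:

impl JobContext {
    // Every reference into lock-free data borrows this context, so requiring
    // &mut self guarantees that all such references have been dropped.
    fn quiescent_state(&mut self) {
        // ...tell the QSBR implementation that this thread is quiescent...
        unimplemented!()
    }
}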

In this way, people who suffer from excessive RAM usage due to overly infrequent quiescent state signalling could safely opt into more frequent manual signalling, as in "classical" QSBR implementations, while everyone else gets a basic guaranteed level of memory reclamation from lock-free objects for free.

What do you think?


From Rayon's perspective, I'm not exactly sure what "in between two jobs" means. I suspect you don't actually mean between the jobs as Rayon defines them, as this would include all the sub-jobs you get from recursive splitting.

Maybe you mean when a particular thread has drained its local queue entirely? If so, would it work for you to have a generic quiescent callback for that? That way we don't have to know anything about specific designs like your JobContext, instead just giving you a hook in the right place.


Hmm... I did not consider Rayon's ability to recursively spawn tasks yet. It is true that this needs a better mental model than the "FIFO job queue" which I had in mind when writing this post. Are you thinking about something like the following scenario?

rayon::join(|| {
  let my_ref = /* Acquires inner reference to lock-free struct */;
  rayon::join(|| {
    /* Do something with my_ref */
  }, || {
    /* Do something else with my_ref */
  });
  /* The data referenced by my_ref becomes reclaimable at the end of this scope/job */
}, || {
  /* Some unrelated job */
});

I see two possible issues with waiting for the thread's local queue to be empty before signalling a quiescent state, instead of using an explicit context object to keep track of lock-free transactions:

  • It could delay quiescent state signalling, and thus memory reclamation, very far beyond the point where it is safe to do so (at the end of the first join() arm here).
  • Perhaps more importantly, using an implicit "empty local queue" hook instead of an explicit context object does not protect against Rayon jobs weaseling out references to the QSBR lock-free data structure, which in turn could cause memory unsafety:
let lock_free_struct = /* ... */;
let mut evil_vec = Vec::new();
rayon::join(|| {
  let evil_ref = lock_free_struct.get_internal_reference();
  evil_vec.push(evil_ref);
  // Rayon would incorrectly signal a quiescent state here, possibly
  // resulting in the data pointed to by evil_ref being freed.
}, || {});

I think a more promising hook would probably be something like crossbeam's Scope::defer(), which would schedule work to be executed by the thread pool at the end of the active scope/job even in the event of a panic. Given such a hook, and another to know the thread pool size, all the rest of the QSBR logic (creating the context object, managing recursive scopes, signaling quiescent states...) could probably be handled in a separate crate.


EDIT: On second thought, that wouldn't work either. It wouldn't achieve the desired semantics of having threads which do not read from lock-free data structures periodically signal quiescent states. Hmm... these nested jobs are tricky.

Here is another attempt at defining the desired semantics: by default, all jobs, no matter at which nesting level, report a quiescent state at the end. But this behaviour is inhibited if a higher-level job holds a reference to lock-free data.

Need to think some more if this works, and if it could be implemented in Rayon with minimal extension hooks :slight_smile:

After sleeping on this a bit, I think that as currently spelled out, my idea cannot work on Rayon's thread pool, because the design of that thread pool effectively allows one job to temporarily preempt another. This in turn violates a core assumption of my design, namely that thread pool workers run jobs from start to finish uninterrupted.

To see this in action, consider the following code:

rayon::join(|| {
    // A: Acquire reference to lock-free struct
    rayon::join(|| {
        // B: A random subjob
    }, || {
        // C: Another subjob **that takes longer**
    });
    // D: Use reference to lock-free struct
    // E: Job finishes, quiescent state is signalled
}, || {
    // F: Write to the lock-free data structure (signals quiescent state)
});

Now, assume a pool with 2 threads...

  1. Thread 1 reaches point A, acquires a reference to the lock-free struct.
  2. Thread 2 reaches point F, modifies the lock-free struct and signals a quiescent state.
  3. Thread 1 schedules C and starts working on B.
  4. Thread 2 steals C and starts working on it.
  5. Thread 1 finishes B first, notices that C is not finished and that D thus cannot be executed yet, and starts working on an unrelated job which eventually completes and signals a quiescent state.
  6. At that point in time, every thread in the pool has signalled a quiescent state, so the old object retired from the lock-free struct is freed.
  7. Thread 2 finishes C.
  8. Thread 1 resumes and executes D on a dangling reference.

Basically, the heart of the issue is that at step 5, thread 1's current job effectively gets preempted by another, which violates the "no preemption" assumption that I made when spelling out my design.

It is worth noting that step 4 is also somewhat problematic, if not incorrect, because it effectively violates a core assumption of QSBR, namely that a thread cannot access stale state after signalling a quiescent state. This is important, because it means that my idea will never work in a thread pool runtime that allows for continuation stealing like Cilk: on such a runtime, D can be rescheduled anywhere, including on thread 2, and that would be fatal for memory safety.

Of course, Rayon does not currently perform continuation stealing, and doing so would require deeper integration with the Rust compiler (e.g. to tell whether the active job's stack is Send and move it around). But with generators now available and used for asynchronous programming, we are getting dangerously close to such integration being possible. So I should at least future-proof against Rayon moving to continuation stealing someday.


In Rayon's current implementation, where continuations are eventually executed on the thread they originated from, I might get away with just using a different set of rules for quiescent state signalling (see the sketch after this list):

  • Whenever a thread acquires a reference to lock-free data, it increments a thread-local counter.
  • Once all references are guaranteed to be dropped (i.e. when the JobContext is dropped), the counter is decremented.
  • A thread may only signal a quiescent state on job boundaries if the thread-local counter is zero.
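
A hypothetical sketch of these rules, again reusing the Qsbr type from the earlier skeleton; the counter lives in a thread-local Cell, and an RAII guard keeps it in sync with the lifetime of the JobContext-like object:

use std::cell::Cell;

thread_local! {
    // Per-thread count of live lock-free "read sections".
    static LOCK_FREE_DEPTH: Cell<usize> = Cell::new(0);
}

// Created when a job acquires access to lock-free data, dropped together
// with the last reference into that data.
struct LockFreeGuard;

impl LockFreeGuard {
    fn enter() -> Self {
        LOCK_FREE_DEPTH.with(|depth| depth.set(depth.get() + 1));
        LockFreeGuard
    }
}

impl Drop for LockFreeGuard {
    fn drop(&mut self) {
        LOCK_FREE_DEPTH.with(|depth| depth.set(depth.get() - 1));
    }
}

// Called by the thread pool on job boundaries: only report a quiescent state
// if no enclosing job on this thread still holds lock-free references.
fn maybe_signal_quiescent_state(qsbr: &Qsbr, worker_id: usize) {
    if LOCK_FREE_DEPTH.with(|depth| depth.get()) == 0 {
        qsbr.quiescent_state(worker_id);
    }
}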

These are pretty much the rules used by the RCU algorithm to handle nested scopes, modulo the fact that I think Rust's ownership and borrowing system, when combined with the "no continuation migration across threads" guarantee, allows us to safely send references to the data to other threads instead of keeping them to ourselves.

To be safe against possible future evolutions of Rayon and Rust, the JobContext object should be made !Send: this means that any future continuation-stealing implementation that upholds Rust's type system guarantees would be prevented from migrating a continuation holding a JobContext from one thread to another (which would not be compatible with this algorithm).
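
One common way to opt out of Send on stable Rust is a zero-sized raw-pointer marker; extending the earlier hypothetical JobContext:

use std::marker::PhantomData;

struct JobContext {
    worker_id: usize,
    // Raw pointers are !Send, so this zero-sized marker pins the context
    // (and thus any continuation borrowing it) to the thread that created it.
    _not_send: PhantomData<*const ()>,
}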

Since the counter is thread-local and never needs to be read by any other thread, I think it can be made reasonably low-overhead. All I need for an efficient implementation are three ingredients:

  • Being able to spawn an array of counters at the start of the Rayon thread pool, with one counter per worker thread.
  • Having access to this array during the execution of the thread pool.
  • Knowing which thread I am running on as a simple numerical index.

Anything else which I forgot?

Do you really need rayon's help to create your counter array? Why couldn't you just create this independently? We do have current_num_threads() to tell you how many are needed, or you can control this yourself if you're creating a specific QSBR thread pool.

The last point is easy with current_thread_index() -- although we don't re-export that from rayon_core into rayon. We probably should.

Ah, sorry, I was not meaning to enumerate missing Rayon features here, but rather what this proposal would need overall, whether Rayon provides it today or not. Though it is true that current_thread_index() is something which the host thread pool comes best prepared to tell me :slight_smile:

One open question which I would have regarding Rayon's design is whether you have settled on a constant number of threads after all. I got the impression that there were discussions about spinning threads up and down at some point, which is probably where the current_num_threads() naming originates from. How did these go in the end?

Well, the "current" in current_num_threads() also refers to the current thread pool, since there can be more than one. If you're within any thread pool, it returns that pool's thread count, otherwise that of the global pool.

We haven't really decided whether we'll ever dynamically change the thread count, but I think we're leaning towards not, or at least not without the user explicitly asking for that.