Hi everyone! I would like to call in the other multi-threading nerds of u.r.l.o, as I just had a crazy idea which sounds too good to be true, and would like others to cross-check it. If I'm not wrong, this would have the potential to dramatically simplify the development of lock-free algorithms in Rust, one step beyond what Crossbeam already achieved, by making QSBR (Quiescent State Based Reclamation) both safe and practical.
And as for those who fall less into the multi-threading nerd category, don't worry: I will explain everything along the way, so that everyone can check whether my understanding is correct.
Lock-free algorithms and memory reclamation
Lock-free thread synchronization protocols are gaining popularity in scenarios where data is mostly read and infrequently written, as well as in some scenarios where shared data structures must scale as far as possible to high write rates while avoiding the inherent limitations of locks (well, up to the fundamental limit that modern CPU caching mechanisms are generally hostile to write-heavy workloads, anyway).
At the heart of most lock-free algorithms lies the following observation: the only reason why locking is needed in multi-threaded programs is that threads must never observe half-written data (a scenario known as a "data race"). If instead of modifying data in place, a writer thread can produce a modified version in private storage and atomically "swap in" a pointer to the modified version in shared storage, then no lock is required for reading and writing, and this allows eliminating many of the historical disadvantages of mutexes, at the price of encountering new problems specific to lock-free designs:
- Increased memory usage (multiple versions of the data may coexist in RAM)
- More indirection in the data structure, reducing CPU cache efficiency (as atomic writes of large data blocks require pointer or index indirection on current CPU generations)
- Extra writer work (if in-place modification is more efficient than writing a fresh copy).
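To make this core move more concrete, here is a minimal sketch of a single shared cell whose current version is published through an atomic pointer swap. The `Shared` type and its methods are made up for illustration, and reclamation of the old version is deliberately left out, because that is exactly the problem discussed next:

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

// A shared cell holding the current version of some data (illustrative only).
struct Shared<T> {
    current: AtomicPtr<T>,
}

impl<T> Shared<T> {
    fn new(value: T) -> Self {
        Shared {
            current: AtomicPtr::new(Box::into_raw(Box::new(value))),
        }
    }

    // Readers just follow the pointer, without taking any lock.
    fn read<R>(&self, f: impl FnOnce(&T) -> R) -> R {
        let ptr = self.current.load(Ordering::Acquire);
        // SAFETY: only sound if old versions are never freed while a reader
        // may still hold this pointer -- the memory reclamation problem.
        f(unsafe { &*ptr })
    }

    // The writer builds the new version privately, then publishes it with a
    // single atomic swap. The returned old version is now garbage... but
    // when can it be freed?
    fn publish(&self, value: T) -> *mut T {
        let new = Box::into_raw(Box::new(value));
        self.current.swap(new, Ordering::AcqRel)
    }
}
```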
There are, however, two hard problems that one needs to face when writing general-purpose lock-free algorithms which deal with multiple readers or multiple writers:
- The first one is the ABA problem, which essentially revolves around the difficulty of detecting concurrent modifications in the joint presence of multiple writers and reuse of memory allocations (how does a writer know if another writer got there first?).
- The second one is the memory reclamation problem, which revolves around the difficulty of detecting when a piece of data can be discarded in the presence of multiple readers (i.e. when all readers which were reading from it at the time when a new version was swapped in are done with it).
I think I might just have gotten a very interesting idea pertaining to the latter problem.
From memory reclamation to QSBR
The difficulty of devising a good lock-free memory reclamation scheme lies in the fact that said scheme must not re-introduce the reader-side scalability problems that we set out to do away with by getting rid of mutexes.
This is, for example, the reason why good old reference counting does not perform very well at this task: in a concurrent environment, reference counting means incrementing an atomic counter whenever a reader accesses a data structure node and decrementing said counter (or incrementing another) whenever the reader is done with the node. This ends up having prohibitive synchronization overhead on data structures with lots of nodes, such as linked lists. As a result, lock-free reference counting is only efficient in specific scenarios, and not a generally applicable algorithm.
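As a toy illustration of that per-node cost (plain `Arc`, not a full lock-free reference counting scheme, just to show the atomic traffic):

```rust
use std::sync::Arc;

// A reference-counted list node (illustrative only).
struct Node {
    value: u64,
    next: Option<Arc<Node>>,
}

// A purely read-only traversal: every step clones one Arc (an atomic
// increment) and drops the previous one (an atomic decrement), so the
// synchronization cost grows with the number of nodes visited.
fn sum(head: Arc<Node>) -> u64 {
    let mut total = 0;
    let mut cursor = Some(head);
    while let Some(node) = cursor {
        total += node.value;
        cursor = node.next.clone();
    }
    total
}
```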
As most performance evaluations reveal, the best-performing lock-free memory reclamation scheme is generally, by a good margin, Quiescent State Based Reclamation (QSBR), which operates as follows (a rough sketch in code follows the list):
- The writer knows how many readers are around
- Readers periodically signal that they are not reading any lock-free data structures ("quiescent states")
- Whenever a new version of data is swapped in, the writer puts the old version of the data in a garbage list and starts a "grace period", waiting for all readers to be done with the old data.
- Once all readers have gone through a quiescent state, the old data is guaranteed not to be accessed anymore and can be discarded.
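In code, the protocol could look roughly like this single-writer sketch (all names are made up, and a real implementation would need to handle reader registration, counter wrap-around, and batching of the garbage list):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// `reader_epochs[i]` counts the quiescent states announced by reader thread `i`.
struct Qsbr {
    reader_epochs: Vec<AtomicU64>,
}

// An old version awaiting reclamation, together with a snapshot of every
// reader's counter taken at the moment it was retired.
struct Retired<T> {
    data: Box<T>,
    snapshot: Vec<u64>,
}

impl Qsbr {
    // Called by reader `id` whenever it is *not* reading any lock-free
    // data structure (the quiescent state).
    fn quiescent(&self, id: usize) {
        self.reader_epochs[id].fetch_add(1, Ordering::Release);
    }

    // Called by the writer right after swapping in a new version: the old
    // version goes on the garbage list and a grace period starts.
    fn retire<T>(&self, old: Box<T>) -> Retired<T> {
        Retired {
            data: old,
            snapshot: self
                .reader_epochs
                .iter()
                .map(|epoch| epoch.load(Ordering::Acquire))
                .collect(),
        }
    }

    // The grace period is over once every reader has announced at least one
    // quiescent state since retirement; only then may the old data be freed.
    fn try_free<T>(&self, retired: Retired<T>) -> Option<Retired<T>> {
        let grace_period_over = self
            .reader_epochs
            .iter()
            .zip(&retired.snapshot)
            .all(|(epoch, &snap)| epoch.load(Ordering::Acquire) > snap);
        if grace_period_over {
            drop(retired.data); // no reader can still be looking at it
            None
        } else {
            Some(retired) // not yet: keep it on the garbage list
        }
    }
}
```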
The fundamental reason why QSBR works so well is that readers can choose when they communicate with other threads. They don't need to do so on every access to the lock-free data structure. Communication can be spread as far apart in time as needed for good synchronization performance, at the cost of increased RAM usage since old data blocks will take more time to be discarded.
Unfortunately, QSBR is hard to use in practice because real-world code does not have clearly defined quiescent states, and signalling them manually exposes the application to two risks:
- Memory unsafety, which occurs if a quiescent state is incorrectly signalled when data is being accessed inside of lock-free data structures
- Memory leaks, which occur if readers forget to signal quiescent states.
For this reason, safety-minded people usually need to fall back to worse-performing schemes which force threads to communicate with each other on each lock-free transaction, such as Epoch-Based Reclamation (as used by crossbeam)... or lock-free reference counting.
Enter my crazy idea, which is to build a thread pool with QSBR in mind and leverage Rust's borrow checker to guarantee that the use of QSBR is memory-safe.
A thread pool interlude
Thread pools are almost always the right way to implement thread-based parallelism in throughput-oriented programs. Among other things, well-designed thread pools provide a solution to thread count explosion, CPU oversubscription, load imbalance, and excessive thread start/stop overhead in fork-join programs.
A thread pool works by spawning a finite number of OS threads (usually equal to the number of CPU cores/threads on the host), and having each of these act like a primitive batch OS scheduler, fetching jobs from a queue and running them from start to finish in a loop.
This means that thread pool jobs must follow a couple of extra rules compared to raw OS threads in order to be efficient (essentially, avoiding any action that would cause the underlying OS thread to block). But that need not concern us here. The important part is that whenever we use a thread pool, we guarantee by design that no new job will be started by the host OS thread until the current job is finished.
Therefore, thread pools have two desirable properties as far as QSBR is concerned:
- We know how many threads are present in the thread pool.
- With a small tweak in the thread pool implementation, we can tell when each OS thread in the pool is done with its current work and ready to move on to the next job (see the sketch right after this list).
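Concretely, the tweak could be as small as this (a sketch with a toy job queue, where `signal_quiescent` stands for something like the `quiescent` call from the QSBR sketch above):

```rust
use std::sync::mpsc::Receiver;

// A boxed job, as a plain thread pool would run it.
type Job = Box<dyn FnOnce() + Send>;

// The only change with respect to a classic worker loop: between two jobs,
// the worker reports a quiescent state for its OS thread.
fn worker_loop(jobs: Receiver<Job>, signal_quiescent: impl Fn()) {
    for job in jobs {
        job();              // the job runs from start to finish...
        signal_quiescent(); // ...so no lock-free read can be live here
    }
}
```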
Now, add to that a way to prevent a thread from weaseling out a pointer to the lock-free data structures into global state, and we get a trivial way to implement memory-safe QSBR!
Towards QSBR-friendly thread pools?
Wrapping it up, here is my idea in a nutshell (a rough API sketch follows the list):
- Jobs in a thread pool get as input a reference to an external "JobContext" object, which conceptually acts as if it were spawned by the underlying OS thread at the start of a job and destroyed at the end.
- To access a QSBR-based lock-free data structure, jobs must provide a reference to the active JobContext. Any output reference has its lifetime tied to that of the JobContext and is guaranteed by the borrow checker not to outlive it. This means that...
- Thread pool jobs cannot weasel out references to lock-free data structure internals in global variables (or, in general, in state which outlives them).
- Code outside of the thread pool cannot access the lock-free data structure.
- In between two jobs, the thread pool signals to the QSBR implementation that a quiescent state occurred for the corresponding OS thread.
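Here is a rough API sketch of the above. None of these names exist anywhere yet; `JobContext` and `QsbrCell` are made up, and the safety comment spells out the invariant that the thread pool would have to uphold:

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

// Handed to each job by the thread pool. User code cannot construct one,
// and it conceptually lives only for the duration of a single job.
pub struct JobContext {
    _private: (),
}

// A shared value managed by the QSBR scheme.
pub struct QsbrCell<T> {
    current: AtomicPtr<T>,
    // ... plus a garbage list of retired versions, not shown ...
}

impl<T> QsbrCell<T> {
    // Reading requires proof that we are inside a job: the returned
    // reference is tied to the JobContext, so the borrow checker prevents
    // it from being stashed in global state or otherwise outliving the job.
    pub fn read<'ctx>(&self, _ctx: &'ctx JobContext) -> &'ctx T {
        let ptr = self.current.load(Ordering::Acquire);
        // SAFETY: relies on the pool only freeing retired versions once
        // every worker has finished its current job (i.e. gone through a
        // quiescent state), so `ptr` stays valid for the whole job.
        unsafe { &*ptr }
    }
}
```

Inside a job, `let value = cell.read(ctx);` then refuses to compile if `value` tries to escape the job in any way.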
Can you think of any circumstance where this would not work, or be a bad idea in some way? The main one I can think of myself is that the longer thread pool jobs live, the more memory can accumulate in the QSBR garbage list instead of being freed, causing RAM usage to explode.
This could be addressed by adding a quiescent state signalling method to the JobContext, which takes an &mut reference to said context (ensuring that all references to lock-free data have been dropped) and signals the quiescent state proactively before the end of the thread pool job.
In this way, people who suffer from excessive RAM usage due to overly infrequent quiescent state signalling could safely opt into more frequent manual signalling, as in "classical" QSBR implementations, while everyone else gets a basic guaranteed level of memory reclamation from lock-free objects for free.
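Continuing the hypothetical `JobContext` sketch from above, the opt-in could look like this; the `&mut self` receiver is what forces every reference previously obtained through the context to be dead before the call:

```rust
impl JobContext {
    // Hypothetical early signalling for long-running jobs. Because it takes
    // `&mut self`, the borrow checker guarantees that no reference handed
    // out through this context is still alive at the call site.
    pub fn quiescent(&mut self) {
        // ... tell the QSBR implementation that this OS thread is not
        // reading any lock-free data right now ...
    }
}

// Inside a long-running job that receives `ctx: &mut JobContext`:
//
//     let value = cell.read(ctx);   // borrows `ctx`
//     /* ... use `value` ... */
//     ctx.quiescent();              // only compiles once `value` is no longer used
//     let value2 = cell.read(ctx);  // reading can then safely resume
```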
What do you think?