So, after implementing triple-buffer, I wondered how hard it would be to make it support multiple consumers, so as to support "broadcast" communication scenarios. The answer, it turns out, was "surprisingly hard". Many design compromises had to be made, and much complexity had to be added. So for the benefit of the other crazy souls writing lock-free algorithms out there, I decided to write up what I learned along the way.
Background: Triple buffering
First, maybe it's worth clearing up what I mean by "triple buffering", because there's unfortunately a bit of terminology confusion surrounding this concept.
To many graphics developers, triple buffering means using a swap chain that is three buffers long, i.e. rendering at most two frames ahead of the one currently displayed on screen, so as to ensure a more consistent framerate at the cost of higher latency. However, what I'm talking about here is not that mechanism, but another graphics programming trick: a lock-free thread synchronization technique that allows a producer thread to constantly update a shared memory block, and a consumer thread to constantly poll it, with the following guarantees:
- The producer and the consumer will never need to wait for one another (bounded wait-freedom)
- The consumer will never fetch an inconsistent data block (updates are transactional)
- The consumer will never fetch a data block which is older than the previous block that it received.
The basic synchronization constructs look like this:
- Let there be three buffers B1, B2, B3, each able to store one copy of the data being shared.
- At any point in time, the producer has exclusive access to one of these buffers, called the "write buffer", and the consumer has exclusive access to another buffer, called the "read buffer". The third buffer remains unused; we will call it the "back buffer".
- The index of the back buffer is stored in a shared integer, which both the reader and writer can access through atomic operations.
- For reasons that will later become clear, this shared integer is actually a bitfield, where one extra bit is used to denote whether the producer or the consumer last wrote to it. We will call that bit the "dirty bit".
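To make the rest of the discussion more concrete, here is a rough sketch of what this shared state could look like in Rust. The names and the exact bit layout are of my own choosing, and the real triple-buffer crate differs in its details:

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::AtomicUsize;

// Hypothetical bit layout: the two low bits hold the index of the back
// buffer (0, 1 or 2), and one extra bit serves as the dirty bit.
const INDEX_MASK: usize = 0b011;
const DIRTY_BIT: usize = 0b100;

struct SharedState<T> {
    // B1, B2, B3: one slot per copy of the shared data.
    buffers: [UnsafeCell<T>; 3],
    // Back buffer index + dirty bit, accessed atomically by both threads.
    back_info: AtomicUsize,
}

// Safety: at any point in time, each buffer is only accessed by the one
// thread that currently owns it, which is what the index-swapping protocol
// described below is meant to guarantee.
unsafe impl<T: Send> Sync for SharedState<T> {}
```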
When the producer wants to update the shared data, it does the following:
- First, it copies/moves the updated data into its private write buffer.
- Then it atomically swaps the index of the write buffer with that of the back buffer, and sets the dirty bit.
- At this point, the value that was just written lies in the back buffer, available for the consumer to fetch, and the producer has a new write buffer (the former back buffer) to write to whenever it feels like it.
- If the producer submits several updates in a row, without the consumer doing anything, the back-buffer will always end up containing the most recent update that was committed by the producer.
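In code, that publication step boils down to a single atomic swap on the shared bitfield. Here is a hedged sketch, reusing the hypothetical constants from the previous snippet (the memory orderings follow my reading of the algorithm, not the actual crate sources):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Same hypothetical bit layout as in the previous sketch.
const INDEX_MASK: usize = 0b011;
const DIRTY_BIT: usize = 0b100;

/// Called by the producer after it has finished writing into its write
/// buffer. `back_info` is the shared bitfield, `write_idx` is the
/// producer's private record of which buffer index it currently owns.
fn publish(back_info: &AtomicUsize, write_idx: &mut usize) {
    // Swap our write buffer index (with the dirty bit set) for the current
    // back buffer index. Release publishes the data we just wrote, Acquire
    // synchronizes with the consumer that last released the buffer we are
    // about to take over as our next write buffer.
    let former_back = back_info.swap(*write_idx | DIRTY_BIT, Ordering::AcqRel);
    // The former back buffer becomes our new write buffer.
    *write_idx = former_back & INDEX_MASK;
}
```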
When the consumer wants to access the latest available version of the shared data, it does the following:
- First, it checks whether the dirty bit is set, to know whether new data is available in the back buffer.
- If not, the consumer simply accesses the read buffer, which is the most recent data available.
- If new data is available, the consumer acquires ownership of it by atomically swapping the index of the read buffer with that of the back-buffer, and clearing the dirty bit.
- At this point, the consumer has access to the most recent update that was committed by the producer to the back-buffer, and the producer has a new back-buffer to write to.
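The consumer-side operation can be sketched in the same style (same caveats as before):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Same hypothetical bit layout as in the previous sketches.
const INDEX_MASK: usize = 0b011;
const DIRTY_BIT: usize = 0b100;

/// Called whenever the consumer wants the freshest available data.
/// Returns the index of the buffer that it should read from.
fn refresh(back_info: &AtomicUsize, read_idx: &mut usize) -> usize {
    // Cheap check first: if the dirty bit is clear, nothing new has been
    // published and the current read buffer is still the latest version.
    if back_info.load(Ordering::Relaxed) & DIRTY_BIT != 0 {
        // Take ownership of the back buffer (and clear the dirty bit) by
        // swapping our read buffer index into the shared bitfield. AcqRel
        // pairs with the producer's swap, making the new data visible to us.
        let former_back = back_info.swap(*read_idx, Ordering::AcqRel);
        *read_idx = former_back & INDEX_MASK;
    }
    *read_idx
}
```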
As you can see, this is quite an elegant synchronization algorithm when one wants to keep the illusion of a "shared variable" between a producer thread and a consumer thread, without dealing with data races or the many perils of blocking synchronization. It's simple, efficient (if a bit memory-intensive), and it only requires atomic swap operations and acquire/release memory barriers.
Now, let's try to extend it to multiple consumers.
First attempt: Mimicking the Linux RCU design
If you've spent a bit of time reading about lock-free algorithms, you must have heard about the Read-Copy-Update mechanism, which is one of the crown jewels of the Linux kernel. RCU promises wait-free access to shared data structures for any number of readers, with zero synchronization overhead (well, initial fetch from the writer's CPU cache aside), even on relaxed-consistency memory architectures like POWER or ARM.
On paper, this certainly sounds very appealing. Now, where's the catch? To understand why some dirty trick has to be applied somewhere, let us put ourselves in the place of the producer thread:
- Initially, there is some version of the data structure in memory, accessible to both the producer and the consumers by a shared (atomic) pointer.
- To update the data structure, the producer creates a modified copy, then it atomically swaps the shared pointer with a pointer to the updated data structure.
- At this point, all new incoming consumer threads that fetch the shared pointer will see the new version of the data structure, without requiring any special memory barrier even on ARM or POWER...
- ...but what should the producer do with the old version of the data structure, which is potentially still in use by many readers?
Obviously, readers need to somehow communicate to the producer that they are not using the old data structure anymore. If they do so explicitly, by modifying a shared atomic variable with an appropriate memory barrier, then we're not really talking about zero synchronization overhead anymore. So the producer needs to know that the data structure fell out of use through some strange side-channel communication, which falls outside the normal mechanisms used by CPU cores to communicate.
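To make the issue more tangible, here is a minimal Rust sketch of the pointer-publication half of the RCU idea. Note that Rust has no "consume" memory ordering, so the reader uses an acquire load where RCU implementations rely on data-dependency ordering; the Config type and the two functions are purely illustrative, not any real RCU API:

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

// Illustrative payload type.
struct Config {
    threshold: u32,
}

// Shared pointer to the current version of the data structure.
static CURRENT: AtomicPtr<Config> = AtomicPtr::new(std::ptr::null_mut());

// Producer: build the modified copy, then publish it with a single swap.
fn update(new_version: Config) {
    let new_ptr = Box::into_raw(Box::new(new_version));
    let old_ptr = CURRENT.swap(new_ptr, Ordering::AcqRel);
    // The hard part: `old_ptr` may still be dereferenced by readers that
    // fetched it before the swap, so it cannot be freed here. Knowing when
    // it becomes safe to free it is what quiescent-state tracking is about.
    let _still_in_use_somewhere = old_ptr;
}

// Reader: a single pointer load yields a consistent view of some version.
fn read_threshold() -> Option<u32> {
    let ptr = CURRENT.load(Ordering::Acquire);
    // The null check is only needed because this sketch starts unpublished.
    unsafe { ptr.as_ref() }.map(|config| config.threshold)
}
```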
Enter quiescent states, and a pretty horrible (IMHO) implementation trick.
The Linux RCU implementation declares that an old data structure version is not used anymore once all consumer threads have passed at least once through a quiescent state, in which they are not reading any RCU data structures. To detect quiescent states, the kernel proceeds as follows:
- Freeze the kernel scheduler on a CPU hosting any consumer thread which is accessing an RCU data structure, preventing it from being preempted by another thread or an interrupt handler.
- Whenever the data structure is modified, schedule a task to run on every CPU other than the one occupied by the producer thread.
- Once all these tasks have been run, the producer can be sure that everyone went through a quiescent state, and the old data structure can safely be discarded.
Now, this design is fine if you value performance far above anything else, but it also has many troublesome implications:
- If a consumer hangs while looking up an RCU data structure, that CPU is effectively dead from the Linux kernel's point of view. There is no way to reschedule another thread in its place.
- In the Linux implementation, this also freezes the producer, which is potentially a critical OS process, because a producer must discard the old memory block before it can insert a new one.
- So effectively, a (potentially unimportant) consumer thread can kill the (potentially much more important) producer thread just by forgetting to release the RCU data structure... Or doing so on purpose.
- And because this hack relies on low-level control of the Linux kernel's scheduling mechanism, it cannot be implemented in user-space applications. Instead, less efficient tricks such as incrementing one atomic counter per thread on every quiescent state must be used (a sketch of this approach follows the list).
- In an attempt to amortize the overhead, some user-space implementations leave it up to the application to signal quiescent states whenever it sees fit (e.g. on every GUI event loop iteration). This is effectively a textbook example of abstraction leakage: it hampers future application maintainability through global variable-style spooky action at a distance, and it increases the odds of a producer getting frozen temporarily or permanently by a misbehaving consumer.
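For reference, here is a minimal sketch of the per-thread counter trick mentioned above. This is not how liburcu or any real QSBR library is actually implemented, only an illustration of the principle, and of how a stalled or idle reader can end up blocking the writer:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// One counter per reader thread. A reader bumps its counter every time it
// passes through a quiescent state, i.e. a point where it holds no
// reference to any shared data structure.
struct ReaderState {
    quiescent_count: AtomicU64,
}

impl ReaderState {
    fn quiescent(&self) {
        // Release ordering: everything the reader did before this point
        // (including dropping its last reference to an old version)
        // happens-before the writer observing the new counter value.
        self.quiescent_count.fetch_add(1, Ordering::Release);
    }
}

// Writer side: after unpublishing an old version, wait until every reader
// has gone quiescent at least once before reclaiming that version.
fn wait_for_grace_period(readers: &[ReaderState]) {
    let snapshot: Vec<u64> = readers
        .iter()
        .map(|reader| reader.quiescent_count.load(Ordering::Acquire))
        .collect();
    for (reader, before) in readers.iter().zip(snapshot) {
        // Spin until this reader's counter has moved past the snapshot. A
        // reader that hangs, or simply never reads anything again, blocks
        // the writer forever: this is the robustness problem discussed above.
        while reader.quiescent_count.load(Ordering::Acquire) == before {
            std::hint::spin_loop();
        }
    }
}
```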
Having thought about all this, I decided that RCU was not for me and that I wanted something more robust. I wanted something which can be wait-free for the writer if someone has enough RAM to spare, which can cope with some consumers going idle, which is a self-contained abstraction, and which doesn't require dynamic memory allocation. Even if that comes at the cost of less impressive performance.
And this is how I ended up developing SPMCBuffer.
(To be continued)