"SPMC buffer": triple buffering for multiple consumers


#1

So, after implementing triple-buffer, I wondered how hard it would be to make it support multiple consumers, so as to support “broadcast” communication scenarios. The answer, it turns out, was “surprisingly hard”: many design compromises had to be made, and much complexity had to be added. So for the benefit of the other crazy souls writing lock-free algorithms out there, I decided to write up my experience.

Background: Triple buffering

First, maybe it’s worth clearing up what I mean by “triple buffering”, because there’s unfortunately a bit of terminology confusion surrounding this concept.

To many graphics developers, triple buffering means using a swap chain that is three buffers long, i.e. rendering at most two frames ahead of the frame displayed on screen, so as to ensure a more consistent framerate at the cost of higher latency. However, that mechanism is not what I’m talking about here. I mean another graphics programming trick: a lock-free thread synchronization technique that allows a producer thread to constantly update a shared memory block, and a consumer thread to constantly poll it, with the following guarantees:

  1. The producer and the consumer will never need to wait for one another (bounded wait-freedom)
  2. The consumer will never fetch an inconsistent data block (updates are transactional)
  3. The consumer will never fetch a data block which is older than the previous block that it received.

The basic synchronization constructs look like this:

  • Let there be three buffers B1, B2, B3, each able to store one copy of the data being shared.
  • At any point in time, the producer has exclusive access to one of these buffers, called the “write buffer”, and the consumer has exclusive access to another buffer, called the “read buffer”. One buffer remains unused, we will call it the “back buffer”.
  • The index of the back buffer is stored in a shared integer, which both the reader and writer can access through atomic operations.
  • For reasons that will later become clear, this shared integer is actually a bitfield, where one extra bit is used to denote whether the producer or the consumer last wrote to it. We will call that bit the “dirty bit”.
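To make this more concrete, here is a minimal sketch of how that shared integer could be packed and unpacked. The constant and function names are illustrative, not the actual triple-buffer code:

```rust
// Hypothetical layout: the two low-order bits store the back-buffer index
// (0, 1 or 2), and one extra bit is the dirty bit.
const INDEX_MASK: usize = 0b011; // Back-buffer index
const DIRTY_BIT: usize = 0b100;  // Set on producer writes, cleared on consumer reads

// Combine a back-buffer index and a dirty flag into the shared integer
fn pack(back_index: usize, dirty: bool) -> usize {
    back_index | if dirty { DIRTY_BIT } else { 0 }
}

// Extract the back-buffer index and dirty flag from the shared integer
fn unpack(bits: usize) -> (usize, bool) {
    (bits & INDEX_MASK, bits & DIRTY_BIT != 0)
}
```

In the real implementation, this integer lives in an atomic variable, and packing/unpacking is done inline around atomic swaps.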

When the producer wants to update the shared data, it does the following:

  • First, it copies/moves the updated data into its private write buffer.
  • Then it atomically swaps the index of the write buffer with that of the back buffer, and sets the dirty bit.
  • At this point, the value that was just written lies in the back-buffer, available for the consumer to fetch, and the producer has a new write buffer (the former back buffer) to write to whenever it feels like it.
  • If the producer submits several updates in a row, without the consumer doing anything, the back-buffer will always end up containing the most recent update that was committed by the producer.
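Assuming the same kind of bitfield layout as described above, the producer’s publication step can be sketched as follows (a sketch under those assumptions, not the actual crate code):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

const INDEX_MASK: usize = 0b011; // Back-buffer index
const DIRTY_BIT: usize = 0b100;  // "Producer wrote last" flag

// Producer side: after filling the write buffer, atomically swap it with the
// back buffer while setting the dirty bit. The former back buffer becomes
// the producer's new write buffer.
fn publish(back_info: &AtomicUsize, write_index: &mut usize) {
    let old_bits = back_info.swap(*write_index | DIRTY_BIT, Ordering::AcqRel);
    *write_index = old_bits & INDEX_MASK;
}
```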

When the consumer wants to access the latest available version of the shared data, it does the following:

  • First, it checks whether the dirty bit is set, to know whether new data is available in the back buffer.
  • If not, the consumer simply accesses the read buffer, which is the most recent data available.
  • If new data is available, the consumer acquires ownership of it by atomically swapping the index of the read buffer with that of the back-buffer, and clearing the dirty bit.
  • At this point, the consumer has access to the most recent update that was committed by the producer to the back-buffer, and the producer has a new back-buffer to write to.
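Under the same hypothetical layout, the consumer’s side can be sketched like this (again illustrative names, not the actual implementation):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

const INDEX_MASK: usize = 0b011; // Back-buffer index
const DIRTY_BIT: usize = 0b100;  // "Producer wrote last" flag

// Consumer side: if the dirty bit is set, swap our read buffer with the back
// buffer. Returns true if new data was fetched.
fn refresh(back_info: &AtomicUsize, read_index: &mut usize) -> bool {
    if back_info.load(Ordering::Relaxed) & DIRTY_BIT == 0 {
        return false; // No update, keep reading the current read buffer
    }
    // Swapping in our read buffer index, which has no dirty bit set,
    // simultaneously clears the dirty bit
    let old_bits = back_info.swap(*read_index, Ordering::AcqRel);
    *read_index = old_bits & INDEX_MASK;
    true
}
```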

As you can see, this is quite an elegant synchronization algorithm when one wants to keep the illusion of a “shared variable” between a producer thread and a consumer thread, without dealing with data races or the many perils of blocking synchronization. It’s simple, efficient (if a bit memory-intensive), and it only requires atomic swap operations and acquire/release memory barriers.

Now, let’s try to extend it to multiple consumers.

First attempt: Mimicking the Linux RCU design

If you’ve spent a bit of time reading about lock-free algorithms, you have probably heard of the Read-Copy-Update mechanism, which is one of the crown jewels of the Linux kernel. RCU promises wait-free access to shared data structures by arbitrary numbers of readers, with zero synchronization overhead (well, initial fetch from the writer’s CPU cache aside), even on relaxed-consistency memory architectures like POWER or ARM.

On paper, this certainly sounds very appealing. Now, where’s the catch? To understand why some dirty trick has to be applied somewhere, let us put ourselves in the place of the producer thread:

  • Initially, there is some version of the data structure in memory, accessible to both the producer and the consumers by a shared (atomic) pointer.
  • To update the data structure, the producer creates a modified copy, then it atomically swaps the shared pointer with a pointer to the updated data structure.
  • At this point, all new incoming consumer threads that fetch the shared pointer will see the new version of the data structure, without requiring any special memory barrier even on ARM or POWER…
  • …but what should the producer do with the old version of the data structure, which is potentially still in use by many readers?
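In Rust terms, the publication step can be sketched like this (the function name is mine; reclamation of the returned pointer is deliberately left open, since that is the entire problem):

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

// Sketch of the RCU publication step. The hard part is everything this
// function does NOT do: deciding when the returned old pointer may be freed,
// given that readers may still be using it.
fn rcu_publish<T>(shared: &AtomicPtr<T>, new_version: Box<T>) -> *mut T {
    // Consumers that load `shared` from now on will see the new version
    shared.swap(Box::into_raw(new_version), Ordering::AcqRel)
}
```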

Obviously, readers need to somehow communicate to the producer that they are not using the old data structure anymore. If they do so explicitly, by modifying a shared atomic variable with an appropriate memory barrier, then we’re not really talking about zero synchronization overhead anymore. So the producer needs to know that the data structure fell out of use through some strange side-channel communication, which falls outside the normal mechanisms used by CPU cores to communicate.

Enter quiescent states, and a pretty horrible (IMHO) implementation trick.

The Linux RCU implementation declares that an old data structure version is not used anymore once all consumer threads have passed at least once through a quiescent state, in which they are not reading any RCU data structure. To detect quiescent states, the kernel proceeds as follows:

  • Freeze the kernel scheduler on a CPU hosting any consumer thread which is accessing an RCU data structure, preventing it from being preempted by another thread or an interrupt handler.
  • Whenever the data structure is modified, schedule a task to run on every CPU other than the one occupied by the producer thread.
  • Once all these tasks have been run, the producer can be sure that everyone went through a quiescent state, and the old data structure can safely be discarded.

Now, this design is fine if you value performance far above anything else, but it also has many troublesome implications:

  • If a consumer hangs while looking up an RCU data structure, that CPU is effectively dead from the Linux kernel’s point of view. There is no way to reschedule another thread in its place.
  • In the Linux implementation, this also freezes the producer, which is potentially a critical OS process, because a producer must discard the old memory block before it can insert a new one.
  • So effectively, a (potentially unimportant) consumer thread can kill the (potentially much more important) producer thread just by forgetting to release the RCU data structure… Or doing so on purpose.
  • And because this hack relies on low-level control of the Linux kernel’s scheduling mechanism, it cannot be implemented in user-space applications. Instead, less efficient tricks such as incrementing one atomic counter per thread on every quiescent state must be used.
  • In an attempt to amortize the overhead, some user-space implementations leave it up to the application to signal quiescent states whenever it feels like it (e.g. on every GUI event loop iteration). This is effectively a textbook example of abstraction leakage: it hampers future application maintainability through global-variable-style spooky action at a distance, and increases the odds of a producer getting frozen temporarily or permanently by a misbehaving consumer.

Having thought about all this, I decided that RCU was not for me and that I wanted something more robust. I wanted something which can be wait-free for the writer if someone has enough RAM to spare, which can cope with some consumers going idle, which is a self-contained abstraction, and which doesn’t require dynamic memory allocation. Even if that comes at the cost of less impressive performance.

And this is how I ended up developing SPMCBuffer.

(To be continued)


#2

Lock-free reference counting

After I decided that I was not performance-obsessed enough to go as far as the Linux RCU did, I still needed to figure out a good way to solve my consumer tracking problem. An obvious candidate for these requirements was good old reference counting:

  • When accessing some data, a consumer increments a counter
  • When done, it decrements it
  • When the counter goes to zero, the data can be discarded…
  • …but of course, it should only be discarded if a newer version has replaced it

As always, however, the devil lies in the details:

  1. Where should the reference count be located?
  2. Who should be responsible for discarding unused data?
  3. How could one avoid the significant overhead of two atomic read-modify-write operations and acquire/release memory barriers on every read?

Let’s look at the first two issues.

Reference count location & responsibility

There are two obvious places to put the refcount in:

  • Alongside the version of the shared data that is being accessed
  • Alongside the pointer to the latest version of the shared data

The first option does not work, because with it, the consumer algorithm would in one way or another boil down to the following…

  • Read the shared pointer in order to locate the latest data
  • Increment its reference count
  • Start reading the rest

…and this minimal formulation has a race condition in it: between the time where the consumer has read the shared pointer, and the time where it tries to increment the reference count, the target data structure may have been freed!

The second option allows the reference count to be incremented before the shared data is accessed, which as we have seen is needed. However, because the reference count is not tied to a specific version of the shared data, the other threads only know how many consumers are in flight, not what version of the data each consumer is accessing. This means that the resulting design is equivalent to a reader-priority RWLock: as long as there is at least one consumer in flight (which can be forever if a thread hangs or is scheduled away), one cannot liberate any memory.

Obviously, neither of these options is satisfactory. What we’d like to have instead is a design where a consumer can increment the reference count of the right version of the shared data, without knowing anything about what this “current version” is. But how is that even possible? Bit tricks to the rescue!

Let’s replace the shared pointer by a shared bit field, whose high-order bits indicate where the latest version of the data is located (we’ll later see how to compress this information), and whose low-order bits act as a reference count. Here’s an example layout for a 64-bit shared bitfield:

 63 ............ 32   31   30 ........... 0
+-------------------+----+-----------------+
| Latest buffer ptr | OF | Reference count |
+-------------------+----+-----------------+

Note that because I’m a careful person, I have added an overflow bit to the reference count. This means that in debug mode, I can check that the reference count does not overflow and corrupt the latest buffer location. This may seem a bit overkill for such a large bitfield, but it will come in handy when going for smaller integer sizes (memory bandwidth is not cheap!)

With this design, a consumer can access the latest buffer AND increment its reference count in a single fetch-add atomic operation. And on its side, the producer can know how many consumers have accessed a memory block at the time where it swaps in a new version of the shared data, because as a result of this atomic swap, it will get both the pointer to the old buffer AND the last reference count.
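Here is a sketch of both sides of this protocol, using hypothetical constants matching the layout above (not the actual SPMCBuffer code):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical constants matching the 64-bit layout sketched above
const REFCOUNT_MASK: u64 = (1 << 31) - 1; // Bits 0-30: reference count
const OVERFLOW_BIT: u64 = 1 << 31;        // Bit 31: refcount overflow flag
const INDEX_SHIFT: u32 = 32;              // Bits 32-63: latest buffer location

// Consumer: reserve the latest buffer and learn where it is, in one fetch-add
fn acquire_latest(bitfield: &AtomicU64) -> u64 {
    let old = bitfield.fetch_add(1, Ordering::Acquire);
    debug_assert_eq!(old & OVERFLOW_BIT, 0, "reference count overflowed");
    old >> INDEX_SHIFT
}

// Producer: swap in a new latest buffer (with a fresh refcount of zero),
// getting back the old buffer's location and its incoming reader count
fn publish(bitfield: &AtomicU64, new_location: u64) -> (u64, u64) {
    let old = bitfield.swap(new_location << INDEX_SHIFT, Ordering::AcqRel);
    (old >> INDEX_SHIFT, old & REFCOUNT_MASK)
}
```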

Now, all sweet things come with tradeoffs, and this design has two major drawbacks:

  • We cannot choose who is responsible for discarding old data anymore. It has to be the producer. Which was probably the most sensible choice anyhow:
    • Random memory reclamation pauses in consumers make it hard to reason about performance
    • Only the producer can tell which data is unreachable without any communication
  • Much more importantly, we can only increment the reference count, not decrement it, which effectively makes it more of an “incoming reader count”!

The reason why decrementing the reference count is unsafe is that between the time where a consumer has accessed the latest version of the data structure, and the time where it is done and tries to discard it, a new version of the data structure may have been swapped in by the producer. So trying to decrement the shared bitfield may actually result in decrementing the wrong refcount, and possibly in bitfield underflow!

As an alternative, we will increment a second counter, located inside of the shared data structure, which represents how many consumers are done with it (“outgoing reader count”). In this updated design, we will know that an unreachable data structure is unused and can be discarded once the amount of outgoing readers becomes equal to the amount of incoming readers at the time where it was replaced by the producer.
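The resulting reclamation criterion is simple enough to state in code (a sketch with illustrative names):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Producer-side check for a buffer it has already replaced: the buffer is
// unused once everyone who reserved it (the incoming reader count fetched at
// replacement time) has also declared themselves done with it.
fn can_reclaim(incoming_at_replacement: u64, outgoing_count: &AtomicU64) -> bool {
    outgoing_count.load(Ordering::Acquire) == incoming_at_replacement
}
```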

And this addresses both the questions of how we can reference-count the shared data, and who should be responsible for discarding it.

Reducing consumer usage of fetch-add

Now, I previously mentioned that there is an intrinsic performance issue with lock-free reference counting, which is that asking readers to increment/decrement a refcount on every access, with appropriate acquire/release memory barriers, is…

  1. Intrinsically costly (atomic read-modify-write ops usually take 50~100 CPU cycles, and memory barriers can be even worse on architectures with relaxed consistency like POWER and ARM)
  2. Especially bad in contended scenarios (in cache-coherent architectures, CPU cores can end up moving around the cache line associated with the refcount a lot)

In the specific design which I described above, there is one extra issue with incrementing the refcount on every access: we don’t want to overflow the shared bitfield, and since readers cannot decrement the refcount, such overflow could occur even with few consumers around!

Thankfully, atomic reads are (almost) free on every popular CPU architecture, so we can avoid the overhead and overflow risks of doing a fetch-add on every read quite easily through the following modifications:

  • Consumers do NOT discard the data structure immediately once they are done reading it. Instead, they just cache its location internally.
  • A consumer starts every readout by checking the shared bitfield in order to see if an update has been submitted by the producer.
  • If not, it simply reads from the cached buffer location.
  • If so, it goes through the usual fetch-add process AND discards the old data buffer.

In this design, a consumer will only increment the reference count at most once per producer update, which remains reasonably light on the memory bus. And we have solved our overflow problem, at the cost of increased memory usage in the event where consumers do not check for updates very often.

Now, how far are we with the design promises that I made?

  • Wait-free reads: OK
    • Consumers always complete in a fixed and finite amount of memory operations
  • Wait-free writes if there are no memory usage constraints: Partial
    • I have explained how the producer can avoid waiting for consumers before inserting new memory blocks, by refcounting at the granularity of individual shared data versions, but not how the memory is actually discarded.
  • Can cope with consumers going idle: OK
    • At worst some old buffer’s liberation will be delayed.
  • Is self-contained: OK
    • The entire reader tracking logic is contained in the atomic operations that producers perform on writes and consumers perform on reads.
  • Doesn’t require dynamic memory allocation: Nope
    • I will explain this one together with the memory reclamation mechanism.

(To be continued)


#3

Storage management

So far, I have defined ways for a producer to…

  1. Atomically publish an update to a shared data structure, whose memory location is identified by an integer.
  2. Know when the previous versions of the data structure have been discarded by their consumers, and can be safely reclaimed.

What’s still missing from this picture, at this point, is storage management:

  • How storage is allocated (aka what the aforementioned integer identifier points to)
  • How storage is reclaimed after use

From bounded storage…

There are many ways to manage storage, depending on one’s requirements. Personally, since my experience with dynamic memory allocation is that implementations vary enormously in quality and performance, and that it is extremely easy to become bottlenecked by them, I decided that I wanted none of it at run time. So I went for a design that allocates a fixed-size pool of buffers at initialization time, and then continuously reuses them, as in the triple buffer case.

In such a bounded allocation design, there are two things to take care of:

  • Find out what the “ideal” amount of buffers is for a given usage scenario
  • Degrade nicely when not enough buffers are allocated

Let’s start with the former. In an ideal world, the answer to this question would be “2+N buffers” where N is the amount of readers. In the worst case, this allows each reader to have its own read buffer, and the writer to have 2 buffers to flip between on every write. And in a strict generalization of the triple buffer design, that would be enough for everyone to be wait-free…

…but sadly enough, that’s not the actual answer. The answer is that as currently implemented, in order to be provably wait-free in any circumstance, SPMCBuffer would require at least 2+2xN buffers, and possibly more depending on architecture specifics. We’ll come back to why later on.
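As a quick worked example of these bounds (formulas only, not actual crate API): with N = 4 readers, the ideal bound would be 6 buffers, but the current implementation needs 10 to be provably wait-free.

```rust
// Ideal bound: one read buffer per reader + two buffers for the writer to
// flip between on every write
fn ideal_buffer_count(num_readers: usize) -> usize {
    2 + num_readers
}

// Bound for the current implementation: as discussed below, each reader may
// transiently hold on to two buffers at once
fn wait_free_buffer_count(num_readers: usize) -> usize {
    2 + 2 * num_readers
}
```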

This can be improved upon, but at the cost of degraded performance in the common case. Running out of buffers should really be considered an edge case originating from very unfortunate reader behaviour or scheduling. So I would not consider optimizing against it worthwhile, except in scenarios where writer wait-freedom is considered absolutely essential.

If at any point in time, a writer finds itself without a spare buffer to write to, the current SPMCBuffer implementation will have it spin by waiting for one to become available in a loop. This technique is inefficient for long waits, but reflects the fact that if enough buffers have been allocated, the producer will never need to wait long enough for mutexes and condvars to become worthwhile.

…to the tradeoffs of optimized atomics ordering

To understand why writer wait-freedom cannot be guaranteed without taking a performance hit, let’s look at (a slightly simplified version of) the consumer’s algorithm:

fn access_read_buffer(&mut self) -> &Buffer<T> {
    // Determine the index of the latest buffer
    let init_bitfield_value = self.shared.bitfield.load(Ordering::Relaxed);
    let latest_index = (init_bitfield_value & INDEX_MASK) / INDEX_MULTIPLIER;

    // Has an update occurred since our last read?
    if latest_index != self.read_index {
        // If so, increment the latest buffer's refcount, get access to it, and
        // synchronize with the buffer updates from the producer
        let bitfield_value = self.shared.bitfield.fetch_add(1, Ordering::Acquire);

        // Discard our current read buffer. Correct memory ordering is guaranteed
        // by the preceding Acquire memory barrier.
        let read_buffer = &self.shared.buffers[self.read_index];
        read_buffer.outgoing_count.fetch_add(1, Ordering::Relaxed);

        // Make the latest buffer our new read buffer
        self.read_index = (bitfield_value & INDEX_MASK) / INDEX_MULTIPLIER;
    }

    // Return a reference to the current read buffer
    &self.shared.buffers[self.read_index]
}

Note the memory orderings there:

  • The first load from the bitfield is relaxed, and can thus be freely reordered before the start of the function by the compiler or the hardware (but not after the if, since its condition depends on that load). This is okay: we can afford to miss an update from the producer, we’ll just get it on the next lookup.
  • The bitfield fetch-add uses acquire ordering, which is necessary in order to synchronize with the latest buffer’s state. As a side effect, this acquire memory barrier also prevents any subsequent load or store from being reordered before it.
  • The outgoing reader count fetch-add uses relaxed ordering. In principle, this is dangerous: it could allow the buffer liberation to be reordered anywhere before or after in the reader’s thread, including at a point in code where the reader might still be reading the old buffer. But in practice, the preceding Acquire fence prevents this from happening.

To summarize, I manage to get away with a single acquire memory fence, which can be a significant optimization on architectures with relaxed memory consistency like ARM and POWER, by carefully ordering the former read buffer’s liberation after the reservation of the new buffer and the acquire fence that comes with it.

But this comes at a cost: the consumer reserves a new buffer before liberating the old one, which means that if a very unlucky and unlikely sequence of events occurred, and all consumers were frozen in between these two steps, they could be using two separate buffers each, leading to the 2+2xN bound given previously.

And as mentioned, the actual situation is even worse than this. Because the buffer liberation fetch-add uses relaxed ordering, some incredibly evil compilers and hardware would be allowed by the C11 memory model to reorder it as far ahead of the Acquire fence as they like, including after the reservation of another buffer. So in principle, a consumer could end up reserving multiple buffers before liberating a single one!

Now, the odds of this happening are actually pretty low. Real-world consumers do a lot more than reading from SPMC buffers in a loop. Write buffers fill up and are flushed to lower-level caches. OS schedulers do not conspire to ensure that consumers eat up as much memory as possible. Compilers do not reorder relaxed atomic operations thousands of lines of code after their initial location as a hobby.

But if we really wanted to prevent such scenarios from happening at all cost, we’d have to modify the code like this:

    // ...
    if latest_index != self.read_index {
        // Discard our current read buffer. Acquire-release ordering prevents this
        // operation from being reordered before the preceding reads from that buffer,
        // or after the reservation of our next buffer
        let read_buffer = &self.shared.buffers[self.read_index];
        read_buffer.outgoing_count.fetch_add(1, Ordering::AcqRel);

        // Increment the latest buffer's refcount, get access to it, and
        // synchronize with the buffer updates from the producer
        let bitfield_value = self.shared.bitfield.fetch_add(1, Ordering::Acquire);

        // Make the latest buffer our new read buffer
        self.read_index = (bitfield_value & INDEX_MASK) / INDEX_MULTIPLIER;
    }
    // ...

With such acquire-release ordering on the consumer side, and matching memory barriers on the producer side, we would have provable writer wait-freedom with 2+N buffers, because a consumer would always be perceived by the producer as releasing its old buffer before acquiring a new one.

But the cost of such extra memory barriers could be prohibitive on anything more relaxed than x86, only to prevent an infrequent event from causing the writer to spin for a while.

In the end, as I said, I do not consider the trade-off to be worth it.

Conclusion

So, that was my experience extending my triple buffering primitive to support multiple consumers. It took me through a strange trip involving discovering the hidden ugly face of Linux RCU, implementing lock-free reference counting in a manner which is reasonably efficient and which allows more than two buffers in flight, and exploring the tradeoffs of combining relaxed atomics ordering with bounded memory allocations.

If you want to play with the result, please do not hesitate to go have a look at the source and the associated crate.