Lock-free threadsafe ring buffer

I want to either find or create a lock-free ring buffer. I don't want any memory fences on the write side, because it is in a realtime thread. jack audio has such a ringbuffer in its source code that doesn't seem to use any synchronization, but then isn't it susceptible to data races?

My question is: is the jack ringbuffer implementation undefined behaviour, and if not, why not? If it is, is there any way to copy data out of a RT thread that isn't UB?

I think this is UB, even in C. It certainly is UB in Rust, because unsynchronized cross-thread access to mutable data without UnsafeCell and/or synchronization is not allowed.

I think you could use AtomicU8 as the backing store and use Ordering::Relaxed to legally get a buffer that can be written to without synchronization; at least on x86 it'll compile to normal accesses without fences, etc.
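
A minimal sketch of that idea (the buffer shape and names here are illustrative, not from any existing crate):

```rust
use std::sync::atomic::{AtomicU8, Ordering::Relaxed};

fn main() {
    // A buffer of AtomicU8 that both threads could share (e.g. via Arc)
    // without locks. Relaxed loads/stores are not a data race under Rust's
    // memory model, and on x86 they compile to plain byte accesses.
    let buffer: Box<[AtomicU8]> = (0..4096).map(|_| AtomicU8::new(0)).collect();

    buffer[0].store(0x2a, Relaxed);            // what the write side would do
    assert_eq!(buffer[0].load(Relaxed), 0x2a); // what the read side would do
}
```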

Never mind what Rust or C think is UB. It used to be, in the simple days before caches, memory models, and whatever other complexity, when programs were written in assembler, that one could create a cyclic buffer, a FIFO, between one producer and one consumer without any need for mutexes, locks, or atomic operations at all. Even running on different cores.

I'm not sure where this all went wrong: in the hardware, in our compiler optimizers, or what.

Well, one of the places where it went wrong is when compilers decided it was a good idea to reorder operations I guess. Now you gotta use atomic instructions to tell them not to.

It's no good if your compiler reorders the memory access with the locking of the mutex.


Is it even possible nowadays to tell a compiler to:

  1. Do this.
  2. Do the other.
  3. Do something else.

In that frikken order. How simple could it be? No, it is not UB, it is what I want you to do!

All of programming is only "sequence, selection and iteration". We are only on step one and apparently getting that wrong!

If we can ever get the compiler to do what we say, can we then get modern hardware to do what the compiler says, without having its own odd ideas about reordering things?

Sorry, being a bit grumpy there. Overwhelmed by the complexity of trying to do simple things nowadays.

As long as the program behavior is indistinguishable from a program that actually does do them in that order, surely it's fine!

Locks are not the same as memory fences. Lock-free means the thread doesn't block. Atomics are used to implement lock-free algorithms, and (with the appropriate orderings) they ensure memory operations cannot be re-ordered.

You could use a bounded channel from the crossbeam crate - that is an optimised lock-free ring buffer. It won't block on write (=send) unless the buffer is full.
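
For example (a sketch, not anyone's actual code here), the realtime side could use try_send so it never blocks and decide what to do when the buffer is full:

```rust
use crossbeam::channel::{bounded, TrySendError};

fn main() {
    // Fixed-capacity channel acting as a single-producer/single-consumer queue.
    let (tx, rx) = bounded::<f32>(1024);

    // Realtime side: try_send never blocks, even when the buffer is full.
    match tx.try_send(0.5) {
        Ok(()) => {}
        Err(TrySendError::Full(_sample)) => { /* drop the sample or count an overrun */ }
        Err(TrySendError::Disconnected(_)) => { /* consumer went away */ }
    }

    // Consumer side: a blocking recv is fine off the realtime thread.
    if let Ok(sample) = rx.recv() {
        println!("got {sample}");
    }
}
```

Note that send on a full bounded channel blocks, which is why the sketch uses try_send on the realtime side.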


I think that right there hits the nail on the head and describes the whole problem.

In the closed circle of whatever abstract programming language world one lives in, with even such things as threads, one runs one's code, checks the inputs against outputs, has all kinds of tests in place. It matters not what optimization goes on in the compiler or processor; as long as the results are the same, all is well. It matters not what operations happen in what order, or even if they get optimized away completely. You can't tell the difference, so it's all good, right?

Except, what happens when one's program is talking to something outside that cozy world? Something the compiler, even the processor, knows nothing about. Something that expects things to happen in the right order and even at the right time?

At that point I want to wrap a block in "unsafe" and have it mean: "Do what I say, in the order I say it, such that anyone in the outside world sees what I do".

This is why we have so many old timers clinging to C still. They think they have control over what happens (Never mind that their compilers and processors have been conspiring against that idea for decades now).

Well, if there weren't all those pesky conditions we had to uphold to make the "indistinguishable" part become true, then it would be fine.

I think you see the problem.

On the inside of the circle of abstraction things are "indistinguishable". However those guys outside, hardware and perhaps software, can tell when you are cheating.

As we are talking of a lock-free ring buffer, I'm reminded that there is a serial port driver for the Parallax Inc. Propeller multi-core microcontroller that does exactly that in 512 32-bit instructions. Most of which are used to implement the full-duplex serial port bit-banging, not the ring buffer. The Propeller even has LOCK instructions, which are almost never used.

I think you can still do that if you turn off all compiler optimizations and then set affinities so that all your threads run on the same core. But you may notice a performance degradation.


IIRC, @HadrienG was developing a no_std lock-free thread-safe ring buffer. I don't know whether that memory is correct, whether he completed the design, or whether he published it.


I'm not too familiar with ring buffers. I imagine you use it as a fixed-capacity queue? If that's the case, do you need any operations besides push and pop? How should push behave, if the ring buffer is full? How should pop behave, if the ring buffer is empty?
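
For concreteness, one possible shape of such an API might look like this (a sketch with illustrative names, not an existing crate):

```rust
/// One possible shape of a fixed-capacity SPSC queue; the behaviours asked
/// about above are the main design decisions.
pub trait SpscQueue<T> {
    /// One policy: refuse and hand the value back when the buffer is full.
    /// (Another would be to overwrite the oldest element instead.)
    fn push(&self, value: T) -> Result<(), T>;

    /// Return None when the buffer is empty, rather than blocking.
    fn pop(&self) -> Option<T>;
}
```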

You can't actually turn off all compiler optimizations. For example, some optimizations happen as part of register allocation, and you can't skip that if you want to produce assembly at all.

I'm pretty convinced the design was right, but I ran out of motivation before finishing the implementation :wink:

The general idea was to write an RT-safe logger by using abomonation to serialize logs into pre-allocated storage and deserialize them in place, as demonstrated here: https://github.com/HadrienG2/rt-logger/tree/master/src .

Which led me to do a lot of work on reducing UB in abomonation... that unfortunately never got merged due to lack of maintainer availability: https://github.com/TimelyDataflow/abomonation/pulls .

I was not happy about using fixed-size memory blocks for log storage as done in the archived version of rt-logger, because sometimes that was too much, and sometimes that was too little. That also had a performance impact, because I used a crossbeam channel, not a custom one, which meant moving a full memory block into the channel on emission and out on reception, no matter how much of that block I was actually using. I could have resolved the latter problem by sending &'static references to statically allocated blocks around, but that would have meant using some kind of static mut storage, which meant writing some unsafe to avoid locks anyway, and it wouldn't have resolved the former problem of inadequate block size.

So I set out to write a realtime-safe simple dynamic memory allocator suitable for this need: https://github.com/HadrienG2/bacara . The idea was then to modify rt-logger to allocate memory, serialize the input data in there, and send a pointer to that memory through the channel on message emission. Then, on reception, my custom channel would deserialize the data in place, pass a pointer to it to another callback, and deallocate the memory after the callback is over.

But I ran out of energy for doing this in my spare time at some point during the implementation of that allocator, more precisely while writing the unit tests, before entering the performance testing phase.

I still have planning notes around if someone feels like taking this over, though.

But maybe for what you're doing, just using a bounded crossbeam channel directly would work just fine?

So I think I need to say a little more about the problem I am trying to solve.

I'm playing with the mozilla deepspeech (now STT) speech-to-text ML model. I'm experimenting with a trigger word, similar to Amazon's Alexa. I want to try a few implementation strategies, but for now my strategy is to periodically read audio data from a buffer every 0.2 secs (the 0.2 is a knob I can tweak) to see if it matches a trigger word.

My audio setup at the moment uses jack so my audio input will be a jack client with 1 audio-in port. Jack works by calling a user-supplied callback every time there is data available. It is expected that the callback runs quickly: any heavy duty computation should happen on another thread. Otherwise more audio data might be available before the previous data has been processed, and some data could be lost, causing e.g. audible artifacts. Jack supplies a ring buffer structure for this purpose.

The fact that we need to handle the sound quickly precludes a lot of technologies. If we were to use a Mutex to safely exchange data with another thread, we would not be in control of whether we need to wait before processing more audio data - we might be blocked. Jack's ring buffer doesn't use any synchronization at all, and therefore has undefined (or at best implementation-defined) behavior in places. If we know what hardware/OS we are running on, and what C compiler we are using, we could rely on that behavior, but cross-platform code cannot.

I'm experimenting with different ringbuffer designs. My current idea is to simply write any audio data into the buffer as it becomes available. If I get to the end of the buffer I start again at the beginning. It may be that a write is in progress when a read occurs in another thread, but because I'm looking for a trigger word I don't care if I occasionally get garbage: I'm going to try again in 0.2 seconds. The reader just gets a reference to the whole buffer (in 2 halves, since the previous write might not have ended at the end of the buffer). The code I've got works, but I'm pretty sure it's UB. I can post it if that's useful. One thing I want to happen is for my hardware to be able to do memcpys when I move data into the ring buffer, rather than copying each byte individually. This is a worry I would have with using AtomicU8, but maybe I'm underestimating the optimizer. :slight_smile:
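
A hedged sketch of that idea in safe Rust (not the actual code mentioned above; the names are made up): each f32 sample's bit pattern lives in an AtomicU32 accessed with Relaxed ordering, so an individual sample is never torn even though the buffer as a whole may mix old and new data.

```rust
use std::sync::atomic::{AtomicU32, AtomicUsize, Ordering::Relaxed};

pub struct TriggerBuffer {
    samples: Box<[AtomicU32]>,
    write_pos: AtomicUsize, // next slot the writer will overwrite
}

impl TriggerBuffer {
    pub fn new(len: usize) -> Self {
        Self {
            samples: (0..len).map(|_| AtomicU32::new(0)).collect(),
            write_pos: AtomicUsize::new(0),
        }
    }

    /// Realtime thread: overwrite slots, wrapping around at the end.
    pub fn write(&self, data: &[f32]) {
        let mut pos = self.write_pos.load(Relaxed);
        for &sample in data {
            self.samples[pos].store(sample.to_bits(), Relaxed);
            pos = (pos + 1) % self.samples.len();
        }
        self.write_pos.store(pos, Relaxed);
    }

    /// Analysis thread: copy out the whole buffer, oldest data first.
    /// The copy may mix samples from different callbacks; that is accepted here.
    pub fn snapshot(&self) -> Vec<f32> {
        let pos = self.write_pos.load(Relaxed);
        let len = self.samples.len();
        (0..len)
            .map(|i| f32::from_bits(self.samples[(pos + i) % len].load(Relaxed)))
            .collect()
    }
}
```

Whether the optimizer can coalesce those per-slot Relaxed stores into wider, memcpy-like copies is exactly the worry mentioned above, and would need checking on the target in question.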

Anyone know if mmap anonymous escapes from being UB with multiple threads? (ie just nondeterministic without sync) (and what (if any) overhead it causes.)

Also any idea about Java memory model, if/how that differs.

Your problem does not require a generalized lock-free thread-safe ring buffer; a much simpler solution is available.

You are describing a periodic writer that overwrites an entry in a circular ring buffer every knob seconds (e.g., 0.2 s), and an asynchronous reader that wants to stay just behind the writer. Both reader and writer can read the system clock and use that as the basis for synchronization. The shared sense of time permits the writer to determine the buffer slot based on the time of intended writing, and the reader to know which buffer slot should be the slot currently being written, and thus stay one or more buffer slots behind that point.

Just remember to start the reader at least one slot-interval after you start the writer (and give them a common time-based starting index). The rest is straightforward.
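
As an illustration only (a sketch of the slot arithmetic; slot_interval, num_slots and the helper name are assumptions, not code from this thread):

```rust
use std::time::{Duration, Instant};

/// Which slot the writer should currently be writing, given a starting instant
/// shared by writer and reader and the per-slot interval (the 0.2 s knob).
fn current_slot(start: Instant, slot_interval: Duration, num_slots: usize) -> usize {
    let slots_elapsed =
        (start.elapsed().as_secs_f64() / slot_interval.as_secs_f64()) as u64;
    slots_elapsed as usize % num_slots
}

fn main() {
    let start = Instant::now();
    let interval = Duration::from_millis(200);

    // Writer fills the current slot; the reader, started at least one interval
    // later, reads the slot just behind it.
    let write_slot = current_slot(start, interval, 16);
    let read_slot = (write_slot + 16 - 1) % 16;
    println!("writer slot {write_slot}, reader slot {read_slot}");
}
```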


@TomP Unfortunately common operating systems don't give such guarantees of when code will run or when CPU caches get read from and written out to memory.

Programmers have to make use of atomics and ordering to avoid UB when not using locks. As in my post above, there may be code that does not lead to UB; I just lack that kind of low-level knowledge.

I was not trying to provide working code for the OP, but merely to point out that the OP's problem was much simpler than a generalized lock-free thread-safe ring buffer.

You are correct that attention does have to be paid to synchronizing write-before-read access across multiple cores. (Cache synchronization issues do not arise when the multiple threads run on a single core.) A fully-general two-core solution would align the buffers with L1 cache lines and add a written_but_not_yet_read toggling marker within each cache line of the buffer. The existence and checking of such an in-memory marker in each cache line is sufficient to ensure the alternation of reads and writes, even if the cores do not implement atomics, because L1 cache-line access is essentially unitary with respect to the associated core.

If the architecture provides atomics or LR/SC (load reserved, store conditional) or similar primitives, those can be used for the checking. Otherwise a brief spinlock/retry can suffice.

Since minimum L1 cache line length is architecture-specific, the details also will be architecture-specific. However, it seems likely that OP's need is for a product and thus a specific architecture, so the requirement to tune to L1 cache line length should not be too onerous. A reasonably-safe default assumption for L1 cache line length would be 32 B (i.e., 256 b, four u64s).
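
As a rough illustration of one such slot (a sketch under the 32 B assumption above, with made-up names; it is written with atomics, since portable Rust has no way to rely directly on unitary cache-line access the way the previous post describes):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// One cache-line-sized slot: one word is the written_but_not_yet_read marker,
// the remaining three words carry the payload.
#[repr(align(32))]
struct LineSlot {
    marker: AtomicU64,       // 0 = free, 1 = written but not yet read
    payload: [AtomicU64; 3], // the rest of the 32 B line
}

impl LineSlot {
    /// Writer side: only fill a slot the reader has already drained.
    fn try_write(&self, words: [u64; 3]) -> bool {
        if self.marker.load(Ordering::Acquire) != 0 {
            return false; // still unread; caller can spin/retry briefly
        }
        for (slot, w) in self.payload.iter().zip(words) {
            slot.store(w, Ordering::Relaxed);
        }
        self.marker.store(1, Ordering::Release); // publish the new payload
        true
    }

    /// Reader side: take the payload and hand the slot back to the writer.
    fn try_read(&self) -> Option<[u64; 3]> {
        if self.marker.load(Ordering::Acquire) != 1 {
            return None; // nothing new in this slot
        }
        let mut out = [0u64; 3];
        for (o, slot) in out.iter_mut().zip(&self.payload) {
            *o = slot.load(Ordering::Relaxed);
        }
        self.marker.store(0, Ordering::Release); // mark as read
        Some(out)
    }
}
```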