Which mechanism to use for low-latency synchronization?

I'm currently trying to rustify my synthesizer whose current implementation is single threaded. My targeted design requires a bunch of heterogeneous long-living threads to synchronously start processing a common input chunk of data. Their results may be collected asynchronously. The next chunk is formed based on the collected results and the loop starts again.

You can think of the whole mechanism to be a bunch of workers waiting for a global repetitive event or clock signal to start their next iteration of work.

Each round will have to complete within roughly 5 µs (one sample period at a sampling rate of 192000 Hz is about 5.2 µs). Each input chunk depends on the results of all threads, so there's no possibility to buffer a bunch of results in order to reduce the number of synchronization calls.

The threads' computations by themselves are pretty cheap, but they may differ from thread to thread, so work stealing or thread pools are not an option (afaik).

I'm just digging through std::sync to fill my toolbox, but I have no idea how these synchronization primitives might be typically backed by CPU features to guess their impact on latency and thus on the possibility to use them for near real-time programming.

So far I've chosen to use std::sync::mpsc to asynchronously collect the results after each round, which I expect to be fast enough.

For synchronously starting/waking/notifying the threads to start their computation I don't really know what to use. I had the following ideas so far:

  1. Using a std::sync::Barrier with a capacity of thread_count + 1. The control thread will invoke wait() to unlock all computation threads. This sounds easy but I have no idea how to dynamically add or remove further threads during run-time without having to re-initialize the entire thread collection.

  2. I don't know if I understood std::sync::Condvar right. I thought about wait()ing for a condition in all threads and to call notify_all() from the control thread.

  3. A spinlock would achieve the desired latency, but there might be a few dozen threads at a time, which would overstress the CPU. Is there some form of a slow spinlock that gives the CPU some hundred cycles each round to gasp for air?

  4. One std::sync::mpsc::channel or std::sync::mpsc::sync_channel per thread in a loop that wakes up all threads sequentially (quite ugly, wasteful and possibly slow)
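To make idea 2 concrete, here is a minimal sketch of the Condvar approach combined with the mpsc collection mentioned above. It assumes a shared round counter guarded by a Mutex: the control thread bumps the counter and calls notify_all(), and each worker sleeps in wait_while() until it sees a round it has not processed yet. The names Clock and run_rounds and the arithmetic inside the worker are made up for illustration. Note that waking a sleeping thread still goes through the OS, so this shows the mechanism, not a way to guarantee the 5 µs budget.

```rust
use std::sync::{mpsc, Arc, Condvar, Mutex};
use std::thread;

/// Shared "clock": the current round number plus the input chunk for it.
/// All names here are illustrative, not taken from the original post.
struct Clock {
    state: Mutex<(u64, f32)>, // (round, input); round 0 means "not started yet"
    tick: Condvar,
}

/// Spawns `worker_count` long-lived workers, drives one round per input and
/// returns the sum of all worker results for each round.
fn run_rounds(worker_count: usize, inputs: &[f32]) -> Vec<f32> {
    let clock = Arc::new(Clock {
        state: Mutex::new((0, 0.0)),
        tick: Condvar::new(),
    });
    let (tx, rx) = mpsc::channel::<f32>();
    let last_round = inputs.len() as u64;

    let handles: Vec<_> = (0..worker_count)
        .map(|id| {
            let clock = Arc::clone(&clock);
            let tx = tx.clone();
            thread::spawn(move || {
                let mut seen = 0u64;
                while seen < last_round {
                    // Sleep until the control thread publishes a newer round.
                    // wait_while re-checks the condition, so spurious wakeups are harmless.
                    let guard = clock.state.lock().unwrap();
                    let guard = clock.tick.wait_while(guard, |s| s.0 == seen).unwrap();
                    let (round, input) = *guard;
                    drop(guard); // release the lock before doing the actual work
                    seen = round;
                    // Placeholder computation; every worker could do something different.
                    tx.send(input * (id as f32 + 1.0)).unwrap();
                }
            })
        })
        .collect();
    drop(tx); // only the workers hold senders from here on

    let mut sums = Vec::with_capacity(inputs.len());
    for (i, &input) in inputs.iter().enumerate() {
        {
            // Publish the next round while holding the lock, then wake everyone.
            let mut state = clock.state.lock().unwrap();
            *state = (i as u64 + 1, input);
        }
        clock.tick.notify_all();
        // Collect exactly one result per worker before starting the next round.
        let sum: f32 = (0..worker_count).map(|_| rx.recv().unwrap()).sum();
        sums.push(sum);
    }
    for handle in handles {
        handle.join().unwrap();
    }
    sums
}
```

Calling run_rounds(3, &[1.0, 2.0]) drives two rounds with three workers. Compared to a Barrier, nothing in the clock itself depends on the number of workers; only the collecting loop needs to know how many results to expect per round.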

I might try to group several computations to form fewer but bigger threads. But the code complexity would be drastically increased and a reconfiguration at run-time would introduce a hard to grasp timing behavior while the benefits are questionable.

I'd like to hear from your experience.


I think that 5 µs for each round would be really hard to achieve: on my machine it takes about 25 µs to do a single sleep-wake cycle of one thread. You should definitely consider grouping the samples in chunks if you want to benefit from multithreading at all. But if you really want to achieve 5 µs latency, I'd try a microcontroller or some DSP (I'm assuming a regular desktop kernel). Maybe a low-latency variant of the kernel would help, but not too much, I guess.
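A number like this is easy to check on your own machine. Below is a rough sketch (not the code the figure above was measured with) that ping-pongs an empty message between two threads over std::sync::mpsc and averages the round trip. The function name average_round_trip is made up for this example. One round trip contains two wake-ups, and the channel implementation may spin briefly before putting a thread to sleep, so treat the result as a ballpark only.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

/// Measures the average time for "wake another thread and get an answer back".
/// `iterations` is the number of ping-pong round trips to average over.
fn average_round_trip(iterations: u32) -> Duration {
    let (ping_tx, ping_rx) = mpsc::channel::<()>();
    let (pong_tx, pong_rx) = mpsc::channel::<()>();

    let echo = thread::spawn(move || {
        // Blocks until a ping arrives; the loop ends once ping_tx is dropped.
        for () in ping_rx {
            pong_tx.send(()).unwrap();
        }
    });

    let start = Instant::now();
    for _ in 0..iterations {
        ping_tx.send(()).unwrap(); // wake the echo thread
        pong_rx.recv().unwrap(); // sleep until it answers
    }
    let elapsed = start.elapsed();

    drop(ping_tx); // lets the echo thread leave its loop
    echo.join().unwrap();
    elapsed / iterations.max(1) // avoid dividing by zero
}
```

Something like println!("{:?}", average_round_trip(100_000)) in a release build gives a first impression; results will vary a lot with the OS, the scheduler and the CPU's power-saving settings.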

Using multiple channels indeed seems a little bit wasteful, but you have to wake those threads anyway, and it's the waking itself that is expensive. Using channels also has the really nice benefit of auto-buffering in case the threads can't keep up with the latency (provided you have the collector on a different thread than the source). So if the best your OS can do is e.g. 50 µs, then your code using channels would probably get something around that latency, with a buffering window of about 10 samples.

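For reference, a minimal sketch of the one-channel-per-worker variant could look like this. The names spawn_worker and process and the placeholder arithmetic are assumptions for illustration. To keep it short, the source and the collector live on the same thread and run in lock step, so it does not show the buffering effect described above; for that, the sending loop and the receiving loop would have to run on separate threads.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};

/// Spawns one long-lived worker that owns the receiving end of its own input channel.
/// Results are sent to the shared `results` channel, tagged with the worker id.
fn spawn_worker(id: usize, results: Sender<(usize, f32)>) -> (Sender<f32>, JoinHandle<()>) {
    let (input_tx, input_rx) = mpsc::channel::<f32>();
    let handle = thread::spawn(move || {
        // The iterator sleeps until the next chunk arrives and ends when input_tx is dropped.
        for input in input_rx {
            // Placeholder computation; stop if the collector has gone away.
            if results.send((id, input + id as f32)).is_err() {
                break;
            }
        }
    });
    (input_tx, handle)
}

/// Feeds every input to all workers in turn and returns the per-round sum of their results.
fn process(worker_count: usize, inputs: &[f32]) -> Vec<f32> {
    let (result_tx, result_rx) = mpsc::channel();
    let workers: Vec<_> = (0..worker_count)
        .map(|id| spawn_worker(id, result_tx.clone()))
        .collect();
    drop(result_tx); // only the workers hold result senders from here on

    let mut out = Vec::with_capacity(inputs.len());
    for &input in inputs {
        // Wake the workers one after another by sending each of them the chunk.
        for (tx, _) in &workers {
            tx.send(input).unwrap();
        }
        // Wait for one result per worker before forming the next chunk.
        let sum: f32 = (0..worker_count)
            .map(|_| result_rx.recv().unwrap().1)
            .sum();
        out.push(sum);
    }

    // Dropping a worker's sender ends its loop, after which it can be joined.
    for (tx, handle) in workers {
        drop(tx);
        handle.join().unwrap();
    }
    out
}
```

With this layout, adding or removing a worker at run time only means adding or dropping one (Sender, JoinHandle) pair; the other threads are not affected.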

Thanks a lot for those numbers! 25 µs would be an order of magnitude too high and I can't afford to use intermediate buffers. They would accumulate badly if a signal has to travel through lots of computers.

Maybe I'll stick with a mostly single-threaded solution, using the remaining CPU cores for precomputations to allow the real-time thread to be as light as possible.