Any real-time safe logger around?

So, I'm currently dipping my toes in audio visualization, and discovering the fun challenges of Real Time Done Right where one tries very hard to fully ban unbounded-time work (system memory allocator, blocking I/O and synchronization...) from the timing-sensitive threads.

These constraints lead to less straightforward code that is more likely to have weird and hard-to-reproduce bugs, so I think excellent diagnostic information is crucial. Hence I'd like to have some logging in my RT threads. But I obviously can't just drop in a general-purpose logger there, because those do very RT-unsafe things. What could maybe work, however, is to use an asynchronous facade that feeds a dedicated logging thread in an RT-safe way.

Here's my vision of an ideal asynchronous logger for this purpose:

  • Threads that send logs won't ever block, do I/O, or allocate memory. All they do is to serialize their logs into a lock-free MPSC queue, which is asynchronously processed by the logging thread.
  • A queue that neither blocks nor allocates must have a finite capacity and drop messages when it's full. Personally, I'd like to give new messages priority over old ones.
  • Log queue overrun is a serious issue that should be detected and reported by the logging thread.

So far, my research on crates.io has not turned up any suitable crate. Some like slog-async get really close, but even that one allocates memory all over the place. Of course, one solution would be to write an RT-safe memory allocator (using fixed-size memory blocks that are allocated during application startup, managing those the arena way, and panicking if they get all used up), and inject it as the global allocator of my application. But if I can avoid this level of guru hackery and messing around with magical global state, I would rather do so.

So, is anyone aware of an existing effort towards RT-safe logging? Or should I write a crate for that myself?

I think the main problem here is that logging frameworks assume that formatting will be done by a backend, this is why they reach for allocations so fast. While in your case log messages from RT threads must have a fixed size and can not use String or Vec. So I think you will have to write a custom solution for this problem, i.e. find a suitable lock-free ring-buffer based MPSC queue, define a log message type which will be sent over this queue and implement a receiver thread which will accept those messages and will forward them to log macros. And it would be definitely interesting to see resulting code in a stand-alone crate.

1 Like

I just went a bit to the drawing board, and I think that messages actually do not need to be fixed-size. All they really need is to be serializable, and to know an upper bound of their size in bytes in advance...

...however, the support data structures and algorithms for variable-sized messages that I came up with are nasty, so if I end up implementing this, the first prototype will definitely use fixed-size messages over a standard MPSC bounded queue implementation :wink:

Ideas for variable-sized messages

Basic protocol for variable-sized push is as follows:

  1. Determine how much space you need
  2. Allocate a block at least this large inside a fixed-sized storage block
  3. Fill it up using your favorite message serialization protocol
  4. Send coordinates of the message's storage slice over a classic MPSC queue
  5. If allocation or send fails, increment buffer overrun counter

Once this nasty part is taken care of, popping is straightforward:

  1. Pop message slice from the MPSC queue
  2. Read data from the slice and deserialize it (can actually use memory allocation here, we're on the logging thread's side)
  3. Liberate storage block after use
  4. Notify caller of buffer overruns, if any, and reset the counter

The tricky bit here is of course push step #2, which effectively amounts to implementing a concurrent memory allocator with liberation support. It doesn't need to be the fanciest allocator in the world, a bitmap allocator with a small block size would do just fine for this undemanding task. But it's still quite a piece of work.

Further, while sketching this, I also reached the conclusions that my desired semantics of dropping old messages to make room for new ones would be a royal pain to implement, especially in a lock-free way and for variable-sized messages. I don't care that much about it since dropping messages should never happen, so I'll give up on that requirement.

2 Likes

See also the same question in the rust-audio discourse group.

I have another idea how messages of various sizes can be sent
trait GetAndWriteLogs {
    get_and_write_logs(&self);
}
  • At setup time (while you can still allocate in the realtime thread):
    • In the realtime thread, you create a (Sender<MyStructuredLogMessage>, Reciever<MyStructuredLogMessage>) pair, but the Reciever<MyStructuredLogMessage> in a Box<dyn GetAndWriteLogs> and send it to the logging thread.
    • In the logging thread, you recieve the Box<dyn GetAndWriteLogs> and put them in a Vec.
  • While running
    • In the realtime thread, you send the MyStructuredLogMessage's with the Sender<MyStructuredLogMessage>.
    • In the logging thread, you loop over the Vec and run get_and_write_logs for each element.

The upside of my idea is that it seems easy to implement; I think I would only need a couple of hours (I'm being naively optimistic here). The downside is that there's a lot of boilerplate during startup.

Also relevant: this PR about adding a wait-free spsc to crossbeam.

Do you have a plan to block the logging thread when no message is incoming and wake it up again when logs arrive? It seems difficult to avoid busy waiting in the design that you are proposing.

Typically busy waits aren't too bad if you use yield_now.

How do you feel about atomic integer operations wrt. the audio thread?

Oh, I don't mind a little bit of busy waiting as long as it is exceptional, bounded in time, and better than the blocking alternatives (if any). A perfect example would be how mutexes spin a bit to resolve micro-contentions before putting the thread to sleep: given the cost of blocking, that's totally the right thing to do in this situation.

What I dislike is when software has a constant polling activity in the background for no good reason. It wastes CPU and battery power, it puts pressure on the kernel scheduler, and it's not composable in the sense that you really don't want every app to be doing it. So I try to do my part in keeping the system tidy by avoiding such behavior myself whenever I can.

That strongly depends on what atomic operations we're talking about and how they're used!

  • At one extreme, some synchronization transactions can be done with a couple of relaxed load/stores. Compiler bugs aside, that's essentially free, so I wouldn't be afraid of doing lots of them.
  • At the other extreme, non-lock-free atomic usage like spin-locks (whose simplest implementation is a single atomic instruction in a loop) are smelly in RT code in my opinion, because they force you to rely of the kernel scheduler doing the right thing.
  • And in the middle, there is a huge "it depends" gray area full of things that are normally not terribly costly, but can easily become a bottleneck if abused. Read-modify-write instructions, CAS loops with little expected contention, non-Relaxed orderings, that sort of stuff. I try to have as few of them as possible per synchronization transaction, and to provide batching optimizations if I expect a transaction using them to be carried out very often.

I'm pretty sure you cannot avoid locks in the case where you run out of pre-allocated memory, but it should be possible to be wait free as long as you have buffer space left.

If the logging operation is fallible (or just drops messages, as per OP) this shouldn't be an issue. Or have I missed another reason why it would have to block?

The design that I have in mind achieves lock-freedom on the producers' side by dropping messages when storage is full (keeping track of that so that the dev can be nagged to increase the log buffer size by the logger thread), and allows blocking on the consumer side so that the logging backend falls asleep when there's no work to do.

It should be easy to also provide a lock-free interface on the consumer side (try_pop() in addition to pop(), in channel terminology), as that's actually the natural implementation. Mixing lock-free and blocking is a delicate exercise...

Wait-freedom is harder because there are multiple producers. For one thing, some audio backends like JACK 2, which I happen to use, have multiple audio threads. For another, I also want to use this logging mechanism for other RT work like graphics rendering, ideally installing it at as a global logger so that message output is synchronized across threads and full log crate compatibility is achieved.

As soon as you have multiple producers (or consumers, FWIW), it's common for them to race for shared resources like storage, which requires wait-based arbitration mechanisms like CAS loops. My current assumption is that because logging is infrequent and there are not that many threads, this arbitration should always be resolved in one or two CAS loop cycles and will never become a problem.

In principle, wait-freedom could be achievable by giving each producer its own storage, and having the consumer scan all producers. But this is more complex and less efficient from a memory and CPU usage PoV, so I would rather not go in this direction until I have an actual problem to solve, and a benchmark proving that doing this solves it...

(Also, keep in mind that wait-freedom is more of an abstract algorithmic property than something that can actually be guaranteed 100% in practice. Due to cache coherence, even memory load/stores aren't usually guaranteed to be wait-free at the hardware level.)

[For readers who may come across this thread]
Obviously this can be conveyed by as little as one bit of status in each log entry, indicating whether or not entries were dropped since the prior entry was queued. Other alternatives for conveying that entries were dropped are also possible, depending on the log entry format.

1 Like

So, I have started investigating the first building block, log record serialization. The fact that it's not there in the log crate yet (without even an issue about it) suggests that no one has attempted this before. But if it turns out somebody actually did, and just didn't contribute it upstream, the design discussion will give that person some time to react.

I was thinking about polling, yes.

I thought the problem was interesting, and I came up with this.

It is a ring buffer based FIFO queue that gives new messages priority over old ones. It only supports one producer and one receiver.

It uses park to handle waiting for new messages.

It doesn't drop messages when it runs out of space in the queue as currently implemented, but considering your messages need to be created without allocations, it doesn't seem like it's an issue.

1 Like

Different strategies apply to different situations. Often it is the first and last errors or alarms that are most important (i.e., "how did we get here, and where are we now?"), in which case the optimal strategy is to discard the middle entries.

Another strategy is to reset the reader's upper bound – the shared write point – to a nearby forward entry (relative to the apparently current read entry), then decimate the rest of the buffer to create space for new entries, with the write point again advancing as the decimation proceeds.

1 Like

@alice Thanks for your code! I have a few ideas too, but do not have time to code them right now, so here's a design brain dump:

  • I have the impression that you are essentially reimplementing Arc. Unless you have a clear need to do so (which I did not see in the code), I would just use the standard Arc to reduce the amount of unsafe code and make auditing easier.
  • Similarly, as an alternative to breaking down the Vec into a pointer and capacity, consider using a Box<[UnsafeCell<M>]>, again for the sake for of sticking with Rust's abstraction vocabulary as much as possible to make code review easier.
  • I really need to support multiple producers, and from the bit of time I have already spent sketching the design on paper (sadly I cannot easily share my notes because they are a little messy and written in French), I know that this will affect the algorithm quite a bit. Here are a few examples:
    • Since producers are racing with each other, pushing must become an atomic transaction. The easiest way is to switch to a linked list design (because you can just CAS the back-of-list pointer). I did find other algorithms that stick with an array-based design, but they were more complex and I couldn't get them below two CAS per push transaction. Since AFAIK pointer-chasing is much less expensive than CAS, I'd rather use a linked list.
    • Storage allocation gets messy too. If multiple producers are interested in the same storage blocks, we need an atomic reservation transaction (instead of just jumping on the first storage block as in SPSC). Further, we need to decouple allocation from publication so that a slow/hung producer cannot block the other producers waiting for their turn to publish. It is possible to avoid all this by giving each producer its own storage, but that wastes RAM, complicates the consumer (since it must now juggle with blocks from several storage arenas) and breaks compatibility with the log crate by requiring each producer to to allocate its Sender during initialization (instead of just logging right away on a shared Sender). So all things considered, I'd rather not do so.
  • I would like (though unlike multiple producers, that's not a hard requirement) to support variable-sized messages. On the producer side, the interface could look like fn try_push(&self, size: usize, write: impl FnOnce(&mut [u8])) -> Result<()>. However, that makes storage allocation even more messy in the presence of multiple producers, and in particular it makes it extremely hard to implement the "new messages overwrite old ones" policy.
  • In general, the more I try to implement the "new messages overwrite old ones" policy, the more I realize that its cost in terms of code and synchronization protocol complexity is enormous. So I seriously contemplate giving up on it and going back to the classic "queue rejects messages when full" one, especially as I also realized upon reading @TomP's message that "how did we get there" is an extremely important information, which is perhaps more important than "where we are now" if I have to choose between the two.

If you want “queue rejects messages when full” semantics, a bounded crossbeam queue should do the trick?

It's what I intend to use for initial prototyping indeed. However, as far as I know it won't support variable-sized messages, so I'll likely need to switch to a custom implementation (which I could eventually contribute to crossbeam) when I reach that point.

FWIW, this is also why I've started implementation from the angle of log serialization. There are already many lock-free queue implementations around in Rust, so I wouldn't be surprised if someone else had already coded the one I want and could jump in this thread soon. Whereas log serialization does not seem to have been explored so far so I'm less wary of duplicating existing work.

(I have a dream that one day, once const generics have matured and maybe specialization too, general-purpose synchronization libraries like crossbeam can expose generic concurrent data structures that can be configured at compile-time for a given amount of producers and consumers, maybe liveness guarantees and capacity too, and then automatically select the best implementation for the target use case)