Any real-time safe logger around?

I'm not sure if it's helpful, but there is a bump arena crate for Rust (bumpalo) that in principle gives deterministic-time allocations (with the proviso that drop is never called and the only way to recover memory is to deallocate everything).


Thanks for reminding me about this nice crate! Unfortunately, I think this is not the right problem space for it to shine:

  • Bump systematically allocates a new chunk when the original chunk is full, whereas I want allocation to fail in that case (because RT code should never call the system allocator, as that allocator may do all sorts of RT-unsafe things).
  • Bump is !Sync, whereas my message-passing use case involves multiple threads (and because RT code should never block, I can't just put a Bump in a Mutex).
  • In the case where there are multiple RT threads (which I'll need to support if I want to have those nice no-fuss global loggers) racing to insert messages into the queue, I will need out-of-order allocation and deallocation (whereas the only form of partial deallocation which could possibly be added to Bump is in-order deallocation).

At the moment, I'm prototyping my API design using bounded channels of SufficientlyAlignedNewtype([u8; SUFFICIENTLY_LARGE_MAGIC_CONST]), but once I am ready to implement hack-free variable-size message support, I think I'll probably roll out some sort of RT-safe concurrent bitmap allocator as a backing store, unless someone suggests a better idea.
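For concreteness, here is a minimal sketch of that prototyping hack (the constants and the 16-byte alignment are made up, and whether std's sync_channel is itself fully RT-safe on the try_send() path is a separate question; the real thing would use a preallocated lock-free queue):

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

// Hypothetical stand-in for the magic constant mentioned above.
const SUFFICIENTLY_LARGE_MAGIC_CONST: usize = 256;

// Force an alignment large enough for whatever payload gets smuggled in.
#[repr(align(16))]
struct SufficientlyAlignedNewtype([u8; SUFFICIENTLY_LARGE_MAGIC_CONST]);

fn main() {
    // Bounded channel: the RT side uses try_send() and treats a full
    // queue as an allocation failure instead of blocking.
    let (tx, rx) = sync_channel::<SufficientlyAlignedNewtype>(64);

    let mut msg = SufficientlyAlignedNewtype([0; SUFFICIENTLY_LARGE_MAGIC_CONST]);
    msg.0[..5].copy_from_slice(b"hello");

    match tx.try_send(msg) {
        Ok(()) => {}
        Err(TrySendError::Full(_)) => { /* drop the message, stay RT-safe */ }
        Err(TrySendError::Disconnected(_)) => { /* logger thread is gone */ }
    }

    if let Ok(received) = rx.try_recv() {
        assert_eq!(&received.0[..5], b"hello");
    }
}
```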

Yeah, I understand what you're saying about bumpalo's underlying allocation strategy. I feel like your three issues are increasingly fundamental.

  1. In my head I was wondering if there could be some variation of bumpalo that just used a static block of memory for its arena in no-alloc situations (see the sketch after this list).

  2. I can see that the issue of making bumpalo Sync in a non-blocking way is hard to solve. You would need to synchronize the incrementing of the bump pointer, so you would need to use an atomic integer, thereby not being lock-free.

  3. Maybe you could have a ringbuffer-style bump arena, where messages are allocated at one end and deallocated at the other. You could probably discard old messages by just incrementing both counters when the buffer is full. I'm not sure I understand why you would need out-of-order alloc and dealloc, but I'm probably just missing something.
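Regarding the first point, here is a minimal single-threaded sketch of what I mean (arena size made up; alignment must be a power of two, and all the lifetime/aliasing care around the returned raw pointers is omitted):

```rust
use core::cell::{Cell, UnsafeCell};

const ARENA_SIZE: usize = 4096; // hypothetical fixed budget

// A bump arena over one fixed block: no system allocator involved,
// and allocation fails (returns None) on exhaustion instead of growing.
struct StaticBump {
    storage: UnsafeCell<[u8; ARENA_SIZE]>,
    used: Cell<usize>,
}

impl StaticBump {
    const fn new() -> Self {
        StaticBump {
            storage: UnsafeCell::new([0; ARENA_SIZE]),
            used: Cell::new(0),
        }
    }

    // Returns `size` bytes at `align` alignment, or None when exhausted.
    fn alloc(&self, size: usize, align: usize) -> Option<*mut u8> {
        let base = self.storage.get() as usize;
        let start = (base + self.used.get()).checked_add(align - 1)? & !(align - 1);
        let end = start.checked_add(size)?;
        if end > base + ARENA_SIZE {
            return None; // exhausted: fail, don't call the system allocator
        }
        self.used.set(end - base);
        Some(start as *mut u8)
    }
}

fn main() {
    let arena = StaticBump::new();
    let p = arena.alloc(64, 8).expect("arena exhausted");
    unsafe { p.write(42) };
}
```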

In any case I'd be very interested in reading about whatever design you go with in the end. :slight_smile:


I agree that this is not really a fundamental limitation of bump allocator designs, but of the specific bumpalo implementation + API. We could easily imagine a variant of bumpalo that does not reallocate, or even one that does not allocate at all as you're describing.

Modifying the existing bumpalo crate to do this, however, might be a bit painful because this goes against its current documented API contract and there is little headroom for backwards-compatible API extensions.

The best I can think of would be to add a new Bump constructor with more tweaking headroom (some sort of Bump::new_with_config(BumpConfig)) that allows disabling the default allocate-on-storage-exhaustion strategy. It would work for my RT multimedia use case, and maybe it could work for the no-alloc scenario that you mentioned too, if the config builder also allowed providing an owned storage block and Bump were made generic over the kind of storage block in use?
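Something like this purely hypothetical sketch, perhaps (nothing below exists in today's bumpalo; every name and field is made up, and growth and alignment handling are left out):

```rust
// Hypothetical configuration for a Bump that is generic over its storage.
pub struct BumpConfig<S> {
    storage: S,               // caller-provided backing block
    grow_on_exhaustion: bool, // false = fail instead of allocating a chunk
}

pub struct Bump<S> {
    config: BumpConfig<S>,
    used: usize,
}

impl<S: AsMut<[u8]>> Bump<S> {
    pub fn new_with_config(config: BumpConfig<S>) -> Self {
        Bump { config, used: 0 }
    }

    pub fn try_alloc(&mut self, size: usize) -> Option<&mut [u8]> {
        let storage = self.config.storage.as_mut();
        if self.used + size > storage.len() {
            if self.config.grow_on_exhaustion {
                // bumpalo's current strategy would allocate a new chunk
                // here; this sketch does not implement growth.
                unimplemented!("chunk growth omitted");
            }
            return None; // RT-friendly failure path
        }
        let slice = &mut storage[self.used..self.used + size];
        self.used += size;
        Some(slice)
    }
}

fn main() {
    // An owned storage block also covers the no-alloc scenario.
    let config = BumpConfig { storage: [0u8; 4096], grow_on_exhaustion: false };
    let mut bump = Bump::new_with_config(config);
    assert!(bump.try_alloc(64).is_some());
}
```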

Thankfully, Rust atomic types are guaranteed to be at least lock-free, so this is not an issue. I suspect you may have confused them with their C/C++ cousins, which do not provide this guarantee? But given this property, classic bump allocation within a chunk can easily be made lock-free using AtomicUsize::fetch_add()...
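A minimal sketch of that within-chunk scheme (chunk size and setup are made up, and alignment plus memory-ordering subtleties are glossed over):

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicUsize, Ordering};

const CHUNK_SIZE: usize = 4096; // hypothetical fixed budget

// Lock-free bump allocation within one fixed chunk: every thread races
// on a single fetch_add, and each winner gets a disjoint byte range.
struct AtomicBump {
    storage: UnsafeCell<[u8; CHUNK_SIZE]>,
    next: AtomicUsize, // offset of the first free byte
}

// SAFETY: disjoint offsets mean no two threads touch the same bytes.
unsafe impl Sync for AtomicBump {}

impl AtomicBump {
    fn alloc(&self, size: usize) -> Option<*mut u8> {
        let start = self.next.fetch_add(size, Ordering::Relaxed);
        if start.checked_add(size)? > CHUNK_SIZE {
            // Chunk exhausted. (A real design must also stop `next` from
            // creeping toward overflow after many failed attempts.)
            return None;
        }
        Some(unsafe { (self.storage.get() as *mut u8).add(start) })
    }
}

fn main() {
    let bump = AtomicBump {
        storage: UnsafeCell::new([0; CHUNK_SIZE]),
        next: AtomicUsize::new(0),
    };
    assert!(bump.alloc(64).is_some());
}
```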

Now, ring buffer style storage reuse upon reaching the end of the chunk, which I'll need since I don't want to allocate more chunks at runtime like bumpalo currently does, is a bit harder to make lock-free. It will probably at least require a CAS loop, and possibly some bit golfing too, in order to check the start-of-allocated-region pointer and modify the end-of-allocated-region pointer in a single atomic transaction. That only means the design is not wait-free, though: it remains lock-free, because a thread's CAS can only fail when another thread's allocation succeeded, so the system as a whole always makes progress.
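For illustration, one possible shape of that CAS loop, packing both 32-bit offsets into a single AtomicU64 (names and sizes are made up, and this deliberately ignores the end-of-chunk wraparound problem discussed further down):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

const CAPACITY: u32 = 4096; // hypothetical chunk size

// Low 32 bits: start of the allocated region; high 32 bits: end.
// Packing both into one word lets a single CAS read the start and
// bump the end atomically ("bit golfing", as promised).
struct RingBump {
    state: AtomicU64,
}

impl RingBump {
    // Tries to bump the end offset by `size` bytes; fails if that would
    // overrun the start of the still-allocated region.
    fn try_alloc(&self, size: u32) -> Option<u32> {
        let mut current = self.state.load(Ordering::Acquire);
        loop {
            let start = (current & 0xFFFF_FFFF) as u32;
            let end = (current >> 32) as u32;
            // Offsets grow monotonically; the allocated span is end - start.
            if end.wrapping_sub(start) as u64 + size as u64 > CAPACITY as u64 {
                return None; // would overwrite live allocations
            }
            let new = ((end.wrapping_add(size) as u64) << 32) | start as u64;
            match self.state.compare_exchange_weak(
                current, new, Ordering::AcqRel, Ordering::Acquire,
            ) {
                Ok(_) => return Some(end % CAPACITY), // our block's offset
                Err(actual) => current = actual,      // lost the race, retry
            }
        }
    }
}
```

The deallocation side, which advances the start offset, is exactly where the trouble starts, as explained below.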

As always, though, the hard part is lock-free memory reclamation. I have not rigorously checked, but I suspect that freeing everything at once, as bumpalo currently does, should be easy to port to a lock-free design. However, I also need lock-free, fine-grained, out-of-order deallocation, which is a Much Harder problem, because...

Let me try to explain the problem that I have in mind, then:

  1. Two RT threads R1 and R2 ~simultaneously decide to send log messages to a logger thread L
  2. They race to allocate memory from bumpalo. Thread R1 wins the race and gets the first memory block in chunk order, then thread R2 gets the next memory block.
  3. Then they start to fill their memory allocations with the log messages that they respectively want to send to L. For some reason, thread R2 finishes first...
  4. ...and therefore, R2 inserts its message in the log queue first, followed by R1.

Now, after printing log messages, logger thread L wants to free the corresponding memory allocations so that they can be used again by R1 and R2. But the problem is that if log message allocations are freed in the order in which they are received by L, that order will be reversed with respect to the order in which R1 and R2 allocated memory. This out-of-order deallocation pattern is not compatible with ring-buffer-style arena allocation/deallocation.

I cannot think of a clean way to avoid out-of-order memory deallocation without making the design blocking for RT threads, by effectively having thread R2 wait for thread R1 to send its message at step 3. But maybe I haven't thought about the problem hard enough yet.

One unclean way would be to have logging thread L (which does not operate under RT constraints and can therefore allocate and block as much as it wants) store allocations in an unbounded reordering buffer, waiting for the next bump allocation in chunk order to come through before freeing any later ones. But I would rather not go there if possible.
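If I did go there, the logger-side reordering buffer could be as simple as a min-heap keyed by allocation offset. A sketch, with made-up offset/size bookkeeping:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Holds freed allocations until they can be released in allocation order.
struct ReorderBuffer {
    pending: BinaryHeap<Reverse<(usize, usize)>>, // (start offset, size)
    next_to_free: usize,                          // allocation-order cursor
}

impl ReorderBuffer {
    // Called once a log record has been printed. Returns how many bytes
    // can actually be handed back to the bump allocator right now.
    fn release(&mut self, start: usize, size: usize) -> usize {
        self.pending.push(Reverse((start, size)));
        let mut freed = 0;
        // Free in allocation order only: stop at the first gap.
        while let Some(&Reverse((s, len))) = self.pending.peek() {
            if s != self.next_to_free {
                break; // an earlier allocation hasn't been printed yet
            }
            self.pending.pop();
            self.next_to_free = s + len;
            freed += len;
        }
        freed
    }
}

fn main() {
    let mut buf = ReorderBuffer { pending: BinaryHeap::new(), next_to_free: 0 };
    assert_eq!(buf.release(64, 32), 0); // R2's block arrives first: held back
    assert_eq!(buf.release(0, 64), 96); // R1's block unblocks both
}
```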


As a teaser, the reason why I would spontaneously go for bitmap allocation if I don't find a better approach is that it is one of the three broad approaches I know of to storage allocation with variable-size allocation and out-of-order deallocation:

  • Linked lists (e.g. one list of free memory segments in the chunk plus one list of busy segments; allocation searches for a large enough segment in the free list via various heuristics, and deallocation finds the busy segment, puts it back in the free list, and merges it with its previous/next neighbors if possible)
  • Bitmaps (memory is divided into fixed-size blocks, and a bitvec is used to track which blocks are busy and which are free; allocation searches for a contiguous hole in the bitvec and fills it with 1s, while deallocation finds the corresponding bitvec location and sets it back to 0s)
  • Buddy algorithms (a cousin of the bitmap allocator that is optimized for faster search and more efficient management structures when allocating objects of very different sizes, at the cost of extra algorithmic complexity and slower deallocation; it basically works by maintaining multiple sparse bitmaps at various storage granularities)

I think linked lists are not a good idea for this use case because they have unbounded memory usage for nodes (in addition to poor cache locality and, for small and relatively homogeneous allocations, higher memory overhead than bitmaps with a well-chosen block size).

And the extra complexity of buddy algorithms (especially in a lock-free and allocation-free context...) does not seem justified by the logging use case, which should have somewhat homogeneous allocation sizes and should not be bottlenecked by bitmap search performance on any reasonably sized log buffer.
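For reference, a minimal single-threaded sketch of the bitmap scheme, using a single u64 word of bitmap and made-up sizes (a lock-free variant would CAS on the bitmap words instead of mutating them in place):

```rust
const NUM_BLOCKS: usize = 64; // one u64 word of bitmap, for simplicity

// One bit per fixed-size block; first-fit search for a run of zeroes.
// Assumes 1 <= blocks < 64 so the mask arithmetic below can't overflow.
struct BitmapAlloc {
    busy: u64, // bit i set = block i busy
}

impl BitmapAlloc {
    // Allocates `blocks` contiguous blocks, returning the first index.
    fn alloc(&mut self, blocks: usize) -> Option<usize> {
        let mask = (1u64 << blocks) - 1;
        for i in 0..=(NUM_BLOCKS - blocks) {
            if self.busy & (mask << i) == 0 {
                self.busy |= mask << i; // fill the hole with 1s
                return Some(i);
            }
        }
        None
    }

    // Out-of-order deallocation is trivial: set the bits back to 0s.
    fn free(&mut self, index: usize, blocks: usize) {
        let mask = (1u64 << blocks) - 1;
        self.busy &= !(mask << index);
    }
}

fn main() {
    let mut a = BitmapAlloc { busy: 0 };
    let x = a.alloc(3).unwrap();
    let _y = a.alloc(2).unwrap();
    a.free(x, 3); // out-of-order free: no problem
    assert_eq!(a.alloc(3), Some(x)); // the hole is immediately reusable
}
```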


Thanks for your responses - they are very illuminating!

Aah I see. The way I imagined it was that the logger thread occasionally checks some bit to see if the next log message is ready to be read. That way, if R2 finishes before R1, it basically won't get logged until R1 has finished (and the logger thread subsequently checks R2). So although they may not finish in order, they will be processed in order.


That sounds like an interesting line of thought; I hadn't thought about making L block on the first real-time thread that allocated memory. But I'm not sure I fully follow how you are proposing to do it. Assuming that "the logger thread checks some bit to see if the next log message is ready to be read", how do the RT threads know which is the next log message and whether it has been read or not, and therefore when that bit should be set? Knowing that the logger thread may also concurrently reset it at any moment?

Maybe let's start with a recap of where we were as of my last post. If we have a memory allocator and a log queue as two separate lock-free data structures, as in the design I currently imagine, then allocating memory for a record and inserting the record into the queue (or even reserving space in the queue for inserting a record later on) are two separate atomic transactions which may be interleaved, leading to the out-of-order deallocation problem described above.

Now, how could we fix that without switching to an allocator that supports out-of-order deallocation? Intuitively, I can think of two ways:

  • Merge allocator + log queue functionality into a single lock-free data structure where allocating memory and reserving a slot in the queue is a single atomic transaction?
    • Implies a two-phase queue push transaction, one phase to reserve a queue slot and one phase to make it visible after filling in the data, but that part is not too hard to implement (see the sketch after this list).
    • Filling a queue slot wakes up logging thread, but said thread only moves forward if the first reserved slot in the queue is now full. So far, so good.
    • However, combining so much responsibility inside of a single synchronized object is likely to make it uncomfortably complex.
    • And committing a transaction as complex as "allocate memory and reserve a queue slot" with a single hardware atomic operation sounds difficult as well; the code might get ugly and/or inefficient.
  • Or maybe the idea is to stop using a FIFO log queue entirely, in favor of a priority queue or sorted list design where the logging thread has an easy way to query and pop the enqueued log with smallest mem address?
    • Then I guess it's just a matter of checking if that mem address matches the beginning of the bump allocator's allocated region, and falling back to sleep otherwise, since the logger thread asserts full control over that "beginning of allocated region" pointer?
    • OTOH, it may be hard to implement a lock-free sorted queue without heavy CAS-ing...
    • Is this what you had in mind? Or something else?
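To make the first option's two-phase push more concrete, here is a sketch of the queue part alone (capacity and payload type are made up; the full-queue case, wraparound bookkeeping, and the allocator side are all left out, and a single consumer is assumed):

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

const QUEUE_LEN: usize = 16; // hypothetical capacity

struct Slot {
    ready: AtomicBool,    // phase 2: set once the payload is filled in
    payload: AtomicUsize, // stand-in for a real log record
}

struct TwoPhaseQueue {
    slots: [Slot; QUEUE_LEN],
    reserve_cursor: AtomicUsize, // next slot to reserve (producers)
    read_cursor: AtomicUsize,    // next slot to read (logger thread)
}

impl TwoPhaseQueue {
    fn new() -> Self {
        TwoPhaseQueue {
            slots: std::array::from_fn(|_| Slot {
                ready: AtomicBool::new(false),
                payload: AtomicUsize::new(0),
            }),
            reserve_cursor: AtomicUsize::new(0),
            read_cursor: AtomicUsize::new(0),
        }
    }

    // Producer side (RT threads): reserve, fill, then publish.
    fn push(&self, payload: usize) {
        // Phase 1: reserve a slot index atomically.
        let idx = self.reserve_cursor.fetch_add(1, Ordering::Relaxed) % QUEUE_LEN;
        // ...fill in the data at our leisure...
        self.slots[idx].payload.store(payload, Ordering::Relaxed);
        // Phase 2: make the slot visible to the consumer.
        self.slots[idx].ready.store(true, Ordering::Release);
    }

    // Consumer side: only moves forward when the *first* reserved slot
    // is full, matching the head-of-line behavior described above.
    fn try_pop(&self) -> Option<usize> {
        let idx = self.read_cursor.load(Ordering::Relaxed) % QUEUE_LEN;
        if !self.slots[idx].ready.swap(false, Ordering::Acquire) {
            return None; // head-of-line slot not published yet: wait
        }
        self.read_cursor.fetch_add(1, Ordering::Relaxed);
        Some(self.slots[idx].payload.load(Ordering::Relaxed))
    }
}

fn main() {
    let q = TwoPhaseQueue::new();
    q.push(42);
    assert_eq!(q.try_pop(), Some(42));
    assert_eq!(q.try_pop(), None);
}
```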

Ah, but wait, there's more. Of course it couldn't be so easy :wink:


So far, we assumed that if all other problems are resolved, using a bump allocator design with a single chunk and no additional allocations is as simple as reusing the original chunk in a ring buffer-like way:

[         XXXX ]
[         XXXXX] <- alloc
[X        XXXXX] <- alloc
[X         XXXX] <- free
[XX        XXXX] <- alloc
[XX         XXX] <- free

And it will indeed work as expected... as long as we are allocating fixed-size blocks that evenly divide the allocator's backing storage, as people typically do with ring buffers. However, here we're going for variable-size allocations, and that brings interesting conceptual issues of its own.


Consider this, for example:

[         XXX  ]
[XXXXXX   XXX  ] <- large alloc, can only go here!
[XXXXXXXX XXX  ] <- small alloc, choice 1?
[XXXXXX   XXXXX] <- small alloc, choice 2?

Uh. That does not look quite like a ring buffer anymore.

The "small alloc, choice 1" variant can be achieved with a cousin of the ring buffer design called the bip buffer. Note that it makes somewhat unsatisfying use of the backing storage as it can leave a hole on the right until the next buffer rotation. The wasted space will admittedly only matter when operating on a tight log buffer budget, though, so it might be one of those things that don't really need optimizing...

Another thing about the bip buffer is that it has quite a bit more state than a ring buffer (at minimum start1, end1, start2, end2 instead of just start and end), and that might make it harder to build a lock-free variant of it, since machine atomic instructions are limited to usize width... But I haven't seriously studied the problem, so I'm not sure about that.
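To illustrate the extra state, a single-threaded sketch of bip-buffer allocation (deallocation and region rotation are omitted, and the sizes in main() are picked to match the diagram above):

```rust
// Two contiguous regions instead of one: [start1, end1) is the primary
// region, [start2, end2) grows from the front after a wrap (start2 stays
// 0 in this sketch). A lock-free variant would need all four offsets to
// change atomically, which is exactly the width problem mentioned above.
struct BipBuffer {
    capacity: usize,
    start1: usize,
    end1: usize,
    start2: usize,
    end2: usize,
}

impl BipBuffer {
    // Allocates at the end of region 1 if it fits, otherwise wraps and
    // grows region 2 from the front, leaving a hole on the right
    // ("small alloc, choice 1" in the diagram above).
    fn alloc(&mut self, size: usize) -> Option<usize> {
        if self.end2 == self.start2 {
            // No wrapped region yet: try to extend region 1 first...
            if self.end1 + size <= self.capacity {
                let offset = self.end1;
                self.end1 += size;
                return Some(offset);
            }
            // ...and otherwise fall through to start region 2.
        }
        if self.end2 + size <= self.start1 {
            let offset = self.end2;
            self.end2 += size;
            return Some(offset);
        }
        None // no contiguous room in either region
    }
}

fn main() {
    // Mirrors the diagram: 15 slots, with slots 9..12 already allocated.
    let mut b = BipBuffer { capacity: 15, start1: 9, end1: 12, start2: 0, end2: 0 };
    assert_eq!(b.alloc(6), Some(0)); // "large alloc, can only go here!"
    assert_eq!(b.alloc(2), Some(6)); // "small alloc, choice 1"
}
```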

As for the optimally packed "small alloc, choice 2" pattern, it is probably out of reach for any sort of sane constant-space "start pointers and end pointers" design, as jumping back and forth between the left and right allocations would be a synchronization nightmare... and from the perspective of displaying logs in memory order, it would actually amount to inserting a log somewhere in the past! :slight_smile:

