I have a server application that needs to serialize messages into slices of bytes of different lengths, and periodically retry sending those messages until they either expire or are acknowledged. I've identified this area of my code as a performance bottleneck and am looking for ways to improve beyond having a global allocation for each message slice.
These messages are serialized once and then enqueued to be sent to multiple clients, though not every client gets every message. Ideally each client's representation on the server would have a `VecDeque` containing a collection of `Rc`s or some sort of refcounted representations of a serialized message (a `[u8]` of variable length) that is waiting to be (re)sent.
My first pass sketch of a data structure for this would be:
- Create a big `[u8]` page/chunk to push messages to as subslices
- For each pushed message, provide an `Rc` or `Rc`-like handle to that subslice of the chunk
  - This refcount is shared for the entire chunk, not just the subslice
- When all of the references are dropped (no subslice has an active reference anymore), return that chunk to a free list
- If there isn't room in the currently active chunk, either pull one off the free list or allocate a new one (see the sketch below)
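Roughly, in safe Rust, the shape I'm imagining is something like this (all names are placeholders, not real APIs; each client's `VecDeque` would hold the handles):

```rust
use std::cell::RefCell;
use std::ops::Range;
use std::rc::Rc;

struct Pool {
    free: RefCell<Vec<Vec<u8>>>, // free list of recycled chunk buffers
    chunk_size: usize,
}

impl Pool {
    // Pull a buffer off the free list, or allocate a new one.
    fn get_buf(&self) -> Vec<u8> {
        self.free
            .borrow_mut()
            .pop()
            .unwrap_or_else(|| Vec::with_capacity(self.chunk_size))
    }
}

struct Chunk {
    buf: Vec<u8>,
    pool: Rc<Pool>,
}

impl Drop for Chunk {
    // Runs when the last handle to this chunk is dropped: recycle the
    // buffer instead of freeing it. (The small Rc header itself still
    // goes through the global allocator.)
    fn drop(&mut self) {
        let mut buf = std::mem::take(&mut self.buf);
        buf.clear();
        self.pool.free.borrow_mut().push(buf);
    }
}

// The Rc-like handle: the refcount lives in the Rc around the whole
// chunk, shared by every subslice handle.
struct MsgHandle {
    chunk: Rc<Chunk>,
    range: Range<usize>,
}

impl MsgHandle {
    fn bytes(&self) -> &[u8] {
        &self.chunk.buf[self.range.clone()]
    }
}
```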
With that in mind, some questions:
- Does a crate like this already exist? I'm aware of some bump allocators, but they work differently from what I have in mind.
- Could I just do this with `Rc`? I found this thread about creating subslices. Can an `Rc` also be changed to have custom drop behavior (e.g. returning the chunk to a free list rather than truly dropping it)?
The `bytes` crate has many of the things you're looking for. However, there is no built-in way to maintain a free list of chunks, that I can see anyway.
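For example, the cheap refcounted subslicing works like this (a minimal sketch):

```rust
use bytes::Bytes;

fn main() {
    // One backing buffer; slices are refcount bumps, not copies.
    let chunk = Bytes::from(vec![1u8, 2, 3, 4, 5, 6]);
    let msg_a = chunk.slice(0..3);
    let msg_b = chunk.slice(3..6);
    assert_eq!(&msg_a[..], &[1, 2, 3]);
    assert_eq!(&msg_b[..], &[4, 5, 6]);
    // The backing buffer is freed only once chunk, msg_a, and msg_b
    // have all been dropped.
}
```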
Yeah, I've looked at the `Bytes` data structure, which offers the subslicing capabilities but not chunk freeing and reuse. There's a point at which I would have enough chunks allocated for the lifetime of the program and wouldn't need to allocate any more, so being able to get to that state would be nice for predictability, especially if I could profile and dial in good initial allocation sizes.
Something to keep in mind is that what you're asking for is a kind of memory allocator — every memory allocator is in the business of taking blocks of unused memory and slicing off pieces for separate use — and therefore even an ideal implementation of it won’t necessarily be an improvement over using the default allocator (or a different global allocator crate).
Specialized allocators like bump allocators can be faster by doing less work, but it’s not clear that there would be less work in this case.
You might be best off exploring alternative global allocators rather than looking for a specialized allocator.
That's true, but in this case I do think there's potential benefit in locality. All of these messages will be accessed semi-sequentially for sending, so storing them sequentially in blocks seems advantageous over arbitrary global allocation.
Maybe, maybe not. Perhaps start by using the global allocator, then profile to determine whether allocation is a significant performance issue?
I have. As I mentioned in the OP, I've identified this area of my code as a performance bottleneck and am looking for ways to improve beyond having a global allocation for each message slice. That's why I'm looking for steps forward here. If I'm going to build/use something, it makes sense to me to take advantage of locality in the process.
Are the slices of varying size? If so, any allocation scheme you invent for this will have to deal with memory fragmentation. I would give the jemalloc and mimalloc global allocators a try first to see if they help. Both are very simple to add to a Rust program: a single new dependency and a two-line change. Cheap and easy to test.
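For example, the mimalloc version looks like this (jemalloc is analogous via the tikv-jemallocator crate):

```rust
// Cargo.toml: add `mimalloc = "0.1"` under [dependencies].
use mimalloc::MiMalloc;

// Route every heap allocation in the program through mimalloc.
#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;

fn main() {
    // All Vec/Box/Rc allocations below now use mimalloc.
    let _buf: Vec<u8> = Vec::with_capacity(1024);
}
```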
Otherwise you need to look into which aspect of the current implementation is performing badly. Is it a pointer-chasing issue? Is it that the memory allocations are slow? Are CPU cores fighting over the cache lines that contain the reference counters for `Arc`? For all of these there are relevant performance counters to measure with `perf` on Linux (other platforms also have solutions, but I don't know much about them).
Without a deeper understanding of how the current code works and why it is slow, all we can do is guess as to what the proper solution is.
I appreciate the XY problem solving, but this is straying a little from the help that I'm asking for. 
I'd like to compare a pooled storage structure against the global allocator in this case, following the sketch I laid out above. I could certainly also compare against jemalloc or mimalloc. My main questions are about whether something like this currently exists as a crate (pooled/paged storage for varying-sized slices), and/or whether `Rc`'s drop handling behavior could be overridden to support writing something along these lines.
Then the idea of a freelist will work against you, because reusing blocks from it won't guarantee they will be sequential.
As others have said, allocating varying-sized slices is effectively an allocator's job; the only difference is that allocators are normally global and request memory from the OS, while you want to give them a fixed-size slice to work with. Modern allocators like jemalloc and mimalloc already do pooling internally for commonly sized allocations, if that's what you wanted.
`Rc` supports a custom allocator on nightly, so you could use that to customize the behavior for when the `Rc` would want to deallocate its backing storage. Not sure if the requirements of the `Allocator` trait fit your pool idea, though.
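A minimal sketch of that nightly shape, assuming a hypothetical `PoolAlloc` that just delegates to the global allocator (`deallocate` is where a real pool would return the block to a free list):

```rust
#![feature(allocator_api)]

use std::alloc::{AllocError, Allocator, Global, Layout};
use std::ptr::NonNull;
use std::rc::Rc;

// Hypothetical allocator: it delegates to the global allocator, but
// `deallocate` is the hook where a pool could reclaim the block.
#[derive(Clone, Copy)]
struct PoolAlloc;

unsafe impl Allocator for PoolAlloc {
    fn allocate(&self, layout: Layout) -> Result<NonNull<[u8]>, AllocError> {
        // A real pool would carve this out of a pre-allocated chunk.
        Global.allocate(layout)
    }

    unsafe fn deallocate(&self, ptr: NonNull<u8>, layout: Layout) {
        // A real pool would push this block onto a free list instead.
        unsafe { Global.deallocate(ptr, layout) }
    }
}

fn main() {
    // When the last handle drops, `PoolAlloc::deallocate` runs.
    let msg = Rc::new_in([1u8, 2, 3], PoolAlloc);
    let _another_handle = Rc::clone(&msg);
}
```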