Implementing a memory allocator in Rust

For some random reason I started thinking about this. For good performance it seems you want to avoid excessive thread synchronisation, in other words allocate locally and perhaps free asynchronously. I was wondering if Rust async could be useful here. Have you any thoughts on this matter?

I don't personally see how unless you have some clever non-standard usage in mind, which is always possible. The biggest hurdle is probably setting up an executor that the allocator can use without necessarily having access to a full allocator itself. You'll also need to either fire off a separate thread to run the executor or somehow advance its execution during your alloc and free calls.

I suspect it would be simpler to define the incremental workload in terms of bespoke data structures, which will also make it easier to analyze and prove that enough work is being done to keep the amount of unreclaimed free memory in check.

2 Likes

Allocating for concurrent programs is what jemalloc is all about. You might like to check it out. https://jemalloc.net/

I am aware of jemalloc and mimalloc ( perhaps I should have stated that).

A common approach to this is hazard pointers - very neat stuff to play around with.

I dunno how you would use async, but maybe implementing the code that sub-allocates or frees to the system allocator could be phrased in terms of an async function, but it seems like it would be pretty awkward.

1 Like

I guess without actually starting to write some code it isn't obvious to me what problems would come up. But you don't need to be able to free memory to "get going" (other than to add it to some kind of message queue or list) so I don't see "boot-strap" as being problematic.

What I very vaguely have in mind is an allocator written in Rust that is somewhat "tuneable" according to the demands of an application, perhaps with special features for special situations. It is very vague! Maybe I will start playing around and see what can be done.

First thing I am wondering about, if I have some kind of large block of memory ( maybe obtained from the OS allocator ), and I allocate a small chunk from it, later on, when it is time to free the small chunk, how do I find the block it was allocated from - can I use some kind of pointer arithmetic to find a block header, or what....

( Looking at mimalloc I found this:

/* -----------------------------------------------------------
  The following functions are to reliably find the segment or
  block that encompasses any pointer p (or NULL if it is not
  in any of our segments).
  We maintain a bitmap of all memory with 1 bit per MI_SEGMENT_SIZE (64MiB)
  set to 1 if it contains the segment meta data.
----------------------------------------------------------- */

Well, my mental model of how things might work is like this:

(1) Each thread has its own page or pages for allocations of different sizes.
(2) Each thread collects up deallocations for batch processing.
(3) Deallocations made by other threads are sent to the correct thread (somehow added in batches to some list, probably involves synchronisation).
(4) Each thread check its input list of deallocations periodically.

Async will not be useful. Async is not a special sauce. It's a syntax sugar and an abstraction for writing state machines.

Everything that async does can be done just as well or better by implementing the underlying logic by hand.

Allocators are so low-level, and performance critical, that to make something competitive you will need to avoid overheads and indirections added by async abstractions.

And I don't see where you would benefit from complex state machines and async allocation. The OS APIs for memory management are all blocking. The high level interfaces for the allocator are blocking.

Async adds extra work for management of active futures, cost of recursive polling and branching on pending states, allocating wakers and sending signals from them. That itself is a lot more work than allocators do to allocate a new block of memory.

5 Likes

The term "async" in programming context usually have 2 slightly different meanings.

  1. The operation may not complete immediately, and you want to do something more after it completed. The async syntax belongs to this category.

  2. The operation may not complete immediately, but you don't care about its completeness. All you need to know is that it will eventually complete and that's enough. That async free mechanism might belongs to here.

Since they don't share same purpose async syntax might not be helpful for async freeing.

1 Like

You're mixing a couple of concepts here: async blocks and functions are effectively state machines, (each state being an await point and the locals live at that point), while the tracking of futures, allocation, etc, are all part of an async runtime.

Neither strictly needs much if at all overhead: async can safely hold self-references which can be more efficient than equivalent manual safe code, and you can implement a runtime without any allocation or complex logic if you only have block_on, no spawn (this is pollster).

1 Like

When a high performance memory allocator frees memory it is not in general freed at once, the freed memory needs to be queued up, somehow routed to the right place, so there is some quite complex processing that needs to happen, before the memory is eventually fully freed and available for re-allocation. But I only just started thinking about it, so I only have a rather vague idea of what needs to happen.

See for example: https://www.microsoft.com/en-us/research/uploads/prod/2019/06/mimalloc-tr-v1.pdf

First, it uses three page-local sharded free lists to increase locality, avoid contention, and support a highly-tuned allocate and free fast path. These free lists also support temporal cadence, which allows the allocator to predictably leave the fast path for regular maintenance tasks such as supporting deferred freeing, handling frees from non-local threads, etc.

[ Actually mimalloc is probably a bit simpler than the potential scheme I have in mind, which involves async tasks to do things like sorting the memory to be freed into pages and maybe sorting it into address order, or something.... ! I think using linked lists might be sub-optimal, maybe ]

Users of the allocator don't need to wait until the memory is returned to the OS. You can't even have an async method for this, because it would be very prone to deadlocks - the Future waiting for one deallocation to be returned to the OS may also own some other allocation that prevents it from happening.

Deferring of work is not the same thing as an async api (async/await is not the only way of running things asynchronously).
These syscalls are blocking, and maintenance work takes CPU time, so moving them out of the critical path of code using the allocator requires use of threads. If you try to do this with async instead, without adding threads, then you will still have this thread-blocking behavior on the same thread, on top of overhead of async used to trigger it. If you use a multi-threaded async executor for this, you at best end up with the same solution, but slower by all the overhead of handling of futures, polling, and an executor.

Overhead of async is small on the scale of network I/O, or processing of queries in a complex database engine running in another process, where the unit of work is large, and the delay until results are ready is long enough to do many other things in the meantime.
However, on the scale of work done by memory allocators, cost of async is absolutely enormous. The hot path for a memory allocator can be only a handful of carefully tuned CPU instructions that find a right size bucket and bump a pointer. This is so little work that you can't beat it with async. Just the cost of calling an async function (an empty one that does nothing at all) can be much larger than that. An async executor will likely even require performing allocations to enqueue a Future and to create a Waker. You can't make an allocator faster by adding two more allocations to every one allocation, and a half a dozen extra function calls to deliver results of a single function call.

One thing big and slow enough in memory management is memory paged to disk. That is a real I/O, and theoretically a smart executor could prioritize such futures to minimize trashing. However, that requires very specialized APIs and OS integration, which aren't typical allocator APIs.

2 Likes

What I have in mind is processing say 100 or 1,000 or 10,000 deallocations at a time using a set of threads with async work-stealing, or something vaguely like that.

I just found this which looks like a good starting point:

Edit: Hmm, his code doesn't seem to compile for windows, but I was able to get past this after a while, simple test :

[package]
name = "alloc"
version = "0.1.0"
edition = "2021"

[dependencies]
lazy_static = "1.4.0"
libc = "0.2.155"
page_size = "0.6.0"

windows-sys = { version = "0.52.0", features = [ "Win32", "Win32_System", "Win32_System_Memory", "Win32_Foundation" ] }


use std::alloc::GlobalAlloc;
use std::alloc::Layout;
use std::cmp::max;
use std::ptr;
#[cfg(windows)]
use windows_sys::Win32::System::Memory as libc;

struct SimpleAllocator {}

unsafe impl GlobalAlloc for SimpleAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let aligned_layout = match layout.align_to(max(layout.align(), page_size::get())) {
            Ok(l) => l.pad_to_align(),
            Err(_) => return ptr::null_mut(),
        };
        #[cfg(windows)]
        {
            let addr = libc::VirtualAlloc(
                ptr::null_mut(),
                aligned_layout.size(),
                libc::MEM_COMMIT | libc::MEM_RESERVE,
                libc::PAGE_READWRITE,
            );
            addr as _
        }
        #[cfg(unix)]
        {
            let addr = libc::mmap(
                ptr::null_mut(),
                aligned_layout.size(),
                libc::PROT_READ | libc::PROT_WRITE,
                libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
                -1,
                0,
            );
            addr as _
        }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        if let Ok(aligned) = layout.align_to(max(layout.align(), page_size::get())) {
            #[cfg(windows)]
            libc::VirtualFree(ptr as _, aligned.pad_to_align().size(), libc::MEM_RELEASE);
            #[cfg(not(windows))]
            libc::munmap(ptr as _, aligned.pad_to_align().size());
        }
    }
}

#[global_allocator]
static ALLOCATOR: SimpleAllocator = SimpleAllocator {};

#[cfg(test)]
mod tests {
    // use super::*;

    #[test]
    fn it_works() {
        println!("page size is {}", page_size::get());
        let x = Box::new(99);
        println!("x={}", x);
    }
}
1 Like

Note that this is the rayon model of computation; you have a set of threads which asynchronously steal work items from each other. This doesn't involve async Rust; it's purely sync Rust, and is ideal for this workload; while you have a batch of work items to handle, actually handling them involves no CPU-idle wait time, since you're either in user-space handling your accounting structures, or in kernel-space handling hardware and/or kernel accounting structures for the memory, never letting the CPU go idle while you wait.

2 Likes

I am trying to think how this might work. I have in mind there are some "management tasks" which get sent messages of two types:

"Send me a N free memory locations of size M"

"Here is a list of free memory locations".

The first type of message is processed by a task that decides which managing task should be asked to supply the memory (one that isn't currently busy, and perhaps meets some other criteria ).

The second type of message is processed by some kind of sorting task, which sorts the free memory locations out and forwards them onto the tasks managing the affected memory regions.

Quite when "new fresh unfragmented" memory should be used to satisfy requests and when "old memory fragments" should be used, I am not sure how that could be decided, I suppose some kind of heuristic based on what is available in what quantity.

The total number of "management tasks" could be quite large, for every actual thread, I can imagine as many as 50 tasks involved in the memory supply (and disposal). Maybe. Or maybe you can do it with a lot less tasks and use threads rather than async tasks.

[ The requesting thread would "order" new memory based on past history, in other words if it uses 500 blocks per milli-second of size 1kb, it will pre-order memory so that it is available before it is needed, based on recent request patterns. If it turns out excess memory was ordered it can be freed without being actually used. ]

This sounds like a recipe for a bad allocator; allocation has an inherent program-order dependency on the return from the allocator, and thus has to wait for the allocator to finish allocating memory, so you need allocation to be fast.

In contrast, when you free memory, you're telling the allocator that you no longer want this block of memory any more, and you can continue executing while the allocator does whatever it does to free up that memory - given sufficient address space and memory, it's valid to have an allocator where free is a no-op.

This means that you want those management tasks to be out of the way in the allocation path - you want allocation to be cheap (ideally just giving you a chunk from a free list), with all the complication coming in when memory is freed.

Well, did you read my last paragraph about "pre-ordering" memory so it is available when needed? Mind you, whether all the extra complication involved would pay off in the end, only testing would tell - I certainly have my doubts. To be honest, just now, I am not seriously thinking about implementing anything, but on the other hand, I could change my mind if it seemed like a fun project.

I did, but that sounds like it's reacting to common allocator requests, rather than being the significant mode of allocation; you'd want to treat any case where an allocation request is not met by "pre-ordered" memory as a bug, not as "well, you don't have enough history to pre-order memory yet".

Underlying this is that allocation is always on the critical path for program runtime, whenever you allocate; freeing is not. As a result, anything you do has to allocate swiftly in the absence of any history to enable "pre-ordering", but it can use that history to work out what it should do while freeing.

1 Like