I was just wondering which is more suitable to use for allocating large(-ish) blocks: System or Global.
Typically two 1MB blocks will be allocated when a thread starts and be freed when it terminates. ( Code here )
It allocates more as necessary ( but all are returned on thread termination ).
Maybe it doesn’t matter or make any difference?
For AMD / Intel processors, 2 MiB may be a better size: that might trigger a system allocation of a “large page”.
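For what it’s worth, on Linux a program can also ask for this explicitly. A minimal sketch, assuming Linux and the libc crate (illustration only, not part of the code under discussion):

use std::alloc::{alloc, dealloc, Layout};

fn main() {
    // 2 MiB block, 2 MiB aligned, so the kernel could back it with one huge page.
    let lay = Layout::from_size_align(2 << 20, 2 << 20).unwrap();
    unsafe {
        let p = alloc(lay);
        assert!(!p.is_null());
        // Hint that this region is a candidate for transparent huge pages.
        libc::madvise(p.cast(), lay.size(), libc::MADV_HUGEPAGE);
        dealloc(p, lay);
    }
}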
Global is configurable and the default is unspecified, so the answer is "it depends".
The reason I was thinking about this is that I was considering what possible advantages, besides being about twice as fast, a local allocator might have.
I figured it should make memory usage more predictable, in the sense that memory used by threads to perform processing is always returned to the operating system, so any residual growth would be due to the “permanent” (non-local) allocations, typically some kind of global cache of objects with a limited, controlled size. But this is only true if the memory actually IS returned to the operating system, and then I remembered System, and thought maybe I should be using that.
[ I think there might be another advantage due to memory not being fragmented by “local” allocations, but it seems a bit hard to make this precise ]
In theory decent allocators should be smart enough to cache allocations per thread, so if the concern is locality for frequent allocation churn, that shouldn't be too bad; but they can't handle fragmentation if multiple threads are allocating at the same time, and there will still be some contention, because the allocator doesn't actually know these allocations don't escape the thread.
If you know for sure these allocations won't survive the thread, you should get much better performance with a truly local allocator, though I've only ever used the simplest bump allocators for that purpose so I don't have any specific recommendations here.
Right, and my intuition says there should be some advantage in knowing which allocations can escape the thread and allocating them from different blocks. Another advantage of the “local” allocator being written in Rust is that the compiler can make optimisations: it can “see” what is going on with (say) a simple Box allocation, it knows the size and alignment, so the size class computation can be done at compile time, for example. For an allocator written in “C” this is not possible. All that said, modern general-purpose machines are so fast and have so much memory that it tends not to matter in practical terms. But perhaps for embedded applications it could.
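To illustrate the compile-time point, here is a hypothetical size-class computation as a const fn (the round-to-power-of-two scheme is made up for illustration):

/// Hypothetical size-class function: round the request up to the
/// next power of two. Being a const fn, it can be evaluated at
/// compile time when the layout is statically known.
pub const fn size_class(size: usize) -> usize {
    size.next_power_of_two()
}

fn main() {
    // For Box::new(0u64) the size is known at compile time, so the
    // size class folds to a constant (8) with no runtime work.
    const CLASS: usize = size_class(std::mem::size_of::<u64>());
    assert_eq!(CLASS, 8);
}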
Incidentally, it seems the Temp allocator, which doesn’t have any free chains, performs just about the same as the Local allocator, which does. Both are about twice as fast as allocating from Global, but I think in principle there is a further advantage in terms of avoiding long-term fragmentation.
There are definitely applications that care about allocator performance a lot: games being a stand-out example. There it's common to have per-frame allocators that free everything at the end of the frame at once, so they can afford extremely trivial allocators that are little more than "bump a pointer by the allocation size". It sounds like that might not be sufficient for your uses, though, especially as nothing allocated that way can have a Drop implementation.
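For reference, a minimal sketch of that kind of per-frame bump allocator (single-threaded, illustrative only, and ignoring Drop as noted above):

/// Minimal single-threaded bump allocator: each allocation just advances
/// an offset, and everything is freed at once by resetting the offset.
struct Bump {
    buf: Box<[u8]>,
    used: usize,
}

impl Bump {
    fn new(capacity: usize) -> Self {
        Self { buf: vec![0u8; capacity].into_boxed_slice(), used: 0 }
    }

    /// `align` must be a power of two.
    fn alloc(&mut self, size: usize, align: usize) -> Option<*mut u8> {
        // Round the current offset up to the requested alignment.
        let start = self.used.checked_add(align - 1)? & !(align - 1);
        let end = start.checked_add(size)?;
        if end > self.buf.len() {
            return None;
        }
        self.used = end;
        Some(unsafe { self.buf.as_mut_ptr().add(start) })
    }

    /// End of frame: invalidate every allocation at once.
    fn reset(&mut self) {
        self.used = 0;
    }
}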
I suspect for most workloads it's hard to specifically pinpoint the allocator as a performance issue, even if it actually is one in a diffuse way. An especially nasty hidden cost is memory locality, where poor locality hides completely inside slow pointer dereferences.
Something else for me to think about: if a program is using local allocation, it probably should be using a different Global allocator, one that doesn’t attempt the usual local allocation tricks, because they are not needed and I think are actually harmful (in this situation).
Maybe it should be a very simple Global allocator that just routes to System (although I don’t even know what System does… I am not well-informed about the whole subject ). But at any rate not jemalloc or mimalloc or whatever.
Or alternatively where allocations are known to be not local, they could be made from a specific non-local allocator.
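For reference, routing everything to System is directly supported by the standard library:

use std::alloc::System;

// Send all Global allocations straight to the system allocator,
// rather than a caching allocator such as jemalloc or mimalloc.
#[global_allocator]
static GLOBAL: System = System;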
Well I have moved on a bit in my thinking. I have now implemented “Perm”, which implements both Allocator and GlobalAlloc, so can be used to define the Global Allocator.
In which case it is essential to use System for Perm at least, or there is infinite recursion. But as I have coded it, both Local and Perm use ChainAllocator internally, so that needs to use System.
I think it is logical as well in some sense. So while there is perhaps no definitive answer, at least for my purposes the answer seems to be System. It also means the memory is definitely returned to the operating system.
Just in case it is of interest ( and for review ) here is my implementation of Perm:
static mut PA: LazyLock<ChainAllocator> = LazyLock::new(|| ChainAllocator::new());

/// Perm [`Allocator`].
#[derive(Default, Clone)]
pub struct Perm;

impl Perm {
    /// Create a Perm allocator
    pub const fn new() -> Self {
        Self {}
    }
}

unsafe impl Allocator for Perm {
    fn allocate(&self, lay: Layout) -> Result<NonNull<[u8]>, AllocError> {
        let a = &raw mut PA;
        let a = unsafe { LazyLock::<ChainAllocator>::force_mut(&mut *a) };
        a.allocate(lay)
    }

    unsafe fn deallocate(&self, p: NonNull<u8>, lay: Layout) {
        let a = &raw mut PA;
        let a = unsafe { LazyLock::<ChainAllocator>::force_mut(&mut *a) };
        a.deallocate(p, lay);
    }
}
unsafe impl GlobalAlloc for Perm {
    unsafe fn alloc(&self, lay: Layout) -> *mut u8 {
        // A global allocator must not unwind, so map failure to null
        // rather than unwrapping.
        match self.allocate(lay) {
            Ok(nn) => nn.as_ptr().cast::<u8>(),
            Err(_) => std::ptr::null_mut(),
        }
    }

    unsafe fn dealloc(&self, p: *mut u8, lay: Layout) {
        unsafe {
            let nn = NonNull::new_unchecked(p);
            self.deallocate(nn, lay);
        }
    }
}
I was a bit uneasy about the above, and now realise it is wrong… LazyLock only protects initialisation; I need a Mutex. I have changed it to this; I hope this is better!
static mut PA: Mutex<Option<ChainAllocator>> = Mutex::new(None);

unsafe impl Allocator for Perm {
    #[allow(static_mut_refs)]
    fn allocate(&self, lay: Layout) -> Result<NonNull<[u8]>, AllocError> {
        unsafe {
            let mut a = PA.lock().unwrap();
            if a.is_none() {
                *a = Some(ChainAllocator::new());
            }
            let a = a.as_mut().unwrap();
            a.allocate(lay)
        }
    }

    #[allow(static_mut_refs)]
    unsafe fn deallocate(&self, p: NonNull<u8>, lay: Layout) {
        unsafe {
            let mut a = PA.lock().unwrap();
            let a = a.as_mut().unwrap();
            a.deallocate(p, lay);
        }
    }
}
I'm not familiar with static mut other than the knowledge that it is almost impossible to use soundly. If you're using a Mutex, do you even still need the static to be mut?
I think it is necessary in order to implement a global allocator. Here it gives an example of using a Mutex:
// Change from this:
// static mut QUEUE: VecDeque<String> = VecDeque::new();
// to this:
static QUEUE: Mutex<VecDeque<String>> = Mutex::new(VecDeque::new());

fn main() {
    QUEUE.lock().unwrap().push_back(String::from("abc"));
    let first = QUEUE.lock().unwrap().pop_front();
}
although I don’t think the above will compile without an #[allow(static_mut_refs)]
[ Edit: I am wrong, it will compile ]
No I don’t. I also don’t need the unsafe blocks. Although I think #[allow(static_mut_refs)] is very unsafe! But once I take the mut out, I don’t need the directive at all. I guess it recognises that Mutex (and no mut) makes it safe, and the example will compile after all. Phew.
My revised code:

static PA: Mutex<Option<ChainAllocator>> = Mutex::new(None);

/// Perm [`Allocator`].
#[derive(Default, Clone)]
pub struct Perm;

impl Perm {
    /// Create a Perm allocator
    pub const fn new() -> Self {
        Self {}
    }

    /// Get the number of outstanding allocations.
    pub fn alloc_count() -> u64 {
        let a = PA.lock().unwrap();
        if a.is_none() {
            return 0;
        }
        a.as_ref().unwrap().alloc_count
    }
}

unsafe impl Allocator for Perm {
    fn allocate(&self, lay: Layout) -> Result<NonNull<[u8]>, AllocError> {
        let mut a = PA.lock().unwrap();
        if a.is_none() {
            *a = Some(ChainAllocator::new());
        }
        let a = a.as_mut().unwrap();
        a.allocate(lay)
    }

    unsafe fn deallocate(&self, p: NonNull<u8>, lay: Layout) {
        let mut a = PA.lock().unwrap();
        let a = a.as_mut().unwrap();
        a.deallocate(p, lay);
    }
}
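In case it is useful, usage would look something like this (hypothetical sketch; assumes the nightly allocator_api feature, and assumes ChainAllocator’s alloc_count counts outstanding allocations, as the name suggests):

#![feature(allocator_api)]

fn main() {
    // Allocate a Vec from the Perm allocator rather than from Global.
    let mut v: Vec<u8, Perm> = Vec::new_in(Perm::new());
    v.extend_from_slice(b"hello");
    // One outstanding allocation: the Vec's buffer.
    assert_eq!(Perm::alloc_count(), 1);
    drop(v);
    assert_eq!(Perm::alloc_count(), 0);
}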