NUMA-Aware Memory Allocation

Hello!

I’m interested in allocating memory to the current NUMA node to speed up memory access.

Does Rust already allocate memory to the process or thread’s current NUMA node? Or do I need to use something like libnuma-sys to achieve this?

I’ve used the C functions exposed by the libnuma-sys crate for this purpose before, but I wasn’t sure if maybe Rust did this type of allocation by default.

Thanks!

Rust does not do anything in particular for NUMA. It makes allocations through jemalloc by default on most targets, with the option to use the system allocator (e.g. glibc's malloc) instead.

With Rust being a performance-minded language, I’m kind of surprised it doesn’t do this out of the box.

Is it an issue of supporting it across all platforms? Or does NUMA-aware allocation not provide as much benefit as I’ve been led to believe?

Does any language do this out of the box?

I don’t have much tuning experience in this area, but it seems to me that NUMA-aware allocation wouldn’t actually help much unless you also pin your threads to specific NUMA nodes. I expect this would be a real performance benefit if you set it up correctly, but I don’t know that the language can/should do it for you.

1 Like

Good point.

I’m still fairly new to Rust, and I wanted to make sure it wasn’t already doing some of these optimizations under the hood before I jumped in and started adding them to code myself.

Thank you for the help!

I agree with @cuviper - NUMA awareness is something one is more likely to see in a language with a (relatively) heavy runtime; e.g. some JVMs have NUMA-aware GC implementations (which makes sense, because the runtime controls the GC threads and their work).

Rust is something you can build a similar runtime with, however. For example, Seastar (http://seastar.io) is a C++ application framework that takes control of a lot of the machine’s resources - as part of that, its memory allocator (and its sharding infrastructure in general) is NUMA-aware.

1 Like

I’ve gotten improvements of 40% by making memory-bandwidth-limited code NUMA-aware. The easiest solution, if it works for your problem, is to use multiple processes (one per NUMA node) and either launch them under numactl, or run under a VM that’s tied to a single node (like sufficiently small AWS instances). Handling NUMA binding by hand is quite a bit of work.

Fortunately, new allocations tend to land on the same NUMA node as the core running the thread. So by keeping most allocations thread-local, you’ll likely get a decent NUMA hit rate. I’ve had good luck running thread pools and preallocating large working buffers stored in thread_locals. I suspect using libnuma-sys to bind each thread to a particular node when creating the thread pool would help even more.
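The thread_local buffer pattern described above might look like this (a minimal sketch with illustrative names; the buffer size and the summing workload are placeholders for real per-thread work):

```rust
use std::cell::RefCell;

thread_local! {
    // Per-thread scratch buffer, allocated once on each thread's first
    // access. Under Linux's default first-touch policy, its pages end up
    // on the NUMA node of the core the thread was running on at that time.
    static SCRATCH: RefCell<Vec<u8>> = RefCell::new(vec![0u8; 4 * 1024 * 1024]);
}

fn process_chunk(data: &[u8]) -> u64 {
    SCRATCH.with(|buf| {
        let mut buf = buf.borrow_mut();
        // Work happens in the thread-local buffer, so the hot memory
        // stays node-local instead of bouncing between nodes.
        buf[..data.len()].copy_from_slice(data);
        buf[..data.len()].iter().map(|&b| b as u64).sum()
    })
}

fn main() {
    let handles: Vec<_> = (0..4u64)
        .map(|i| std::thread::spawn(move || process_chunk(&[i as u8; 1024])))
        .collect();
    for h in handles {
        println!("{}", h.join().unwrap());
    }
}
```

The key point is that the buffer is allocated and touched by the thread that will use it, rather than being allocated centrally and handed out.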

Allocations are NUMA-local because Rust’s default allocator, jemalloc, uses per-thread arenas. And Linux memory allocations (when jemalloc mmaps in more memory) tend to be attached to the active core’s NUMA node. Be aware this isn’t true if the Linux kernel has its NUMA policy set to interleave. Setting the NUMA policy to interleave can help when the working set of memory can’t be NUMA-local, and many databases recommend interleaving for that reason.

3 Likes

> I agree with @cuviper - NUMA awareness is something one is more likely to see in a language with a (relatively) heavy runtime; e.g. some JVMs have NUMA-aware GC implementations (which makes sense, because the runtime controls the GC threads and their work).

NUMA always needs support from the application.
Automatic NUMA support in the operating system or the runtime doesn’t work well.

1 Like