I’ve gotten improvements of 40% by making memory-bandwidth-limited code NUMA aware. The easiest solution, if it works for your problem, is to use multiple processes (one per NUMA node) and either launch them under numactl, or run each one in a VM that’s tied to a single node (as with small enough AWS instances). Handling NUMA binding by hand is quite a bit of work.
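A minimal sketch of the per-node process approach, assuming two nodes and a hypothetical `./worker` binary with a `--shard` flag (both are made up for illustration; detect the real node count with `numactl --hardware`):

```rust
// Build a launch command that pins one worker process to one NUMA node.
// The worker binary name and its --shard flag are assumptions.
fn numactl_argv(node: usize) -> Vec<String> {
    vec![
        "numactl".into(),
        format!("--cpunodebind={node}"), // run only on this node's cores
        format!("--membind={node}"),     // allocate only this node's memory
        "./worker".into(),               // hypothetical worker binary
        format!("--shard={node}"),       // hypothetical sharding flag
    ]
}

fn main() {
    // Assumed node count; query it at runtime in real code.
    for node in 0..2 {
        println!("{}", numactl_argv(node).join(" "));
        // Real launch would be roughly:
        // std::process::Command::new("numactl").args(&numactl_argv(node)[1..]).spawn()
    }
}
```

With `--cpunodebind` and `--membind` set to the same node, each worker’s threads and allocations stay node-local without the application itself knowing anything about NUMA.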
Fortunately, new allocations tend to land on the same NUMA node as the core running the thread. So if you keep most allocations thread-local, you’ll likely get decent NUMA locality. I’ve had good luck running thread pools and preallocating large working buffers stored in thread_locals. I suspect using libnuma-sys to bind each thread to a particular node when creating the thread pool would help even more.
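The thread-local preallocated-buffer pattern looks roughly like this (buffer size and the workload are made up; this is a sketch, not the exact code I ran):

```rust
use std::cell::RefCell;
use std::thread;

thread_local! {
    // Large scratch buffer, allocated lazily by each thread on first use.
    // Because the owning thread allocates and first writes it, its pages
    // tend to land on that thread's NUMA node. 1 MiB is an arbitrary size.
    static SCRATCH: RefCell<Vec<u8>> = RefCell::new(vec![0u8; 1 << 20]);
}

fn process_chunk(data: &[u8]) -> u64 {
    SCRATCH.with(|buf| {
        let mut buf = buf.borrow_mut();
        // The hot path works entirely inside the thread-local buffer,
        // so there is no cross-thread (and likely no cross-node) traffic.
        let n = data.len().min(buf.len());
        buf[..n].copy_from_slice(&data[..n]);
        buf[..n].iter().map(|&b| b as u64).sum()
    })
}

fn main() {
    let handles: Vec<_> = (0..4)
        .map(|i| thread::spawn(move || process_chunk(&vec![i as u8; 1024])))
        .collect();
    let total: u64 = handles.into_iter().map(|h| h.join().unwrap()).sum();
    println!("{total}"); // (0 + 1 + 2 + 3) * 1024 = 6144
}
```

In a real pool you’d reuse long-lived worker threads (so each buffer is allocated once per thread, not once per task) rather than spawning threads per job as this toy example does.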
Allocations end up NUMA-local because Rust’s jemalloc allocator uses per-thread arenas by default, and when jemalloc mmaps in more memory, Linux tends to back those pages with memory attached to the active core’s NUMA node. Be aware this isn’t true if the Linux kernel has its NUMA policy set to interleave. Setting the NUMA policy to interleave can help if the working set of memory can’t be made NUMA-local, and many databases recommend it.
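The underlying mechanism is Linux’s first-touch behavior: fresh pages get physical memory on the node of the core that first writes them. A sketch of the resulting rule of thumb, with each worker allocating and initializing its own working set instead of receiving a buffer prefilled by the main thread (the buffer size and trivial workload are placeholders):

```rust
use std::thread;

// Each worker allocates and zeroes its own buffer. The first write
// ("first touch") is what places the pages, so absent an interleave
// policy they should end up local to the node this thread runs on.
fn worker(id: u64) -> u64 {
    let local: Vec<u64> = vec![0; 1 << 16]; // touched by this thread
    local.iter().sum::<u64>() + id // placeholder work over the buffer
}

fn main() {
    // Anti-pattern to avoid: allocating one big buffer here on the main
    // thread and handing slices to workers; all its pages would be
    // first-touched on the main thread's node.
    let handles: Vec<_> = (0..4).map(|id| thread::spawn(move || worker(id))).collect();
    let total: u64 = handles.into_iter().map(|h| h.join().unwrap()).sum();
    println!("{total}"); // 0 + 1 + 2 + 3 = 6
}
```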