Improving the performance of a mutex if only 1 thread accesses the mutex 99% of the time?

I want to implement a sort of thread local, global allocator (if that makes sense)


Where the index is the thread id and 99% of the time the same thread will access the same allocator. The only time when I need to lock is when I want to free some memory and I only do that in bulk.

I wasn't sure how fast/slow mutexes are if the same thread accesses them over and over, so I made a small benchmark. I was hoping that the overhead would be minimal.

Summing mutexes of integers vs summing options of integers is roughly 32x times slower. Honestly the overhead is probably negligible in my case as I don't allocate that often but I am still wondering if I can do better, especially because I have thought of this pattern many times before.

Is there mabye another abstraction that would fit better than a mutex?


I am also not sure if a mutex is ever the right choice. Can't you implement a mutex purely with atomics(CAS)?

What platform are you measuring?

At least on Linux, a mutex is basically just an atomic CAS if it's uncontended. Otherwise it might spin a little bit, then it will enter a futex syscall to wait for it.

But even in the ideal case, a mutex has to call out to the pthread library. The bare sum you're comparing can be optimized inline, probably even vectorized.

The thread_local crate seems potentially useful?

On Linux, a Mutex involves pointer chasing, even on the same thread.

How about using parking_lot?

That is roughly what I thought. The sum is there so that Rust doesn't optimize the function away.

I think it does the same thing I do

The benchmark already includes parking_lot, parking_lot::Mutex is roughly 2 times faster in my benchmark.

I think I'll just implement this myself using atomics and see how it compares to mutexes.

Only in the slow path. In the fast path where the thread id is already in the table, no atomic operations are required. (Except Ordering::Relaxed loads, which aren't "real" atomics - they turn into the same assembly as regular loads, only preventing the compiler from performing certain dangerous optimizations.)

Though I'm not sure how good the performance of that crate is. If you test it, you should benchmark against the standard library thread_local!.

Based on your mutex example, it seems like you can make some simplifying assumptions (number of threads known ahead of time, sequential thread IDs) which may make a manual implementation faster. But you really shouldn't use atomics or mutexes if you can help it. If you can provide an external guarantee that multiple threads aren't accessing the same allocator at the same time, e.g. because the worker threads are stopped or paused, then you can just use UnsafeCell.

In any case, consider padding each thread's entry in the Vec to the size of a cache line to avoid false sharing.

1 Like

FWIW, the thread_local crate was specifically motivated for use cases where thread_local! was insufficient. If you can use thread_local! then I suspect that's probably better, but @Amanieu might have other thoughts.

Did you get a chance to try this out? How did it go?

I haven't but I'll implement it today. Actually the thread_local crate is very close to what I want but I still don't understand everything that is going on. It does some pretty weird "hashing" that I haven't seen before.

ThreadLocal<T> basically works as a HashMap<ThreadId, T>: it's basically a map which uses the current thread ID as a key. It's implemented with a semi-lock-free hash table which allows very fast lookups (~4ns).

Out of curiosity where is 0x9E3779B97F4A7C15 coming from? Did you pick it randomly?

Do you think it would be possible to access a thread_local item per index/thread-id? You expose iteration with iter and iter_mut but I have something like this

struct ThreadLocalAllocator {
    thread_local: CachedThreadLocal<Allocator> ,
    allocations_to_free: SomeThreadSafeQueue<Allocation>

impl ThreadLocalAllocator {
    pub fn free_unused(&self){
         while let Some(alloc) = allocations_to_free.try_pop(){
               let allocator = self.thread_local.get_with_id(alloc.thread_id);
               //... do stuff

struct Allocation {
    thread_id: usize

I want to be able to free the correct allocation, but this means I need to be able to look it up from the "outside".

I am not even sure if that is a good approach. I have not much experience with multithreaded code.

This constant is commonly used in hash functions, you can search for it to find more examples of its use. It is based on the golden ratio.

If you're accessing an allocator from multiple threads then you will need to use a mutex.