On SpinLock performance

I was experimenting with atomic types and, by pure chance, stumbled upon a strange way to improve the performance of my naive spin lock (by quite a margin, >10x on a couple of machines I tested): simply adding a sleep. I’m sure this is a known phenomenon, but I couldn’t find much about it, which is why I’m posting here.

From what I’ve read, spinlocks are sometimes frowned upon, but I wrote this as an MVP to show what I’m talking about, so, as a disclaimer, it isn’t representative of my real use case. Anyway, I’ve included my implementation and the tests I’ve been using to benchmark it.

use std::{
    cell::UnsafeCell,
    marker::PhantomData,
    ops::{Deref, DerefMut},
    sync::atomic::{AtomicBool, Ordering},
    time::Duration,
};

struct SpinLock<T> {
    inner: UnsafeCell<T>,
    locked: AtomicBool,
    with_sleep: bool,
}

impl<T> SpinLock<T> {
    pub fn new(inner: T, with_sleep: bool) -> Self {
        Self {
            inner: UnsafeCell::new(inner),
            locked: AtomicBool::new(false),
            with_sleep,
        }
    }

    pub fn lock(&self) -> LockGuard<'_, T> {
        // Spin until the swap observes the flag as previously unlocked.
        while self.locked.swap(true, Ordering::Acquire) {
            if self.with_sleep {
                // The surprising tweak: a tiny sleep between attempts.
                std::thread::sleep(Duration::from_nanos(1));
            }
            std::hint::spin_loop();
        }

        LockGuard {
            lock: self,
            _p: PhantomData,
        }
    }
}

// The lock only hands out exclusive access to T, so T: Send is enough for Sync.
unsafe impl<T> Sync for SpinLock<T> where T: Send {}

struct LockGuard<'a, T> {
    lock: &'a SpinLock<T>,
    _p: PhantomData<&'a mut T>,
}

impl<T> Deref for LockGuard<'_, T> {
    type Target = T;

    fn deref(&self) -> &Self::Target {
        unsafe { &*self.lock.inner.get() }
    }
}

impl<T> DerefMut for LockGuard<'_, T> {
    fn deref_mut(&mut self) -> &mut Self::Target {
        unsafe { &mut *self.lock.inner.get() }
    }
}

impl<T> Drop for LockGuard<'_, T> {
    fn drop(&mut self) {
        // Unlock: this Release store pairs with the Acquire swap in lock().
        self.lock.locked.store(false, Ordering::Release);
    }
}

#[cfg(test)]
mod nascar {
    use super::SpinLock;
    use std::sync::Mutex;

    const THREADS: usize = 4;
    const ITERATIONS: usize = 300000;

    #[test]
    fn spinlock_slow() {
        let q = SpinLock::new(Box::new(0), false);

        std::thread::scope(|s| {
            for _ in 0..THREADS {
                s.spawn(|| {
                    for _ in 0..ITERATIONS {
                        **q.lock() += 1;
                    }
                });
            }
        });

        assert_eq!(**q.lock(), ITERATIONS * THREADS);
    }

    #[test]
    fn spinlock_fast() {
        let q = SpinLock::new(Box::new(0), true);

        std::thread::scope(|s| {
            for _ in 0..THREADS {
                s.spawn(|| {
                    for _ in 0..ITERATIONS {
                        **q.lock() += 1;
                    }
                });
            }
        });

        assert_eq!(**q.lock(), ITERATIONS * THREADS);
    }

    #[test]
    fn mutex() {
        let q = Mutex::new(Box::new(0));

        std::thread::scope(|s| {
            for _ in 0..THREADS {
                s.spawn(|| {
                    for _ in 0..ITERATIONS {
                        **q.lock().unwrap() += 1;
                    }
                });
            }
        });

        assert_eq!(**q.lock().unwrap(), THREADS * ITERATIONS);
    }
}

Is the implementation correct, or is this a fluke? Why does that happen, and does it happen on all architectures (mine is x86_64)? If inserting a sleep makes it faster, is there a heuristic for picking the sleep duration? Is there any other mechanism that achieves a better result than sleeping? (E.g. I tried a short spin loop followed by yield_now(), but the result wasn't impressive, and also _mm_pause(), with slightly better results.)
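
For reference, the yield_now() variant I tried had roughly this shape (just a sketch; the spin budget below is an arbitrary placeholder, not the value I actually benchmarked):

use std::sync::atomic::{AtomicBool, Ordering};

// "Short spin, then yield" acquire loop; SPIN_BUDGET is arbitrary, not tuned.
const SPIN_BUDGET: usize = 100;

fn acquire_spin_then_yield(locked: &AtomicBool) {
    loop {
        for _ in 0..SPIN_BUDGET {
            if !locked.swap(true, Ordering::Acquire) {
                return; // acquired
            }
            std::hint::spin_loop();
        }
        // Still contended after the spin budget: give up the time slice
        // instead of sleeping for a fixed duration.
        std::thread::yield_now();
    }
}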

Any feedback is appreciated :slight_smile:

It's unclear what you mean by the "performance" of a spin lock. What exactly are you measuring for your "performance" goal?

The point of a spin lock is to trade some CPU cycles for lower latency: it doesn't need system calls, so it avoids the expensive context switches.

This also means a spin lock is best suited for (anticipated) low-contention scenarios and/or very short critical sections.

Adding a sleep to a spin loop defeats the whole purpose of a spin lock, making it just a (likely worse) Mutex. Typical Mutex implementations already include a little bit of spinning (as an optimization for low-contention cases), and only after that spinning fails do they invoke the system call to park the calling thread.
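
Roughly, the acquire path of such a Mutex has this shape (sketch only; a real implementation parks on a futex/wait queue and is explicitly woken by the unlocker, the park_timeout here is just a stand-in for the blocking syscall):

use std::sync::atomic::{AtomicBool, Ordering};
use std::time::Duration;

fn acquire_spin_then_block(locked: &AtomicBool) {
    // Optimistic phase: a short bounded spin for the low-contention case.
    for _ in 0..64 {
        if !locked.swap(true, Ordering::Acquire) {
            return;
        }
        std::hint::spin_loop();
    }
    // Contended phase: stop burning CPU and let the OS block the thread.
    while locked.swap(true, Ordering::Acquire) {
        std::thread::park_timeout(Duration::from_micros(50));
    }
}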

Your test case is not the best use for spin locks (or even a Mutex); it is better served by atomic operations and/or lock-free concurrent algorithms.
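
For the shared counter in your tests, for example, a single atomic is enough (a sketch, reusing your constants):

use std::sync::atomic::{AtomicUsize, Ordering};

const THREADS: usize = 4;
const ITERATIONS: usize = 300_000;

fn atomic_counter() {
    // Each increment is a single fetch_add, so no lock (and no guard) is needed.
    let q = AtomicUsize::new(0);

    std::thread::scope(|s| {
        for _ in 0..THREADS {
            s.spawn(|| {
                for _ in 0..ITERATIONS {
                    q.fetch_add(1, Ordering::Relaxed);
                }
            });
        }
    });

    assert_eq!(q.load(Ordering::Relaxed), THREADS * ITERATIONS);
}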
