Which mutex to use? parking_lot or std::sync?

Continuing the discussion from Help me choose a Mutex/RwLock (parking_lot?):

Continuing the discussion from Things in std with strong alternatives:

See the link in @quinedot's post above for tracking issue #93740.

With all that information I'm still unsure what to do in my own programs. Use std::sync::Mutex or parking_lot::Mutex?

In the async case where I need to hold a guard across an await point, I will resort to an asynchronous mutex of course, such as tokio::sync::Mutex.

But what if I lock the mutex only for a short time, or in non-async programs? I'm particularly interested in cases where I use the mutex like a Send + Sync variant of RefCell, i.e. I know the mutex will never actually block, but I need it to satisfy the compiler (without resorting to unsafe). Something like this:

#[tokio::main]
async fn main() {
    let x = std::sync::Arc::new(std::sync::Mutex::new(0));
    let x2 = x.clone();
    tokio::spawn(async move {
        *x2.lock().unwrap() += 1;
        tokio::task::yield_now().await;
        *x2.lock().unwrap() += 1;
    }).await.unwrap();
    println!("{}", x.lock().unwrap());
}

(Playground)

This works like a RefCell, but if I actually used RefCell, I couldn't use tokio::spawn and would have to use a local task set instead:

#[tokio::main]
async fn main() {
    let x = std::sync::Arc::new(std::cell::RefCell::new(0));
    let x2 = x.clone();
    tokio::task::LocalSet::new().run_until(async move {
        tokio::task::spawn_local(async move {
            *x2.borrow_mut() += 1;
            tokio::task::yield_now().await;
            *x2.borrow_mut() += 1;
        }).await.unwrap();
    }).await;
    println!("{}", x.borrow());
}

(Playground)

I wonder how big the overhead of std::sync::Mutex and its lock-poisoning check is. Is it negligible even when I write a lot of values this way (e.g. pushing many elements into a Vec behind a mutex, acquiring the lock for each push)?
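
To make the pattern concrete, this is roughly what I mean (the names are just for illustration):

use std::sync::Mutex;

// Lock once per pushed element rather than once per batch, so every push pays
// for the lock acquisition plus the poison check.
fn collect_into(out: &Mutex<Vec<u64>>, values: impl Iterator<Item = u64>) {
    for v in values {
        out.lock().unwrap().push(v);
    }
}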


Side note: rethinking this, the mutex also serves as a memory barrier, i.e. it synchronizes memory when my future runs on different CPUs. (So even if I wanted to use unsafe, I couldn't simply replace it with UnsafeCell; I'd still have to take care of synchronization.)
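
In other words, if I dropped the mutex and reached for unsafe, I'd still need something like release/acquire atomics to get the happens-before guarantee the mutex gives me for free. A rough sketch with a plain atomic counter instead of UnsafeCell:

use std::sync::atomic::{AtomicU64, Ordering};

static COUNTER: AtomicU64 = AtomicU64::new(0);

// A release increment paired with an acquire load establishes a happens-before
// relationship, analogous to what a mutex's unlock/lock pair provides.
fn bump() {
    COUNTER.fetch_add(1, Ordering::Release);
}

fn read() -> u64 {
    COUNTER.load(Ordering::Acquire)
}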

On Windows, std's Mutex ends up calling AcquireSRWLockExclusive. Its documentation has some details, but not enough to assess performance, and searching the internet at large was not especially revealing. The description in Windows Internals is promising: it sounds like acquire/release is an atomic operation in user mode, which should make them fast.

The Tokio documentation tends to favour std::sync::Mutex, especially for use cases like short lock times.

std's Mutex on Windows seems like a great choice for the use you have in mind.


If you haven't read it, you might get something out of the blog post introducing the parking lot abstraction to WebKit.

IIRC, at least the WebKit version was intentionally designed to be efficient for the "short critical section, low contention" case, but I'm not sure how closely the parking_lot crate sticks to that design.


parking_lot naturally has a bunch more features than std's mutex, even with std's recent improvements, so use it if you need them. Talking about what performance benefits there may or may not be is entirely pointless without benchmarks. Here's a simple one:

use std::ops::DerefMut;
use std::sync;
use std::time::Instant;
fn main() {
    let parallelism = std::thread::available_parallelism().unwrap().into();
    let now = Instant::now();
    mutex_bench::<sync::Mutex<_>>(1, 10_000_000);
    println!("std uncontended: {:?}", now.elapsed());
    let now = Instant::now();
    mutex_bench::<parking_lot::Mutex<_>>(1, 10_000_000);
    println!("parking_lot uncontended: {:?}", now.elapsed());
    let now = Instant::now();
    mutex_bench::<sync::Mutex<_>>(parallelism, 10_000_000 / parallelism);
    println!("std medium contention: {:?}", now.elapsed());
    let now = Instant::now();
    mutex_bench::<parking_lot::Mutex<_>>(parallelism, 10_000_000 / parallelism);
    println!("parking_lot medium contention: {:?}", now.elapsed());
    let now = Instant::now();
    mutex_bench::<sync::Mutex<_>>(500, 20_000);
    println!("std high contention: {:?}", now.elapsed());
    let now = Instant::now();
    mutex_bench::<parking_lot::Mutex<_>>(500, 20_000);
    println!("parking_lot high contention: {:?}", now.elapsed());
}

// Spawn `threads` scoped threads that each take the lock and increment the
// shared counter `count` times.
fn mutex_bench<M: Mutex<usize>>(threads: usize, count: usize) {
    let mutex = M::new(0usize);
    std::thread::scope(|s| {
        for _ in 0..threads {
            s.spawn(|| {
                for _ in 0..count {
                    let mut guard = mutex.lock();
                    *guard += 1;
                }
            });
        }
    });
    assert_eq!(*mutex.lock(), threads * count);
}

// Minimal abstraction over both mutex types so the benchmark can be written
// once; the std implementation unwraps poisoning in `lock`.
trait Mutex<T>: Sync + Sized {
    type Guard<'a>: DerefMut<Target = T> + 'a
    where
        Self: 'a;
    fn new(t: T) -> Self;
    fn lock(&self) -> Self::Guard<'_>;
}
impl<T: Send> Mutex<T> for sync::Mutex<T> {
    type Guard<'a> = sync::MutexGuard<'a, T> where Self: 'a;
    fn new(t: T) -> Self {
        sync::Mutex::new(t)
    }
    fn lock(&self) -> Self::Guard<'_> {
        self.lock().unwrap()
    }
}
impl<T: Send> Mutex<T> for parking_lot::Mutex<T> {
    type Guard<'a> = parking_lot::MutexGuard<'a, T> where Self: 'a;
    fn new(t: T) -> Self {
        parking_lot::Mutex::new(t)
    }
    fn lock(&self) -> Self::Guard<'_> {
        self.lock()
    }
}

Running it on my Windows machine (4 cores, 8 threads, x64) I got this:

std uncontended: 146.9767ms
parking_lot uncontended: 120.4104ms
std medium contention: 194.4102ms
parking_lot medium contention: 526.2664ms
std high contention: 205.6215ms
parking_lot high contention: 353.115ms

On my machine, parking_lot's better if the mutex is uncontended, but std wins otherwise.

edit:
Running the same benchmark on my Raspberry Pi 4B running 32-bit Linux (4 cores, 4 threads, target armv7-unknown-linux-musleabihf) I got this:

std uncontended: 537.478562ms
parking_lot uncontended: 520.8173ms
std medium contention: 1.24173564s
parking_lot medium contention: 788.189293ms
std high contention: 1.440051111s
parking_lot high contention: 39.99078841s

Under normal conditions, parking_lot significantly outperforms std's mutex here, but it hits a pathological case with a very high number of threads.


At this point in time, my decision tree is roughly

  • By default, use std's.
    The OS provides a reasonable Mutex, and using the platform Mutex can offer debuggability benefits.
    • e.g. debuggers are more likely to understand what an OS Mutex deadlock looks like than parking_lot's.
    • e.g. the platform thread scheduler probably cooperates better with the native Mutex.
    • e.g. std ships debugger visualizations for std types, but non-std libraries cannot do so yet.
  • Be aware that the choice can leak to your downstream, so switching isn't trivially semver-compatible.
    • e.g. parking_lot's (Ref)UnwindSafe impls (and probably those of other auto traits) sometimes differ from std's.
    • If these OIBIT differences matter, use the version which works for your use case.
  • If you need mapped lock guards, probably use parking_lot.
    • The alternative is yoking std's lock guards or finding a crate which provides mapped guards for std.
  • If consistent contended lock behavior across platforms is meaningfully relevant, consider parking_lot.
    • If you need a mutex which is guaranteed to be {eventually fair, fair, reentrant}, use parking_lot.
  • When using parking_lot, strongly consider whether you want to re-provide poisoning on top (see the sketch after this list).
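
For that last point, re-providing poisoning on top of parking_lot amounts to recording whether a panic happened while the lock was held. A rough sketch, with the PoisonMutex type and its closure-based API made up for illustration:

use parking_lot::Mutex;
use std::sync::atomic::{AtomicBool, Ordering};

struct PoisonMutex<T> {
    poisoned: AtomicBool,
    inner: Mutex<T>,
}

impl<T> PoisonMutex<T> {
    fn new(t: T) -> Self {
        Self { poisoned: AtomicBool::new(false), inner: Mutex::new(t) }
    }

    // Run `f` with the lock held; if `f` panics, mark the mutex as poisoned.
    fn with_lock<R>(&self, f: impl FnOnce(&mut T) -> R) -> Result<R, &'static str> {
        if self.poisoned.load(Ordering::Acquire) {
            return Err("poisoned");
        }
        let mut guard = self.inner.lock();
        // Drop guard that flags poisoning if we unwind out of `f`.
        struct PoisonOnPanic<'a>(&'a AtomicBool);
        impl Drop for PoisonOnPanic<'_> {
            fn drop(&mut self) {
                if std::thread::panicking() {
                    self.0.store(true, Ordering::Release);
                }
            }
        }
        let flag = PoisonOnPanic(&self.poisoned);
        let result = f(&mut *guard);
        drop(flag);
        Ok(result)
    }
}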

With no context, I'd guess this is due to parking_lot's eventual-fairness algorithm causing issues when combined with the Pi's thread scheduler juggling the oversubscribed cores.


As I tend to write either massively-parallel code (with threads interacting only very rarely) or go all the way async, I hardly ever need to seriously consider contention in my code, and when I do, it often turns out there's a better solution than a potentially contended Mutex.

Anyway, my – accordingly less sophisticated – decision tree was also largely "use std by default, parking_lot for mapped guards", so that definitely seems reasonable.

So I would say there isn't a clear winner in those cases, performance-wise?

I likely will never have that case, as I would use thread pools (which is generally what should be done instead of starting a lot of OS threads, right?).

Interesting point!

Another thing is that a std::sync::MutexGuard is !Send, while parking_lot::MutexGuard may be Send if the send_guard feature is enabled (which may also be enabled by a dependency). I remember Tokio explicitly suppresses leaking these properties by using PhantomData fields, see Tokio PR #4359.

I guess when you use std::sync::Mutex and you're being overly cautious, you might opt out of RefUnwindSafe-ness with a PhantomData<RefCell<()>>, so that later dropping RefUnwindSafe-ness by moving to a lighter lock would not be a semver-incompatible change?
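
Something like this, if I understand the idea correctly (Shared is just a placeholder type):

use std::cell::RefCell;
use std::marker::PhantomData;

// The PhantomData<RefCell<()>> field is !RefUnwindSafe, so the whole struct
// is too, even though std::sync::Mutex itself is RefUnwindSafe.
pub struct Shared {
    inner: std::sync::Mutex<u64>,
    _not_ref_unwind_safe: PhantomData<RefCell<()>>,
}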

Until now I didn't even know what mapped lock guards are. If I understand correctly, they basically limit access to a sub-component of the locked data. I don't think I will usually need this, but I'll keep it in mind.
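
For reference, this is roughly what a mapped guard does, using parking_lot's MutexGuard::map (the Config type is made up for illustration):

use parking_lot::{MappedMutexGuard, Mutex, MutexGuard};

struct Config {
    name: String,
    retries: u32,
}

// Hand out a guard that only exposes the `name` field; the whole mutex stays
// locked until the mapped guard is dropped.
fn lock_name(config: &Mutex<Config>) -> MappedMutexGuard<'_, String> {
    MutexGuard::map(config.lock(), |c| &mut c.name)
}

fn main() {
    let config = Mutex::new(Config { name: "demo".into(), retries: 3 });
    lock_name(&config).push_str("-v2");
    assert_eq!(config.lock().name, "demo-v2");
}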

Yeah, I guess the missing poisoning is a double-edged sword.

I think I will favor std::sync::Mutex more often now. Maybe a reasonable rule for me could be:

  • Use std::sync::Mutex by default, unless
    • other features/properties are needed, or
    • a different implementation gives a specific, measurable (and needed) performance gain for a particular platform and use case, and the program is meant to run (solely) on that platform.

That depends. If you expect access to the mutex to be mostly uncontended, then parking_lot is the clear winner. Otherwise, results vary based on platform and workload; do benchmarking if performance is an issue.

A thread pool is just a managed collection of OS threads, so you can run into all of the same problems. For example, Tokio is perfectly happy to spawn 500 threads to run blocking tasks. You will still likely never hit this case, unless you suddenly decide it's a good idea to spawn a whole bunch of blocking tasks that all fight over the same mutex, which is a bad idea for more reasons than just that. Regardless, even if you're using spin locks, mutexes generally don't cause bottlenecks as long as you keep critical sections short and contention low.

This is a pretty sane set of rules and is about what I use.

... however, spawn_blocking's documentation specifically calls this out and notes that Tokio's blocking thread pool is intended for tasks that primarily block on OS calls (e.g. blocking IO) rather than for CPU-bound work.

In short, Tokio's spawn_blocking should be treated roughly like spawning an OS thread. The one improvement is that when a blocking task finishes, the thread sticks around for a while to be reused (instead of a new thread being created for the next task) before it is eventually released.

If you're doing CPU work, Tokio suggests using a CPU-focused executor such as rayon. (You can actually spawn tasks directly onto rayon's thread pool without using the parallel iterators, though this is comparatively rarely used.) Doing so keeps you to a reasonable number of CPU-bound threads, though when using blocking thread pools you do of course need to keep in mind that if a thread in the pool blocks, it can't yield its time back to the executor.
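
For example, handing CPU-bound work to rayon from async code and getting the result back via a oneshot channel might look like this (a sketch assuming the rayon and tokio crates; fib is just a stand-in for real work):

// Run a CPU-bound computation on rayon's thread pool and await the result.
async fn fib_on_rayon(n: u64) -> u64 {
    let (tx, rx) = tokio::sync::oneshot::channel();
    rayon::spawn(move || {
        // Runs on rayon's pool, off the async executor's worker threads.
        let _ = tx.send(fib(n));
    });
    rx.await.expect("rayon task panicked or dropped the sender")
}

fn fib(n: u64) -> u64 {
    if n < 2 { n } else { fib(n - 1) + fib(n - 2) }
}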


One more case where parking_lot is better: if you have a tree/graph structure with enormous quantities of rarely contended locks (e.g. a JavaScript+DOM heap), its one-byte lock state plus the global parking table carries a meaningful memory win.
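
A quick way to see the size difference (exact numbers depend on platform and std version):

fn main() {
    // parking_lot::Mutex keeps its state in a single byte next to the data;
    // std::sync::Mutex carries whatever the platform implementation needs.
    println!("std:         {} bytes", std::mem::size_of::<std::sync::Mutex<u8>>());
    println!("parking_lot: {} bytes", std::mem::size_of::<parking_lot::Mutex<u8>>());
}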
