What makes an async mutex more expensive than a sync mutex?

The Tokio docs recommend using a sync mutex where feasible, for better performance. Meanwhile, the docs also mention that the async mutex uses a sync mutex internally.

One question is: the std sync mutex guard is not Send, but the tokio async mutex guard is. This implies one can lock a tokio async mutex on one thread and unlock it on another, which seems impossible with the std sync mutex. Does this mean the tokio async mutex uses a specially crafted sync mutex that is not the same as the std sync mutex?

Another question is: what exactly makes an async mutex more expensive than a sync mutex? After all, performance seems to be the only reason the Tokio docs prefer a sync mutex over reaching for an async mutex everywhere in async programming.

1 Like

You can look at the code to check the hypothesis. In short, Tokio uses a standard Mutex to protect the semaphore waitlist inside its async Mutex. Its MutexGuard doesn't wrap the standard MutexGuard, which is only held for a very short period as an internal detail of the semaphore implementation.
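For intuition, here is a heavily simplified sketch of that layering (the type names and fields are illustrative, not Tokio's actual definitions):

use std::cell::UnsafeCell;
use std::collections::VecDeque;
use std::sync::atomic::AtomicUsize;
use std::sync::Mutex as StdMutex;
use std::task::Waker;

// Simplified, illustrative layout only; the real tokio::sync::Mutex and its
// semaphore differ in many details.
struct Semaphore {
    // Fast path: an atomic permit counter, no std mutex touched at all.
    permits: AtomicUsize,
    // Slow path: the waitlist of parked wakers is protected by a *std* mutex,
    // which is only held for a very short, non-awaiting critical section.
    waiters: StdMutex<VecDeque<Waker>>,
}

struct AsyncMutex<T> {
    semaphore: Semaphore,
    data: UnsafeCell<T>,
}

The point is that the std Mutex only guards bookkeeping on the slow path; it never wraps the user's data and is never held while a task is suspended.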

Given that the standard Mutex exists inside Tokio's Mutex, the expense can only be additive. It can't somehow be faster than what it depends on.

As for why the sync Mutex is recommended at all, it's because if you only need to hold a lock for a few hundred microseconds and it is not held across an await, then it can't really be considered "blocking". And then you don't even have to pay for all of the semaphore waitlist stuff, either. See: Shared state | Tokio - An asynchronous Rust runtime
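For illustration, a minimal sketch of that pattern (the counter and the sleep are made-up stand-ins, and this assumes Tokio with the time feature enabled):

use std::sync::{Arc, Mutex};
use std::time::Duration;

async fn bump(counter: Arc<Mutex<u64>>) {
    // Lock, mutate, and release in a tight scope; the guard never lives
    // across an .await, so the short critical section is not "blocking".
    let current = {
        let mut guard = counter.lock().unwrap();
        *guard += 1;
        *guard
    }; // guard dropped here

    // Only await once the lock has been released.
    tokio::time::sleep(Duration::from_millis(current)).await;
}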

There is still one caveat, though: WASM. You cannot block the main thread, even with a Mutex. This happened to a user a while back: [WASM + Winit + Wgpu] Weird error about waiting on the main thread despite not waiting on the main thread - help - The Rust Programming Language Forum (rust-lang.org)

8 Likes

While this is true, it doesn't necessarily mean the expense is strictly or much higher.

  1. The sync mutex is used for the semaphore waitlist, but if the semaphore can be acquired immediately, the waitlist does not have to be consulted and incurs no extra expense. In that case the two are largely equivalent and users pay little or no extra cost for the async version. If one argues the async version is much more expensive to acquire, I'd like to see example programs with profiling data to support that argument.
  2. On the other hand, when the semaphore cannot be acquired immediately, yielding is much safer than blocking, because blocking may introduce a deadlock. That can easily lead to bugs when a programmer later inserts an await inside the mutex scope without realizing it.
  3. I still don't have an intuition for why the std mutex guard is not Send while the tokio mutex guard is, although I now see the tokio guard is a customized one. I know some other non-std sync mutex guards are Send, so this is not async-specific, but the low-level implementation details behind Send vs. non-Send are not very clear to me.

1 Like

You'll have to forgive me, but I haven't seen any claims that the cost is much higher. Just that it follows that having a mutex anywhere means that mutual exclusion is happening (by definition) and therefore all of the usual suspects apply WRT contention and whatnot.

Same, but I haven't seen any such claims nor data to support them.

Can you elaborate on this? I can't tell if you are speaking of the use of a sync Mutex within the async semaphore, or a hypothetical situation involving a user deciding to use a sync Mutex in their own async code. For the former, the Mutex is well-abstracted and callers do not have direct access to it. We can also trust that the Tokio maintainers are aware of the pitfalls and can navigate them safely.

It's just because various platforms have certain thread-safety requirements, and Rust's standard library (abstracting over many platforms) needs to address these concerns with a lowest common denominator: namely, if any platform does not allow a mutex to be unlocked on a thread other than the one that locked it, then none of them can. More details on this aspect here:

Example of a type that is not Send? - help - The Rust Programming Language Forum (rust-lang.org)

For a more realistic example, it turns out that MutexGuard is not Send, since for the definition of MutexGuard to be maximally portable, it needs to support the POSIX API of pthreads, which states that a lock can only be released from within the thread where it was created.
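To make the Send difference concrete, here is a small sketch (assuming a multi-threaded Tokio runtime; names are made up):

use std::sync::Arc;
use tokio::sync::Mutex;

async fn hold_across_await(shared: Arc<Mutex<u64>>) {
    let mut guard = shared.lock().await;
    *guard += 1;
    // The tokio MutexGuard is Send, so the future remains Send even though
    // the lock is held across this suspension point...
    tokio::task::yield_now().await;
    // ...and the task may resume (and release the lock) on another thread.
    *guard += 1;
}

fn spawn_it(shared: Arc<Mutex<u64>>) {
    // tokio::spawn requires a Send future; this compiles.
    let _handle = tokio::spawn(hold_across_await(shared));
}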

1 Like

For this one I mean the latter, because the Tokio devs recommend using a sync Mutex when possible. One problem I can see is that while a sync Mutex may be fine in the current code, another developer, or you yourself 30 days in the future, may insert an await within the scope of the mutex guard, causing the sync Mutex to be held across an await. The compiler won't always catch this bug, and then you have weird deadlocks that can be hard to debug.
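A sketch of that failure mode, with made-up names:

use std::sync::{Arc, Mutex};

async fn risky(shared: Arc<Mutex<Vec<u8>>>) {
    let mut guard = shared.lock().unwrap();
    guard.push(1);
    // A later change inserts an .await while the std guard is still alive:
    tokio::task::yield_now().await;
    guard.push(2);
    // If this future is passed to tokio::spawn, the compiler rejects it,
    // because std::sync::MutexGuard is !Send. With spawn_local (or any
    // context that doesn't require Send) it compiles, and blocking the
    // worker thread here can deadlock other tasks needing the same lock.
}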

Overall, I'm on the side of favoring async locks over sync ones in async programming. So when I see people suggest otherwise (especially async framework devs), I would like to see the theory behind it and real profiling data to support the argument.

The best info on the subject is the Tokio page I linked earlier, plus this one: Async: What is blocking? – Alice Ryhl. The basis of the argument is that using a sync mutex does not necessarily prevent the async runtime from making progress on its tasks. It does require some discipline, but the same could be said of many things in programming.

There isn't much to show in the way of profiling, since it would only demonstrate things that are already well known: mutexes are relatively inexpensive when not under contention. But you can show that holding a lock for tens or hundreds of microseconds between any two await points does not block the task in any meaningful way. Just compare it to any other operation that takes roughly the same amount of time, like parsing a large JSON blob and doing all of the allocations and some other transformations. None of that would be considered blocking. Neither completes instantaneously, but crucially, neither prevents the executor from polling futures.

If so, there is no big difference between sync and async mutexes when not under contention, because both are inexpensive anyway. And under contention, sync mutexes block while async mutexes yield. Why not prefer yielding? Either way, sync mutexes don't give much of an advantage, and async mutexes look much safer.

The point that I can't seem to get across, despite my best efforts, is that lock contention is what matters. A multithreaded async runtime limits the number of OS threads to available CPU cores, meaning there is an upper bound to how much contention is even plausible for the sync mutex.

Let's suppose that the rough average duration for holding a lock is 50 µs. With 8 threads, the median wait while acquiring the lock under contention (every thread racing to acquire) will be about 200 µs: 50 µs * 8 threads / 2 to estimate the median. That's well within reason. With 1,000 threads, the median wait is 25 ms! That is far, far worse, because there is much more contention from many threads waiting on a single lock. In both cases the lock serializes access to about 20,000 acquisitions per second, but with low contention any number of async tasks still make progress, with a median wait of around 200 µs. The end result is similar throughput for both systems, but lower latency for the async system with 8 threads. Contention matters.
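If you want to play with the numbers, here is the same back-of-the-envelope arithmetic as a tiny program (the 50 µs hold time is still just an assumption, nothing measured):

fn main() {
    let hold_us = 50.0_f64; // assumed average time a lock is held, in µs

    for contenders in [8_u32, 1_000] {
        // Rough median wait when every contender races for the same lock.
        let median_wait_us = hold_us * f64::from(contenders) / 2.0;
        // The lock serializes access at roughly 1 / hold time, regardless
        // of how many threads are waiting.
        let acquisitions_per_sec = 1_000_000.0 / hold_us;
        println!(
            "{contenders} contenders: ~{median_wait_us} µs median wait, \
             ~{acquisitions_per_sec} acquisitions/sec"
        );
    }
}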

This doesn't yet explain what's "wrong" with the async mutex, but it sets the stage for what the actual concern is: tasks contending on a shared resource.

Going back to the same reference I provided before: Shared state | Tokio - An asynchronous Rust runtime

If a large number of tasks are scheduled to execute and they all require access to the mutex, then there will be contention. On the other hand, if the current_thread runtime flavor is used, then the mutex will never be contended.

This is another way to state the same thing, but they've reduced N to its logical conclusion: 1 for the current_thread scheduler where no contention is observable.

The only way that async tasks can contend for the lock is by holding it across await points. And the only way to do that is with an async mutex. So, if you're using an async mutex because the lock can be held across the await points, you are introducing task-level contention that you cannot get with a sync mutex. And if there are a large number of async tasks, you are going to have much worse performance overall from this architectural decision.
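As a sketch (the task count and the sleep are arbitrary): every task below holds the Tokio lock across an .await, so contention scales with the number of tasks (M) rather than the number of worker threads (N).

use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Mutex;

async fn spawn_many(shared: Arc<Mutex<u64>>) {
    for _ in 0..10_000 {
        let shared = Arc::clone(&shared);
        tokio::spawn(async move {
            // The lock is held across an .await, so all 10,000 tasks
            // queue up on the async mutex, not just the worker threads.
            let mut guard = shared.lock().await;
            tokio::time::sleep(Duration::from_micros(50)).await;
            *guard += 1;
        });
    }
}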

For these reasons, the suggestion to "just use a sync mutex" is sound advice. Or at least a good default to start with. The suggestion to choose other options over the async mutex when you do need to access a shared resource over an await point is also sound, for reasons I will discuss shortly. Here are the alternatives recommended by the linked article:

  • Switching to a dedicated task to manage state and use message passing.
  • Shard the mutex.
  • Restructure the code to avoid the mutex.

The first of these is personally my favorite [1]. The reason I prefer it is because it's really hard to screw it up, super simple to implement, and it keeps control flow sequential without introducing callbacks. It's also purely async, so that readers can see the await points, and there is no superfluous blocking.
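Here is a minimal sketch of that pattern (the Command type, channel size, and counter are illustrative, not from the article): one task owns the state outright and everyone else talks to it over a channel, so no mutex is needed at all.

use tokio::sync::{mpsc, oneshot};

enum Command {
    Increment,
    Get(oneshot::Sender<u64>),
}

fn spawn_counter() -> mpsc::Sender<Command> {
    let (tx, mut rx) = mpsc::channel(32);
    tokio::spawn(async move {
        // The state lives in this one task; no lock, no contention on the data.
        let mut count: u64 = 0;
        while let Some(cmd) = rx.recv().await {
            match cmd {
                Command::Increment => count += 1,
                Command::Get(reply) => {
                    let _ = reply.send(count);
                }
            }
        }
    });
    tx
}

async fn example() {
    let tx = spawn_counter();
    tx.send(Command::Increment).await.unwrap();

    let (reply_tx, reply_rx) = oneshot::channel();
    tx.send(Command::Get(reply_tx)).await.unwrap();
    assert_eq!(reply_rx.await.unwrap(), 1);
}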

Sharding the mutex is a way to further reduce contention, which is good for reasons that should be obvious, having discussed contention above.
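A rough sketch of sharding (the shard count and hashing scheme are arbitrary): the key picks one of several independent locks, so contention is spread across them.

use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

const SHARDS: usize = 16;

struct ShardedMap<K, V> {
    shards: Vec<Mutex<HashMap<K, V>>>,
}

impl<K: Hash + Eq, V> ShardedMap<K, V> {
    fn new() -> Self {
        Self {
            shards: (0..SHARDS).map(|_| Mutex::new(HashMap::new())).collect(),
        }
    }

    fn insert(&self, key: K, value: V) {
        // Hash the key to pick a shard; each shard has its own lock.
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        let shard = (hasher.finish() as usize) % SHARDS;
        self.shards[shard].lock().unwrap().insert(key, value);
    }
}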

Restructuring the code is not always possible. But if you can get away with it, the benefit is that it moves the mutex into the low contention zone of OS threads instead of async tasks.

In all cases, we are assuming a large number of async tasks (M) and a small number of OS threads (N). In an M:N scheduling runtime, it doesn't really matter to the sync mutex what M is, because contention is bound to N. But with an async mutex, contention is bound to M.

That said, it is always possible to bound contention on the async mutex to N. But doing so is silly, because you would pay for the additional async task bookkeeping in the semaphore plus the inner sync mutex, where a single plain sync mutex would have the same amount of contention without the extra overhead of the async mutex. Hopefully that comes full circle and answers the question.


  1. This is bespoke Communicating sequential processes (Wikipedia), which is closely related to actor systems. ↩︎

3 Likes

In addition to what @parasyte said, I would like to add that you can turn on a lint for that by putting #![warn(must_not_suspend)] at the top of your lib.rs file. MutexGuard is annotated with #[must_not_suspend], which is currently an unstable feature. However, note that this lint is not warn-by-default right now because it's not perfect yet; notably, there are some false positives. Nevertheless, it is still a useful lint to enable if you're going to have "risky" code like that.

EDIT: to clarify, I’m suggesting putting the following:

#![cfg_attr(feature = "nightly", feature(must_not_suspend))]
#![cfg_attr(feature = "nightly", warn(must_not_suspend))]

So you can check this lint when using a nightly compiler by enabling the "nightly" feature.

2 Likes

Is this only for the browser, or does it apply to WASI and embedded-in-an-application WASM too?

Good callout for clarification! I believe the issue is limited to browsers. There is a more detailed discussion here: Lack of atomic.wait on the main thread seems limiting to a fault · Issue #106 · WebAssembly/threads · GitHub

This is good to know, thanks.

I see taesookim's point, but I think that the use of mutexes has to be very carefully considered. That is life with concurrent code. Adding an await while a mutex is held is something that must be done consciously.

So for my part, I would much rather get a compile error, or even a runtime error, than mistakenly introduce a mutex held across an await. It also helps when reading the code to know which mutexes are intended to be held across awaits (the async ones) and which are not (the sync ones). The two have very different behavior.

@parasyte has already made some excellent points and linked to the relevant articles of mine, but it's also worth pointing out that a sync mutex can do things that the async mutex can't. For example, you can use it in destructors.
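For example (a sketch with made-up types): Drop is synchronous, so you cannot .await an async lock there, but a short-lived sync lock is fine.

use std::sync::{Arc, Mutex};

struct Worker {
    completed_jobs: Arc<Mutex<u64>>,
}

impl Drop for Worker {
    fn drop(&mut self) {
        // Drop is not async: you can't .await a tokio::sync::Mutex here,
        // but a short-lived std lock is perfectly fine.
        *self.completed_jobs.lock().unwrap() += 1;
    }
}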

4 Likes

(The lock here is not held across await points.) You didn't take suspended async tasks into account. Let's say we have 8000 async tasks evenly distributed across 8 threads, each acquiring the same mutex. In your M:N model, M = 8000 and N = 8. When using a sync mutex, the first batch of 8 tasks (those at the head of the task queue on each thread) does proceed quickly, but those at the tail have a much longer wait. They are not waiting on the lock, because they are not even scheduled to run, but isn't that latency too? Overall, a sync mutex doesn't provide shorter latency on average (that is, across all tasks) than an async mutex.

One thing you might argue is that a sync mutex has a shorter waitlist. In this example, a sync mutex will have no more than 8 contenders at once, while an async mutex may have 8000. But the waitlist data structure is a queue, and I doubt that the queue length has a substantial effect on the speed of its push/pop operations.

(The lock here is held across await points.) Yes, this is task-level contention, and it can only be synchronized with async locks. But you said it performs much worse than thread-level contention? I question that, given that you didn't provide experimental data, and I can hardly find articles advocating thread-level synchronization primitives in async environments. In async programming, a task is the basic execution unit; if synchronization between tasks is discouraged, why even invent async locks?

Very interesting: I found this, and it's for Rust:

I didn't read it in full, but I think the fact that we don't have async drop doesn't mean we can't have it.

Maybe, but it's not coming any time soon.

I believe you are rephrasing the last paragraph in my last reply. Yes, there is still latency because a mutex always serializes access.

You have much longer wait times with a much longer queue. More latency in aggregate. You can verify this claim by waiting in line at a supermarket for 5 minutes because 3 people are ahead of you, and waiting in line for 2 hours at a theme park because 300 people are ahead of you.

I provided some basic arithmetic giving a rough estimate of the effect that contention has on latency. I also provided a link to an article that shows similar data (with graphs and a full description of the observations) for how contention affects locks. The article is not specific to async workloads, but that’s ok. The only difference is that the tasks are cooperatively scheduled instead of preemptively scheduled. Everything else applies.

Sometimes it's the only thing you can use. But it is good to know the trade-offs, in case you can redesign the code to work without an async mutex. Consider it motivation to make improvements.
