The point that I can't seem to get across, despite my best efforts, is that lock contention is what matters. A multithreaded async runtime limits the number of OS threads to the number of available CPU cores, which puts an upper bound on how much contention is even possible for a sync mutex.
Let's suppose that the average duration for holding the lock is 50 µs. With 8 threads, the median wait while acquiring the lock under contention (every thread racing to acquire) will be about 200 µs: 50 µs × 8 threads / 2 to estimate the median. That's well within reason. With 1,000 threads, the median wait is 25 ms! Which is far, far worse, because there is much more contention when that many threads pile up on a single lock. In both cases, the lock serializes access to about 20,000 acquisitions per second, but with low contention any number of async tasks can still make progress with a median wait of around 200 µs. The end result is similar throughput for both systems, but lower latency for the async system with 8 threads. Contention matters.
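To make the arithmetic concrete, here is the same back-of-the-envelope estimate as a tiny function. The formula and numbers come straight from the paragraph above; `median_wait_us` is just an illustrative name, not a real API:

```rust
// Back-of-the-envelope model: with `threads` all racing for a lock that
// is held for `hold_us` microseconds, a thread waits behind roughly half
// of the others on average, so median wait ≈ hold_us * threads / 2.
fn median_wait_us(hold_us: u64, threads: u64) -> u64 {
    hold_us * threads / 2
}

fn main() {
    assert_eq!(median_wait_us(50, 8), 200);        // 8 threads: ~200 µs
    assert_eq!(median_wait_us(50, 1_000), 25_000); // 1,000 threads: ~25 ms
}
```

Same per-acquisition cost, wildly different queueing delay: the only variable that changed is the number of contenders.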
This doesn't yet explain what's "wrong" with the async mutex, but it sets the stage for what the actual concern is: tasks contending on a shared resource.
Going back to the same reference I provided before: Shared state | Tokio - An asynchronous Rust runtime
> If a large number of tasks are scheduled to execute and they all require access to the mutex, then there will be contention. On the other hand, if the current_thread runtime flavor is used, then the mutex will never be contended.
This is another way to state the same thing, but they've reduced N to its logical conclusion: 1 for the current_thread scheduler where no contention is observable.
The only way that async tasks can contend for the lock is by holding it across await points. And the only way to do that is with an async mutex. So, if you're using an async mutex because the lock can be held across the await points, you are introducing task-level contention that you cannot get with a sync mutex. And if there are a large number of async tasks, you are going to have much worse performance overall from this architectural decision.
For these reasons, the suggestion to "just use a sync mutex" is sound advice. Or at least a good default to start with. The suggestion to choose other options over the async mutex when you do need to access a shared resource over an await point is also sound, for reasons I will discuss shortly. Here are the alternatives recommended by the linked article:
- Switch to a dedicated task to manage state and use message passing.
- Shard the mutex.
- Restructure the code to avoid the mutex.
The first of these is my personal favorite. I prefer it because it's really hard to screw up, simple to implement, and it keeps control flow sequential without introducing callbacks. It's also purely async, so readers can see the await points, and there is no superfluous blocking.
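A minimal sketch of the pattern, written with std threads and channels so it runs standalone; in a real Tokio program you would spawn a task and use `tokio::sync::mpsc` for commands and `tokio::sync::oneshot` for replies. The names (`Cmd`, `run_demo`) are illustrative:

```rust
use std::sync::mpsc;
use std::thread;

// Commands the owning task understands. The embedded reply channel
// plays the role of a tokio::sync::oneshot in an async version.
enum Cmd {
    Increment,
    Get(mpsc::Sender<u64>),
}

// Spawn the dedicated owner (a plain thread standing in for an async
// task) and drive it over the channel; returns the reported counter.
fn run_demo() -> u64 {
    let (tx, rx) = mpsc::channel::<Cmd>();

    // The state lives here and only here, so no lock is needed at all.
    let owner = thread::spawn(move || {
        let mut counter: u64 = 0;
        for cmd in rx {
            match cmd {
                Cmd::Increment => counter += 1,
                Cmd::Get(reply) => {
                    let _ = reply.send(counter);
                }
            }
        }
    });

    for _ in 0..3 {
        tx.send(Cmd::Increment).unwrap();
    }
    let (reply_tx, reply_rx) = mpsc::channel();
    tx.send(Cmd::Get(reply_tx)).unwrap();
    let value = reply_rx.recv().unwrap();

    drop(tx); // closing the channel ends the owner's loop
    owner.join().unwrap();
    value
}

fn main() {
    assert_eq!(run_demo(), 3);
}
```

Because exactly one task owns the state, there is no lock to contend on at all; the channel serializes access instead, and every await point is visible at the call sites.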
Sharding the mutex further reduces contention by splitting one hot lock into several independent ones, which is good for the reasons discussed above: fewer contenders per lock means shorter waits.
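A sketch of the idea with a hypothetical `ShardedMap` (the type, shard count, and method names are all illustrative): each key hashes to one of several independent mutexes, so threads touching different shards never contend with each other.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

const SHARDS: usize = 8;

// One mutex per shard instead of one mutex for the whole map.
struct ShardedMap {
    shards: Vec<Mutex<HashMap<String, u64>>>,
}

impl ShardedMap {
    fn new() -> Self {
        Self {
            shards: (0..SHARDS).map(|_| Mutex::new(HashMap::new())).collect(),
        }
    }

    // Hash the key to pick a shard; only that shard's lock is taken.
    fn shard_for(&self, key: &str) -> &Mutex<HashMap<String, u64>> {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        &self.shards[(h.finish() as usize) % SHARDS]
    }

    fn insert(&self, key: &str, value: u64) {
        self.shard_for(key).lock().unwrap().insert(key.to_string(), value);
    }

    fn get(&self, key: &str) -> Option<u64> {
        self.shard_for(key).lock().unwrap().get(key).copied()
    }
}

fn main() {
    let map = ShardedMap::new();
    map.insert("a", 1);
    map.insert("b", 2);
    assert_eq!(map.get("a"), Some(1));
    assert_eq!(map.get("b"), Some(2));
}
```

Crates such as dashmap implement essentially this technique, so in practice you would likely reach for one of those rather than rolling your own.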
Restructuring the code is not always possible. But if you can get away with it, the benefit is that it moves the mutex into the low-contention zone bounded by OS threads instead of async tasks.
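The usual shape of that restructuring is to shrink the critical section so the guard is dropped before any await. A sketch, where `slow_operation` is a hypothetical stand-in for work you would otherwise `.await` while holding the lock:

```rust
use std::sync::Mutex;

// Stand-in for an .await point (an RPC, a disk read, etc.).
fn slow_operation(input: u64) -> u64 {
    input * 2
}

fn process(shared: &Mutex<u64>) -> u64 {
    // Keep the critical section as short as possible: copy out what
    // you need; the temporary guard is dropped at the end of this
    // statement, before the slow part runs.
    let snapshot = *shared.lock().unwrap();
    slow_operation(snapshot)
}

fn main() {
    let shared = Mutex::new(21);
    assert_eq!(process(&shared), 42);
}
```

Because the guard never spans the slow part, a plain sync mutex suffices and contention stays bounded by the thread count.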
In all cases, we are assuming a large number of async tasks (M) and a small number of OS threads (N). In an M:N scheduling runtime, it doesn't really matter to the sync mutex what M is, because contention is bound to N. But with an async mutex, contention is bound to M.
That said, it is always possible to bound contention on the async mutex to N. But doing so is silly: you would pay for the extra async task bookkeeping in the semaphore plus the inner sync mutex, when a single sync mutex on its own would see the same amount of contention without any of the async mutex's overhead. Which, hopefully, comes full circle and answers the question.