Mutex Poisoning: why, and how to recover

psionic12 · February 23, 2022, 1:13pm

I'm a little confused about the concept of mutex poisoning, and didn't quite understand even after reading the Rustonomicon.

According to the docs, if I understand correctly, a mutex can become poisoned if another thread panics while holding the lock. But if there's a panic, the whole program terminates, so why should we care?

And what's more, in practice, what should I do with the result returned from lock()? Currently I just unwrap() it, but I follow a code style that asks to use unwrap as little as possible, so how could I recover if lock fails?

And I found that the crate parking_lot does not fail when locking; what do they do?

alice · February 23, 2022, 1:28pm

The poisoning is a signal that informs you that a thread panicked while holding the lock, and that your mutex contents may be broken (due to your own code not properly handling panics). The error that lock() returns also contains the mutex guard, so you can ignore the poisoning by extracting the guard from the poison error.

In parking_lot, they simply unlock it normally when you panic while holding it.

17cupsofcoffee · February 23, 2022, 1:37pm

The whole program terminating is only the case for the main thread (and assuming there's no catch_unwind). If a spawned thread panics, the process will continue and you'll get the panic info returned as an Err from join.
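A minimal sketch of that behavior, assuming the default panic = "unwind" setting (with panic = "abort" the whole process would stop instead). The function name is made up for illustration:

```rust
use std::thread;

/// Spawns a thread that panics and reports whether `join` returned an `Err`.
/// (Hypothetical helper, named only for this illustration.)
fn spawned_panic_is_contained() -> bool {
    let handle = thread::spawn(|| {
        panic!("worker failed");
    });
    // `join` yields Err(payload) if the spawned thread panicked.
    handle.join().is_err()
}

fn main() {
    let contained = spawned_panic_is_contained();
    // Reaching this line shows that the process kept running.
    println!("worker panicked: {}", contained);
}
```

The panic message of the worker still shows up on stderr, but only that thread is torn down.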

SkiFire13 · February 23, 2022, 1:37pm

That's not true, only the thread in which the panic occurs is terminated.

The Err variant of the Result returned by lock() contains a PoisonError<MutexGuard> which allows you to get a MutexGuard anyway. Knowing this, you can fix whatever needs fixing in case another thread panicked while holding the lock. If you don't care to handle that case then you can just do .unwrap_or_else(|e| e.into_inner()). Note that .unwrap() is not that bad of a choice either: if another thread panicked then you probably have something wrong going on, so propagating the panic can be a wise choice.
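As a rough illustration of the .unwrap_or_else(|e| e.into_inner()) route, here is a small sketch. The helper name and the values are invented for the example:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Poisons a mutex from a worker thread, then reads it anyway.
/// (Hypothetical helper for illustration.)
fn poison_then_read() -> i32 {
    let counter = Arc::new(Mutex::new(41));
    let worker = Arc::clone(&counter);

    // The worker updates the value and then panics while holding the guard.
    let _ = thread::spawn(move || {
        let mut guard = worker.lock().unwrap();
        *guard += 1;
        panic!("panic while holding the lock");
    })
    .join();

    // lock() now returns Err(PoisonError); into_inner() hands back the guard.
    let guard = counter.lock().unwrap_or_else(|e| e.into_inner());
    *guard
}

fn main() {
    println!("value behind the poisoned mutex: {}", poison_then_read());
}
```

The increment made before the panic is still visible, which is exactly why the caller has to decide whether the data can be trusted.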

H2CO3 · February 23, 2022, 1:41pm

Generally, all/most Rust code style guidelines will request that. However, poisoning is an exception — it doesn't mean that you aren't handling something that you did wrong. It means that you are propagating a panic someone else initiated. Thus, in general, it's fine to unwrap the result of lock().

psionic12 · February 23, 2022, 1:49pm

I'm a little confused by the term "panic". My understanding of a panic is a crash, but it seems that I'm wrong, and panic! works differently, right?

When I write C++, if I do something bad in a thread, such as 1/0 or accessing a non-existent address, the system will raise a signal to terminate the whole program, but apparently that is not what happens here.

So I got some further questions:

What is a panic exactly in Rust?
Is there something similar (thread crashes but program lives) in C++?

psionic12 · February 23, 2022, 1:52pm

Who unlocks the lock? Do you mean that if thread A panicked while holding a lock, I can unlock this poisoned lock from thread B?

alice · February 23, 2022, 1:55pm

Panics are similar to exceptions. They are not the same as a segfault.

As for who unlocks the lock, well, a panic runs the destructor of all local variables, and the destructor of the mutex guard will unlock it.
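To see the "a panic runs destructors" part in isolation, here is a small sketch with a made-up type whose Drop impl records that it ran. It assumes the default unwinding panic strategy:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical type that records when its destructor has run.
struct DropFlag(Arc<AtomicBool>);

impl Drop for DropFlag {
    fn drop(&mut self) {
        self.0.store(true, Ordering::SeqCst);
    }
}

/// Returns true if the local variable was dropped while the thread unwound.
fn drop_runs_during_unwind() -> bool {
    let dropped = Arc::new(AtomicBool::new(false));
    let flag = Arc::clone(&dropped);

    let _ = thread::spawn(move || {
        let _local = DropFlag(flag);
        panic!("unwinding starts here");
    })
    .join();

    dropped.load(Ordering::SeqCst)
}

fn main() {
    println!("destructor ran: {}", drop_runs_during_unwind());
}
```

A MutexGuard is released through the same mechanism: its Drop impl runs as the stack unwinds.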

psionic12 · February 23, 2022, 1:58pm

I'm even more confused. You mean that even if a panic happens, the local variables' destructors (or drop functions) still get called, so the lock guard will release the lock anyway, right? If so, the lock is released perfectly fine, so why is it called poisoned?

alice · February 23, 2022, 2:00pm

The standard library mutex guard destructor checks whether the thread is panicking when the destructor runs, and sets the poisoned flag if so.
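The same idea can be sketched with a hand-written guard-like type. This is a simplified stand-in, not the real MutexGuard (the standard library is more careful, for instance about guards created while a panic is already in progress); the type and function names are invented:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical guard-like type that sets a flag if dropped during a panic.
struct FlagOnPanic {
    poisoned: Arc<AtomicBool>,
}

impl Drop for FlagOnPanic {
    fn drop(&mut self) {
        // thread::panicking() is true while the current thread is unwinding.
        if thread::panicking() {
            self.poisoned.store(true, Ordering::SeqCst);
        }
    }
}

/// Drops the guard in a worker thread, either normally or via a panic,
/// and returns the state of the flag afterwards.
fn flag_after(should_panic: bool) -> bool {
    let poisoned = Arc::new(AtomicBool::new(false));
    let guard = FlagOnPanic { poisoned: Arc::clone(&poisoned) };

    let _ = thread::spawn(move || {
        let _guard = guard;
        if should_panic {
            panic!("panic while the guard is alive");
        }
    })
    .join();

    poisoned.load(Ordering::SeqCst)
}

fn main() {
    println!("normal drop sets flag: {}", flag_after(false));
    println!("drop during panic sets flag: {}", flag_after(true));
}
```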

steffahn · February 23, 2022, 2:04pm

Yes.

The idea is that &mut self functions in Rust will usually assume they run to completion in order to restore invariants. E.g. assume an accounting struct that supports transferring money from one account to another with a &mut self method. If that method panics in the middle of such a transfer, the whole accounting structure may suddenly contain more money overall.

To help prevent such situations, there are the auto traits UnwindSafe and RefUnwindSafe. A mutable reference in Rust is not UnwindSafe, which is supposed to make it harder to

capture something mutably in a closure, and
then execute that closure with catch_unwind

which would result in the broken state of such an accounting structure becoming visible to other code.
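Here is a sketch of what that looks like when the check is deliberately overridden with AssertUnwindSafe. The Accounts type, its method, and the amounts are all invented for illustration:

```rust
use std::panic::{self, AssertUnwindSafe};

/// Hypothetical accounting structure with two accounts.
struct Accounts {
    a: u64,
    b: u64,
}

impl Accounts {
    fn total(&self) -> u64 {
        self.a + self.b
    }

    /// Credits `b` before debiting `a`; panics in between if `a` lacks funds.
    fn transfer_a_to_b(&mut self, amount: u64) {
        self.b += amount;
        self.a = self.a.checked_sub(amount).expect("insufficient funds");
    }
}

/// Runs a failing transfer under catch_unwind and returns the total afterwards.
fn total_after_failed_transfer() -> u64 {
    let mut accounts = Accounts { a: 10, b: 10 };

    // Without AssertUnwindSafe this is rejected by the compiler, because the
    // closure captures `accounts` mutably and `&mut T` is not UnwindSafe.
    let result = panic::catch_unwind(AssertUnwindSafe(|| {
        accounts.transfer_a_to_b(50);
    }));
    assert!(result.is_err());

    // The half-finished transfer is now visible: b was credited, a was not debited.
    accounts.total()
}

fn main() {
    println!("total after failed transfer: {}", total_after_failed_transfer());
}
```

The structure started with a total of 20 and ends with 70, which is the kind of silently broken invariant the traits are meant to make harder to observe.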

The poisoning of Mutexes serves the same purpose. If you have a global Mutex<AccountingStructure> and a money transfer panics (idk perhaps due to integer overflow, or there were some custom callbacks involved that could panic…), then the Mutex gets into a poisoned state to indicate to any future user that

if the type contained in the mutex can be left in “broken” states, then it might be in such a broken state now, because some operation accessing the mutex’s contents did panic
hence when writing code accessing this Mutex you can either conservatively always unwrap when locking, so that any poisoning would result in propagating the panics, or, if you know more about the contained data type and how it’s used throughout the rest of the program and thus are certain that there are no invariants that can be broken, you can ignore the poisoning and decide to access the value anyways
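The conservative option (always unwrap) can be sketched as follows: once one thread has panicked while holding the lock, any later thread that unwraps the lock result panics as well. Names and data are made up for the example:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Returns true if a second thread that unwraps the poisoned lock also panics.
/// (Hypothetical helper for illustration.)
fn second_thread_panics_too() -> bool {
    let data = Arc::new(Mutex::new(vec![1, 2, 3]));

    // First thread: panics in the middle of an update, poisoning the mutex.
    let first = Arc::clone(&data);
    let _ = thread::spawn(move || {
        let mut guard = first.lock().unwrap();
        guard.clear();
        panic!("half-finished update");
    })
    .join();

    // Second thread: unwrap() turns the PoisonError into a new panic.
    let second = Arc::clone(&data);
    thread::spawn(move || {
        let guard = second.lock().unwrap();
        guard.len()
    })
    .join()
    .is_err()
}

fn main() {
    println!("panic propagated: {}", second_thread_panics_too());
}
```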

trentj · February 23, 2022, 2:09pm

In C++ panics are called exceptions, and they have basically the same unwinding behavior as in Rust. This is one reason why (C++)std::lock_guard exists, and is better than manually locking and unlocking a std::mutex: if the code holding a lock_guard returns early, throws an exception, or jumps past the unlock point, its destructor will still be called, releasing the lock. But in code that uses plain lock/unlock it is easy for an overlooked exception to accidentally unwind past the unlock, leaving the mutex locked ^[1].

C++ std::lock_guard does not distinguish between "normal" cleanup (return/continue/goto) and exception-unwinding, so when an exception is thrown while the lock is held the mutex will just be unlocked ^[2]. This is fine and is also how parking_lot's Mutex works. But if you're using a mutex to preserve some "library level" invariant, and an unexpected panic happens while that invariant is temporarily broken, it might be useful to signal that. So Rust's Mutex does distinguish between normal control flow and panic-unwinding, and sets the poison flag only in the second case.
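The difference between the two kinds of guard cleanup can be observed with Mutex::is_poisoned. A small sketch, with an invented helper name:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Locks the mutex in a worker thread, optionally panics while the guard is
/// held, and reports whether the mutex ended up poisoned.
fn poisoned_after(panic_while_locked: bool) -> bool {
    let m = Arc::new(Mutex::new(0));
    let worker = Arc::clone(&m);

    let _ = thread::spawn(move || {
        let _guard = worker.lock().unwrap();
        if panic_while_locked {
            panic!("panic while holding the guard");
        }
        // Otherwise the guard is dropped by normal control flow.
    })
    .join();

    m.is_poisoned()
}

fn main() {
    println!("after normal unlock: {}", poisoned_after(false));
    println!("after panic: {}", poisoned_after(true));
}
```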

^[1] and possibly causing UB, if the exception terminates the thread
^[2] I'm pretty sure, anyway, based on reading docs; I haven't tested this

psionic12 · February 23, 2022, 2:13pm

Please correct me if I'm wrong. So a "poisoned lock" indicates that some unrecoverable failure happened in the operations it protected, not that the lock itself is functionally broken, so even if a lock is poisoned, it is still usable by other threads (if you choose to ignore the poison state)?

steffahn · February 23, 2022, 2:18pm

Yeah, the lock still works fine. If the lock wasn’t unlocked, you would just keep blocking; getting a PoisonError means you successfully locked the lock just fine and can access it safely (in terms of memory safety), just the value it contains might be logically broken/inconsistent in some ways.

You can ignore the poisoning by turning the PoisonError<MutexGuard<'_, T>> into a MutexGuard<'_, T> with PoisonError::into_inner and then access the value in the mutex just fine. Even if inconsistent states are possible in your use-case you can also decide to still handle the PoisonError by calling some sort of recovery/cleanup method that brings the protected value back into a non-broken state.
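A sketch of the "recovery/cleanup" variant, assuming (purely for illustration) that an empty Vec is a known-good state for the protected value. The function names are hypothetical:

```rust
use std::sync::{Arc, Mutex, MutexGuard};
use std::thread;

/// Locks the mutex; if it is poisoned, resets the contents before handing
/// out the guard. Clearing the Vec is an assumed recovery strategy.
fn lock_and_repair(m: &Mutex<Vec<u32>>) -> MutexGuard<'_, Vec<u32>> {
    match m.lock() {
        Ok(guard) => guard,
        Err(poisoned) => {
            let mut guard = poisoned.into_inner();
            guard.clear();
            guard
        }
    }
}

/// Poisons a mutex from a worker thread, then locks it with repair.
fn len_after_repair() -> usize {
    let queue = Arc::new(Mutex::new(vec![1, 2, 3]));
    let worker = Arc::clone(&queue);

    let _ = thread::spawn(move || {
        let mut guard = worker.lock().unwrap();
        guard.push(4);
        panic!("panic in the middle of an update");
    })
    .join();

    let guard = lock_and_repair(&queue);
    guard.len()
}

fn main() {
    println!("length after repair: {}", len_after_repair());
}
```

Note that the mutex stays marked as poisoned after this, so every later lock() goes through the Err branch again; newer Rust versions (1.77 and later) also offer Mutex::clear_poison to reset the flag once the data has been repaired.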

alice · February 23, 2022, 2:18pm

To be clear, this would only happen if your own code doesn't properly take into account that something in it could panic.

steffahn · February 23, 2022, 2:21pm

Well, it depends on what exactly “properly taking into account that something could panic” means. AFAIK it’s mostly only important to make sure that panics can’t result in memory unsafety (something you’ll only need to worry about when you use unsafe code in the first place). Restoring invariants might even be impossible unless you execute more code that could ultimately result in double-panic. So in light of ensuring proper cleanup, it’s sometimes better to leave things in an inconsistent state rather than risking an abort.

riking · February 25, 2022, 5:16am

Hmm, I feel like there should be a standard helper function Result<MutexGuard<'_, T>, PoisonError<MutexGuard<'_, T>>>::ignore_poison() -> MutexGuard<'_, T> provided so that doing this is easy to spell.

steffahn · February 25, 2022, 7:13am

Currently the way to write this is something like .unwrap_or_else(PoisonError::into_inner).
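Until something like that exists in the standard library, the helper can be approximated with a small extension trait. IgnorePoison and ignore_poison are hypothetical names, not a std API:

```rust
use std::sync::{Arc, LockResult, Mutex, PoisonError};
use std::thread;

/// Hypothetical extension trait adding `ignore_poison` to lock results.
trait IgnorePoison<G> {
    fn ignore_poison(self) -> G;
}

// LockResult<G> is std's alias for Result<G, PoisonError<G>>.
impl<G> IgnorePoison<G> for LockResult<G> {
    fn ignore_poison(self) -> G {
        self.unwrap_or_else(PoisonError::into_inner)
    }
}

/// Poisons a mutex from a worker thread, then reads it via the helper.
fn read_after_poison() -> u32 {
    let m = Arc::new(Mutex::new(7));
    let worker = Arc::clone(&m);

    let _ = thread::spawn(move || {
        let _guard = worker.lock().unwrap();
        panic!("poison the mutex");
    })
    .join();

    let guard = m.lock().ignore_poison();
    *guard
}

fn main() {
    println!("value: {}", read_after_poison());
}
```

Because the impl is on LockResult<G> in general, the same method would also apply to the results of RwLock::read and RwLock::write.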
