Poisoned lock in drop handler: Thread panicked while panicking

I have the following drop handler in my code:

struct CursorBackend {
    inner: *mut lmdb::MDB_cursor,
    closed: std::sync::Mutex<bool>,
    inner_txn: *mut lmdb::MDB_txn,
}

impl Drop for CursorBackend {
    fn drop(&mut self) {
        if !*self.closed.get_mut().unwrap() {
            unsafe {
                lmdb::mdb_cursor_close(self.inner);
            }
        }
    }
}

And somewhere else in the code:

impl CursorBackend {
    fn assert_txn_backend(&self, txn_backend: &TxnBackend) {
        let closed = self.closed.lock().unwrap();
        if *closed || self.inner_txn != txn_backend.inner {
            panic!("cursor does not belong to transaction");
        }
        drop(closed);
    }
}

Now I have a test like this one:

#[test]
#[should_panic(expected = "cursor does not belong to transaction")]
fn test_cursor_wrong_txn() {
    let db_opts = DbOptions::new()
        .key_type::<str>()
        .value_type::<str>()
        .keys_duplicate()
        .unnamed();
    let td = tempdir().unwrap();
    let env_builder = EnvBuilder::new().dir(td.path());
    let mut env = unsafe { env_builder.open_rw() }.unwrap();
    let db = unsafe { env.create_db(&db_opts).unwrap() };
    let txn1 = env.txn().unwrap();
    let cursor = txn1.new_cursor(&db).unwrap();
    txn1.abort();
    let txn2 = env.txn().unwrap();
    txn2.cursor_set_first(&cursor).unwrap();
}

Calling cargo test gives:

…
test tests::test_cursor_delete ... ok
test tests::test_cursor_dupkey ... ok
test tests::test_create_and_drop_db ... ok
test tests::test_wrong_db - should panic ... ok
thread panicked while panicking. aborting.
error: test failed, to rerun pass '--lib'

Caused by:
  process didn't exit successfully: `/usr/home/jbe/mycrate/target/debug/deps/mycrate-a76b036331fa3043` (signal: 6, SIGABRT: process abort signal)

YIKES! :cold_face: I don't even get a proper error message!

Using GDB, I could traceback the error to this line in the Drop implementation for CursorBackend, where a PoisonError gets unwrap'ed:

        if !*self.closed.get_mut().unwrap() {

I was able to fix the problem using this modifiction:

     fn assert_txn_backend(&self, txn_backend: &TxnBackend) {
         let closed = self.closed.lock().unwrap();
         if *closed || self.inner_txn != txn_backend.inner {
+            drop(closed);
             panic!("cursor does not belong to transaction");
         }
         drop(closed);
     }

While this fixes my immediate problem (I could also use parking_lot, which doesn't have poisoned mutexes), I have a couple of questions now :thinking::

  1. How do I write my drop handler correctly? Is it okay to .unwrap() a LockResult or must/should I handle the PoisonError always in a drop handler? And if so, how?

  2. If I use locks that can be poisoned, am I always responsible for my code to not panic while the lock is held? I always thought that panicking ends everything anyway (by default), but apparently that's not true. It looks like the stack is unwound and during that procedure a couple of values are dropped and their drop handlers executed?

  3. Does that mean my state must always be safe in such a way that when I panic nothing bad happens when values are dropped? This might also affect other conditions that do not involve locks but other states my values could be in, which are bad when dropped.

I always thought that panicking ends everything anyway (by default), but apparently that's not true. It looks like the stack is unwound and during that procedure a couple of values are dropped and their drop handlers executed?

Yes, and: In tests, the test harness catches panics so that it can report them while still running other tests, and in production applications (the paradigmatic example being a network server) panics might be caught in order to allow other work to be done even if some of it provoked a bug. This is done with catch_unwind(). So, a well-written library must be prepared to deal with being invoked after a panic. Sometimes this means thinking carefully and ensuring that things are done in the right order with the right cleanup so that there is no path to state corruption; sometimes “tell the caller something went wrong” (which is what PoisonError is, from std to you); sometimes aborting is the best option.

Does that mean my state must always be safe in such a way that when I panic nothing bad happens when values are dropped?

That is good practice, yes, and mandatory when “bad” means “unsound” as might happen around unsafe code.

It might help to understand that lock poisoning is not a problem you need to solve but an attempt to help you get this right — the mutex is poisoned to flag “something panicked while modifying this value, so it might be in an invalid state”. If you're prepared to deal with whatever that state might be, you should use PoisonError::into_inner() to retrieve the lock guard and continue.

I didn't quite follow your example enough to see where the lock gets poisoned, so that might be a “false positive” (some mutexes essentially can't get poisoned unless there's a bug in the code that directly owns them, i.e. your library), but if it isn't, your Drop needs to handle that.

How do I write my drop handler correctly? Is it okay to .unwrap() a LockResult or must/should I handle the PoisonError always in a drop handler?

Drop handlers should avoid choosing to panic because, as you've seen, it can lead to an immediate abort — so any known to be possible failure condition should be handled somehow. You might want to use PoisonError::into_inner() to access the contents anyway, or you might want to decide that the situation is unrecoverable and abort. For example, if it might be the case that mdb_cursor_close was already called, then either aborting or leaking the resource is a better idea than calling it again and causing a double-free.

3 Likes

Here is a minimal example to show the problem where cargo test doesn't show any useful error message:

struct DontDropMe {}

impl Drop for DontDropMe {
    fn drop(&mut self) {
        panic!("Oh, no!");
    }
}

#[cfg(test)]
mod tests {
    use super::*;
    #[test]
    #[should_panic]
    fn should_panic() {
        let _dont_drop_it = DontDropMe {};
        panic!("PAAANIC!");
    }
}

(Playground)

Output:


running 1 test

Errors:

   Compiling playground v0.0.1 (/playground)
    Finished test [unoptimized + debuginfo] target(s) in 2.85s
     Running unittests (target/debug/deps/playground-265d51af0bb4cc7f)
thread panicked while panicking. aborting.
error: test failed, to rerun pass '--lib'

Caused by:
  process didn't exit successfully: `/playground/target/debug/deps/playground-265d51af0bb4cc7f` (signal: 6, SIGABRT: process abort signal)

Interestingly, when I execute should_panic in main, I get a more helpful error message:

struct DontDropMe {}

impl Drop for DontDropMe {
    fn drop(&mut self) {
        panic!("Oh, no!");
    }
}

fn should_panic() {
    let _dont_drop_it = DontDropMe {};
    panic!("PAAANIC!");
}

fn main() {
    should_panic();
}

(Playground)

   Compiling playground v0.0.1 (/playground)
    Finished dev [unoptimized + debuginfo] target(s) in 1.12s
     Running `target/debug/playground`
thread 'main' panicked at 'PAAANIC!', src/main.rs:11:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'main' panicked at 'Oh, no!', src/main.rs:5:9
stack backtrace:
   0:     0x5600337c30cd - std::backtrace_rs::backtrace::libunwind::trace::hee598835bc88d35b
                               at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
   1:     0x5600337c30cd - std::backtrace_rs::backtrace::trace_unsynchronized::h9cdc730ba5cf5d72
                               at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2: …

Here I get both "PAAANIC!" and "Oh, no!" in the error output.

Maybe it would be nice if cargo test was a bit more verbose too?

@kpreid Thank you very much for your in-depth response. I will read into it and try to fully understand everything.

If I understand correctly, the panic which gets the lock poisoned is the one expected by should_panic attribute.

Yes, I think that is the case.

Phew, that means I should go through all my code again, I guess. This makes me wonder: Calling a C function can never cause a panic, right? How about out-of-memory errors? This will also not panic but simply abort the process, I guess?

That's also very good to know.

Oh, that's actually pretty cool. So when I acquire a lock, three things can eventually happen:

  • deadlock (I'll wait forever),
  • getting a MutexGuard (I know the state is clean),
  • getting a PoisonError (I know the state might be invalid), in which case
    • I can drop the PoisonError and keep my hands off (perhaps panicking or aborting or leaking resources or do whatever), or
    • I can ignore the PoisonError (e.g. by invoking mutex.lock().map_err(PoisonError::into_inner)) if I am capable of dealing with the possibly invalid state.

Now I wonder why lock_api::Mutex::lock (and thus parking_lot) doesn't support this? Is there a particular reason?

After checking my code again, I decided to change my drop handlers accordingly:

 impl<'a> TxnBackend<'a> {
     fn close_cursors(&mut self) {
         for cursor in take(&mut self.cursors).iter() {
+            // NOTE: mutex should never be poisoned
             *cursor.closed.lock().unwrap() = true;
             unsafe {
                 lmdb::mdb_cursor_close(cursor.inner);
 impl Drop for DbBackend {
     fn drop(&mut self) {
         if let Some(env_backend) = WeakArc::upgrade(&self.env_backend) {
-            let db_guard = env_backend.db_mutex.lock().unwrap();
-            unsafe {
-                lmdb::mdb_dbi_close(env_backend.inner, self.inner);
-             }
-            drop(db_guard);
+            // NOTE: Closing databases is optional; if mutex is poisoned, it's
+            // not necessary to close the database (and no more databases can
+            // be opened anyway).
+            if let Ok(db_guard) = env_backend.db_mutex.lock() {
+                unsafe {
+                    lmdb::mdb_dbi_close(env_backend.inner, self.inner);
+                }
+                drop(db_guard);
+            }
         }
     }
 }
 impl Drop for CursorBackend {
     fn drop(&mut self) {
+        // NOTE: mutex should never be poisoned
         if !*self.closed.get_mut().unwrap() {
             unsafe {
                 lmdb::mdb_cursor_close(self.inner);

My closed mutex can't be poisoned anymore, so I simply use .unwrap() here but documented that the mutex should never be poisoned. Another mutex (db_guard) can be poisoned though because I noticed that Vec::push can panic. So I refrained from using .unwrap() in the drop handler and instead "leak" the resource temporarily (as it's cleaned up later automatically anyway).

I remember there have been some discussions somewhere on whether it's right to abort the process on memory exhaustion. What's the current state of discussion if I may ask? Can any function that allocates memory panic? Should a function which may panic on resource exhaustion document that it might panic? Or should code only assume it can't panic if all used functions expliclty state that they can't panic? I have never seen documentation like "this function will never panic" as of yet.

It is OK to panic, but you just have to delay it until the running code leaves the curly brace block where the lock() call is located. Doing a panic inside a mutex block would trigger poisoning, since it could indicate de-synchronized data.

Calling a C function can never cause a panic, right?

In general, yes. Technically, perhaps C code could call the Rust panic handler symbol (but it's not going to do that if it's a "plain C library". Also, in general, unwinding across the FFI boundary (C++ exceptions, longjmp) is UB because there's no mechanism for interoperability, but if in the future some such interoperability existed, you wouldn't be able to rely on lack of panics when calling arbitrary C code.

How about out-of-memory errors? This will also not panic but simply abort the process, I guess?

So when I acquire a lock, three things can eventually happen:

That's a good list.

Now I wonder why lock_api::Mutex::lock (and thus parking_lot ) doesn't support this? Is there a particular reason?

I'm not familiar with the exact design intent, but from what I've heard in passing, there is a cost to adding poison tracking (both actually setting/checking the flag, and API fallibility), and many applications have no use for the poison flag, so the design choice was made to leave it out since an application can add its own poison flag if it likes.

Personally, I suspect this may be the wrong tradeoff because most code is not written with as much attention to correctness-in-the-presence-of-unwinding as it should be, but I don't have evidence to present on that matter.

Can any function that allocates memory panic?

Yes, unless it is using the the very few stable ways to fallibly allocate, such as Vec::try_reserve().

Should a function which may panic on resource exhaustion document that it might panic? Or should code only assume it can't panic if all used functions expliclty state that they can't panic? I have never seen documentation like "this function will never panic" as of yet.

The general advice is to document all ways a function can panic. In practice, memory allocation failure is often overlooked (as is numeric overflow).

1 Like

This was my way to solve the particular case:

It will ensure that the MutexGuard gets dropped first before the panic happens. That seemed easier (to me) than doing it with curly braces.

In case of the closed mutex above, I can ensure now that it will not be poisoned elsewhere.

That makes sense, and I'm also not sure about if it's a wise choice. Perhaps documentation on such mutexes should warn about it (not sure if it does and I just missed it). I wasn't aware of the dangers until I ran into signal 6 during testing as shown in my OP.

Interestingly, I stumbled upon the problem when I replaced parking_lot::Mutex with std::sync::Mutex. The parking_lot mutex didn't report any error. In my particular case, this wasn't causing UB (I assume) because my assert_txn_backend function didn't modify the state, but in other cases it may be dangerous.

But that depends on how the function deals with OOM, right? I don't see any documentation in Vec::push which indicates that it may panic on OOM, for example (if the capacity stays smaller than isize::MAX).

Thank you very much for your detailed explanations. This helps me a lot with understanding panics (and their dangers) better.

I haven't followed this thread closely, but here's an issue to read Re: pitfalls of unwinding from Drop::drop.

2 Likes

Good to know; thanks!

Apartently I made a mistake with terminology here. If I get it right, then aborting is a form of panicking (terminology-wise)?

I tried to read into the issue @quinedot brought up, and now I feel a bit more confused.

Apparently it's my task as a library developer to ensure that my state can never be left in an invalid condition as a panic could interrupt my code flow in a lot of calls (e.g. Vec::push) and then someone could catch the panic and call functions again? But isn't the auto trait UnwindSafe responsible for making sure this can't cause problems? I don't really fully understand it, but since it's not implemented for mutable references, I would assume a half-modified data structure (that is then in an invalid state) can't be accessed after a panic happens?

Let me try to summarize what I need to know as a library developer:

  • I should avoid panics in Drop::drop because these are hard to debug (and might also cause an abort, e.g. when doing cargo test as in my original example or generally when/if it is made a default behavior to abort on panic in drop handlers.
  • I must make my drop handlers capable of dealing with any state self might be in, which includes half-modified structs that could be in an invalid state due to a panic while working on the struct.
  • If I use locks that do not support poisoning, like parking_lot::Mutex (or lock_api::Mutex), then I must either implement my own poisoning mechanism or be ready to deal with half-modified / invalid states as well.

Not sure if I got this right or if I forgot something?

I'm also not sure whether Drop::drop is the only function that must deal with a half-modified / invalid state. Let me show to you the following example:

fn might_panic(unlucky: bool) {
    if unlucky {
        panic!("Failure!")
    }
}

struct S {
    vec: Vec<&'static str>,
    len: usize,
}

impl Drop for S {
    fn drop(&mut self) {
        // We don't call any method marked as unsafe here, but are we safe?
        println!("Dropping: {}", self.get_last());
    }
}

impl S {
    const fn new() -> Self {
        S {
            vec: Vec::new(),
            len: 0,
        }
    }
    fn add_element(&mut self, element: &'static str, unlucky: bool) {
        self.len += 1;
        // Is this dangerous? Panics can happen so often...
        might_panic(unlucky);
        self.vec.push(element);
    }
    fn get_last(&self) -> &'static str {
        // We assume valid state here. Is that bad?
        if self.len == 0 {
            "(none)"
        } else {
            unsafe { self.vec.get_unchecked(self.len-1) }
        }
    }
}

fn main() {
    let mut s = S::new();
    /*
    // The compiler keeps us from doing this:
    let result = std::panic::catch_unwind(|| {
        s.add_element("Apple", true);
    });
    */
    println!("{}", s.get_last());
    // But we can do this:
    s.add_element("Apple", true);
}

(Playground)

Output:

(none)

Errors:

   Compiling playground v0.0.1 (/playground)
    Finished dev [unoptimized + debuginfo] target(s) in 8.44s
     Running `target/debug/playground`
thread 'main' panicked at 'Failure!', src/main.rs:3:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
timeout: the monitored command dumped core
/playground/tools/entrypoint.sh: line 11:     8 Segmentation fault      timeout --signal=KILL ${timeout} "$@"

This is undefined behavior because the drop handler calls a method that assumes the struct is in a valid state. So far so good bad, but which method is to be considered the culprit here?

  • The implementation of Drop::drop, because it calls a method that doesn't cope with an invalid state (even though it's marked safe)?
  • The implementation of get_last because it should be marked unsafe, because we can never assume that the internal state is valid?
  • The implementation of add_element, which dares to panic while being in the middle of modifying &mut self?

:man_shrugging:

P.S.: And what's UnwindSafe good for if it doesn't protect us from the above segmentation fault? We have to be careful all the time anyway to never assume invalid states :slightly_frowning_face:. But maybe I don't fully understand it yet.

An aborting is a possible reaction to panic, but it can be triggered from the other sources, not only by panicking. In particular, "abort on OOM" is not panic, since it will always abort, never unwind.

It's not, of course. Moreover, Drop::drop should deal with invalid state, not "must deal", since the worst outcome (assuming there's no UB before) is aborting on double-panic; on the other hand, the unsafe code, as you've shown here, definitely must.

This is the problem. You have a safety invariant - self.len == self.vec.len(), which you rely on inside get_last. But add_element violates this invariant.

2 Likes

Okay, I can (but shouldn't) panic on an invalid state in the drop handler, but I must not trigger UB on an invalid state, got it (I think).

Okay, but then I don't understand why we have UnwindSafe and what it's supposed to help with? Apparently we can never rely on a state not being half-modified / invalid due to a panic. What is UnwindSafe good for, then?

In What is unwind safety? it is said that:

In Rust a function can “return” early if it either panics or calls a function which transitively panics. This sort of control flow is not always anticipated, and has the possibility of causing subtle bugs through a combination of two critical components:

  1. A data structure is in a temporarily invalid state when the thread panics.
  2. This broken invariant is then later observed.

Typically in Rust, it is difficult to perform step (2) because catching a panic involves either spawning a thread (which in turns makes it difficult to later witness broken invariants) or using the catch_unwind function in this module.

What this documentation doesn't mention is that it's actually pretty easy to perform step 2, because Drop::drop will receive a reference to the broken state (and might pass it to other functions). See modified Playground with unsafe code removed, which still behaves in a nasty way (but at least isn't unsound anymore).

It's supposed to help with catch_unwind, i.e. when you explicitly say "this code might panic, but I know what invariants can be broken and how to fix them afterwards".

But apparently I have to expect broken states always and everywhere, e.g. in my get_last method. Or does this only apply to drop handlers and functions called by drop handlers?

You don't have to expect broken state if the code which manages this state is correct - that's the whole point of encapsulation, after all. In your example, no code from outside the module can mess with the invariant (since fields of the structure are private), and you can and must make a full check that your code upholds the invariant you need. If the add_element was correct (that means, if it upheld the necessary invariant), Drop implementation would be no problem, too.

2 Likes

Okay, I think I understand now. Thank you very much!

One useful way to uphold state in the presence of unwinding panics is to use a drop guard; that is, a special type that cleans up the state when it is dropped. Since both normal falling-out-of-scope and unwinding panics execute destructors, the Drop code will always run, not unlike a finally block in a language with exceptions. The only way to prevent it is to forget the guard value. In the case of add_element(), let us suppose that might_panic depends on the temporary state self.len == self.vec.len() + 1 somehow. Then, we could write it with a drop guard:

fn add_element(&mut self, element: &'static str, unlucky: bool) {
    struct LenGuard<'a>(&'a mut usize);
    impl Drop for LenGuard<'_> {
        fn drop(&mut self) {
            *self.0 -= 1;
        }
    }
    self.len += 1;
    {
        let len_guard = LenGuard(&mut self.len);
        might_panic(unlucky);
        self.vec.push(element);
        std::mem::forget(len_guard);
    }
}

Finally, note that unsafe code is nefarious: its presence can place important restrictions on all other safe code around it. This is why we have to be so careful writing add_element(), as well as any other method that modifies self.vec or self.len. In practice, a useful tool is to define invariants for one's type, i.e., properties that must be upheld by all methods with access to its private data. Here, a sufficient safety invariant is self.len <= self.vec.len(), but to prevent logic errors, we want the stronger invariant self.len == self.vec.len().

1 Like

I think I heard the term "drop guard" before, but never read deeper into it (because I didn't understand unwinding well, back then).

How about providing an abstract DropGuard:

pub struct DropGuard<F: FnOnce()>(Option<F>);

impl<F: FnOnce()> DropGuard<F> {
    pub fn new(cleanup: F) -> Self {
        Self(Some(cleanup))
    }
    pub fn trigger(self) {
        drop(self)
    }
    pub fn release(self) {
        std::mem::forget(self) // edit: bad! this leaks resources, see reply to this post
    }
}

impl<F: FnOnce()> Drop for DropGuard<F> {
    fn drop(&mut self) {
        self.0.take().unwrap()()
    }
}

(Playground)

I guess this is so generic that it's already existent somewhere?