How to predict deadlocks (corrected)?


#1

I am using a wrapper for RwLock like this:

use std::sync::{Arc, RwLock};

pub struct MyLock<T> {
    inner: Arc<RwLock<T>>,
    name: String,
}

impl<T> MyLock<T> {
    pub fn new(v: T, name: &str) -> Self {
        Self {
            inner: Arc::new(RwLock::new(v)),
            name: name.to_string(),
        }
    }

    pub fn write<F, O>(&self, f: F) -> O
    where
        F: FnOnce(&mut T) -> O,
    {
        println!("Writer acquiring {}.", self.name);
        let mut guard = self.inner.write().expect("RwLock was poisoned");
        println!("Writer acquired {} - ok.", self.name);
        f(&mut guard)
    }
}


fn main() {
    let val1: u64 = 5;
    let lock1 = MyLock::new(val1, "Test lock No 1");
    println!("this must do");
    lock1.write(|a| *a = 6);
    println!("this must not do");
    // Re-entrant write on the same lock: deadlocks or panics, depending on platform.
    lock1.write(|_a| lock1.write(|a| *a = 7));
}

How could this be made more informative in case of a deadlock?

If used extensively as written, it would flood the logs with pre-locking messages, only a few of which would correspond to the actual deadlocks that need debugging.

If the logging (the println calls in the example above) is removed, write() would panic without including the lock's name in the message.

And try_write() doesn't distinguish in its error whether WouldBlock is caused by this thread or by some other thread, so it can't tell whether it makes sense to wait until the lock is released, or better to panic with detailed information about the deadlock.

What's the way to determine the blocking thread without panicking? Or is it easier to change the panic message (which actually originates in a platform-specific part of std's sys module) somehow in the wrapper, e.g. by hooking it? Or better to log a detailed message using panic::catch_unwind?
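One way to answer the "which thread holds it" question, as a sketch only (`TracedLock` and its `holder` field are invented names here, not a std API), is to have the wrapper record the `ThreadId` of the current writer, so same-thread re-entrancy can be reported with the lock's name before the real `RwLock` is ever touched:

```rust
use std::sync::{Arc, Mutex, RwLock};
use std::thread::{self, ThreadId};

pub struct TracedLock<T> {
    inner: Arc<RwLock<T>>,
    // Which thread currently holds the write lock, if any.
    holder: Mutex<Option<ThreadId>>,
    name: String,
}

impl<T> TracedLock<T> {
    pub fn new(v: T, name: &str) -> Self {
        Self {
            inner: Arc::new(RwLock::new(v)),
            holder: Mutex::new(None),
            name: name.to_string(),
        }
    }

    pub fn write<F, O>(&self, f: F) -> O
    where
        F: FnOnce(&mut T) -> O,
    {
        let me = thread::current().id();
        // Check for re-entrancy *before* touching the RwLock, so we can
        // report the lock's name and thread instead of deadlocking or
        // hitting the generic panic from std.
        let reentrant = *self.holder.lock().unwrap() == Some(me);
        if reentrant {
            panic!("re-entrant write on lock '{}' from thread {:?}", self.name, me);
        }
        let mut guard = self.inner.write().expect("RwLock was poisoned");
        *self.holder.lock().unwrap() = Some(me);
        let out = f(&mut guard);
        *self.holder.lock().unwrap() = None;
        out
    }
}
```

This only catches same-thread write re-entrancy; a cross-thread lock-order deadlock would need the full acquisition graph, and if the closure panics the `holder` field is left stale, so this is debug scaffolding rather than production code.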


#2

Be careful! Your code may be exhibiting undefined behavior.

It might be worth checking out the lock_api crate, which has methods for waiting on a write lock until a timeout lapses, like try_write_for, and then handling the timeout as appropriate. (Unfortunately, it’s not that simple in reality: what’s a good choice for the timeout duration? Can your code continue to function if a write lock was unable to be acquired? How do you deal with poisoning? The answers to these questions are entirely application-specific.)
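If pulling in lock_api (or parking_lot, which implements its timed methods) isn't an option, a rough std-only stand-in for try_write_for can be built by polling try_write until a deadline; `write_or_timeout` is a hypothetical helper, and the 1 ms poll interval is an arbitrary choice:

```rust
use std::sync::{RwLock, TryLockError};
use std::time::{Duration, Instant};

/// Poll `try_write` until `timeout` elapses. Returns None on timeout,
/// which is the point where detailed deadlock diagnostics could be logged.
fn write_or_timeout<T, F, O>(lock: &RwLock<T>, timeout: Duration, f: F) -> Option<O>
where
    F: FnOnce(&mut T) -> O,
{
    let deadline = Instant::now() + timeout;
    loop {
        match lock.try_write() {
            Ok(mut guard) => return Some(f(&mut guard)),
            Err(TryLockError::Poisoned(e)) => {
                // A writer panicked while holding the lock; here we choose
                // to recover the (possibly inconsistent) data anyway.
                let mut guard = e.into_inner();
                return Some(f(&mut guard));
            }
            Err(TryLockError::WouldBlock) => {
                if Instant::now() >= deadline {
                    return None; // deadlock candidate
                }
                std::thread::sleep(Duration::from_millis(1));
            }
        }
    }
}
```

Busy-polling like this wastes cycles and adds latency compared to the real timed locks in parking_lot, but it keeps the dependency tree untouched and makes the poisoning policy explicit.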


#3

Thanks, I understand. But the code example is not intended to allow, work around, or transparently hide the re-entrancy; it is meant to expose it and give more details (than the standard panic message) about when and where it occurs, so that the re-entrant calls can be cleaned out of the other modules. After that, this debug functionality in the RwLock wrapper will no longer be necessary. As long as it catches deadlocks on my test platform (as it does on the playground too), it suffices for debugging, and once the re-entrancy is gone, this lock code can be replaced with something more portable and usable in production.

The lock_api functionality with timeouts may help in some way (with large enough timeout values, which would in practice narrow things down to actual deadlocks), though I haven't yet found the poisoning behaviour in its documentation. Maybe panic::catch_unwind could be used here?


#4

Note that the code as provided doesn’t panic on macOS, it just hangs. If you want to always “do the right thing”, you’ll have to avoid reentrancy or use an API that explicitly supports it.


#5

Avoiding re-entrancy is exactly what I am trying to do throughout the rest of the code that uses this RwLock. That is why I wanted detailed information about the deadlock situation when it occurs, so that the source of the problem (which must be in the calling code) can be located and removed.


#6

So far I have come to something like this:

use std::panic;
use std::sync::{Arc, RwLock};

pub struct MyLock<T> {
    inner: Arc<RwLock<T>>,
    name: String,
}

impl<T> MyLock<T> {
    pub fn new(v: T, name: &str) -> Self {
        Self {
            inner: Arc::new(RwLock::new(v)),
            name: name.to_string(),
        }
    }

    pub fn write<F, O>(&self, f: F) -> O
    where
        F: FnOnce(&mut T) -> O,
    {
        println!("Writer acquiring {}.", self.name);
        // Catch a possible "would result in deadlock" panic from write()
        // so the lock's name can be logged before re-raising it.
        let trylock = panic::catch_unwind(|| self.inner.write().expect("RwLock was poisoned"));
        let mut data = match trylock {
            Ok(guard) => guard,
            Err(e) => {
                println!("Writer acquiring {} - failed.", self.name);
                panic::resume_unwind(e);
            }
        };
        println!("Writer acquired {} - ok.", self.name);
        f(&mut data)
    }
}


fn main() {
    let val1: u64 = 5;
    let lock1 = MyLock::new(val1, "Test lock No 1");
    println!("this must do");
    lock1.write(|a| *a = 6);
    println!("this must not do");
    // Re-entrant write on the same lock: deadlocks or panics, depending on platform.
    lock1.write(|_a| lock1.write(|a| *a = 7));
}

It still may not always work as intended, and it would most probably hang on macOS. But it does its job on some Linux systems, and it gives a place in the code where more details about the failure can be added. I can try it on other platforms later.
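On the "hooking the panic message" idea from the start of the topic: std's panic::take_hook/set_hook can chain extra context in front of the default output without replacing it. A sketch, assuming the wrapper publishes the lock name in a thread-local just before calling inner.write() (the `ACQUIRING` name and the whole scheme are invented here):

```rust
use std::cell::RefCell;
use std::panic;

thread_local! {
    // Name of the lock this thread is currently trying to acquire, if any.
    // The wrapper would set this just before inner.write() and clear it after.
    static ACQUIRING: RefCell<Option<String>> = RefCell::new(None);
}

/// Chain our hook in front of the default one, so the normal panic
/// message and backtrace still appear after the extra line.
pub fn install_lock_panic_hook() {
    let default_hook = panic::take_hook();
    panic::set_hook(Box::new(move |info| {
        ACQUIRING.with(|n| {
            if let Some(name) = n.borrow().as_ref() {
                eprintln!("panic while acquiring lock '{}'", name);
            }
        });
        default_hook(info);
    }));
}
```

The hook is process-global and also fires for panics caught by catch_unwind, so it complements the wrapper above rather than replacing it.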


#7

Looks like you updated the code recently. Now on the playground, it will print a stack trace:

thread 'main' panicked at 'rwlock write lock would result in deadlock', libstd/sys/unix/rwlock.rs:90:13
stack backtrace:
   0: std::sys::unix::backtrace::tracing::imp::unwind_backtrace
             at libstd/sys/unix/backtrace/tracing/gcc_s.rs:49
   1: std::sys_common::backtrace::print
             at libstd/sys_common/backtrace.rs:71
             at libstd/sys_common/backtrace.rs:59
   2: std::panicking::default_hook::{{closure}}
             at libstd/panicking.rs:211
   3: std::panicking::default_hook
             at libstd/panicking.rs:227
   4: std::panicking::rust_panic_with_hook
             at libstd/panicking.rs:476
   5: std::panicking::begin_panic
             at libstd/panicking.rs:410
   6: std::sys::unix::rwlock::RWLock::write
             at libstd/macros.rs:72
   7: std::sys_common::rwlock::RWLock::write
             at libstd/sys_common/rwlock.rs:51
   8: <std::sync::rwlock::RwLock<T>>::write
             at libstd/sync/rwlock.rs:277
   9: <playground::MyLock<T>>::write
             at src/main.rs:18
  10: playground::main::{{closure}}
             at src/main.rs:31
  11: <playground::MyLock<T>>::write
             at src/main.rs:20
  12: playground::main
             at src/main.rs:31
  13: std::rt::lang_start::{{closure}}
             at libstd/rt.rs:74
  14: std::panicking::try::do_call
             at libstd/rt.rs:59
             at libstd/panicking.rs:310
  15: __rust_maybe_catch_panic
             at libpanic_unwind/lib.rs:102
  16: std::rt::lang_start_internal
             at libstd/panicking.rs:289
             at libstd/panic.rs:392
             at libstd/rt.rs:58
  17: std::rt::lang_start
             at libstd/rt.rs:74
  18: main
  19: __libc_start_main
  20: _start

… and it isn’t hanging on macOS. I’m still not entirely sure if this is safe, or if it will always behave this way (and identically across platforms). But combined with a stack trace, it’s pretty easy to see where the reentrancy occurs.


#8

Good to hear it doesn’t hang on a Mac. But the most curious thing is that this is quite unexpected, because I didn’t change it in that way. :slight_smile:
I slightly modified the first sample to add a second println, which distinguishes a hang from normal execution. And the second example adds panic::catch_unwind, which should only take effect for those write() calls that panicked in the first place. Those that hung instead of panicking should remain hanging, but as you say that is not the case. It’s intriguing.
As for the stack backtrace: I thought it was optional; on the playground it was somewhere in the advanced options under ‘…’. I did not manage to get a backtrace from the code above.
Of course, the backtrace tells more about it; it was just not quite sufficient at the start of the topic. Now catch_unwind together with a backtrace is much more useful for debugging.
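As a side note for later toolchains: since Rust 1.65, std::backtrace::Backtrace::force_capture() captures a trace programmatically regardless of the RUST_BACKTRACE variable, so a debug wrapper could attach one to its own messages instead of relying on the environment. A minimal sketch (`acquisition_trace` is an invented helper name):

```rust
use std::backtrace::Backtrace;

/// Render a backtrace for a suspicious lock acquisition site.
/// Unlike `Backtrace::capture()`, `force_capture()` ignores the
/// RUST_BACKTRACE environment variable.
pub fn acquisition_trace(lock_name: &str) -> String {
    format!("acquiring lock '{}':\n{}", lock_name, Backtrace::force_capture())
}
```

Whether the trace contains resolved symbols still depends on the platform and build settings, but the capture itself no longer needs playground options or environment variables.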