Cancelling a distributed lock inside a Mutex from another thread causes a deadlock

ianzhang1988 · August 21, 2019, 10:30am

I implemented a distributed lock based on ZooKeeper and I want to test cancellation of the distributed lock. This causes a dead lock.

Here is code using AtomicUsize to simulate the ZooKeeper distributed lock (I chose AtomicUsize just because it's easy to share between thread, not sure if I using it in a preferred way).
Code updated: 2019-08-22
playground: AtomicUsize simulation

use std::{
    sync::{
        atomic::{self, Ordering},
        Arc, Mutex,
    },
    thread,
};

fn main() {
    // use AtomicUsize simulate zookeeper distrute lock;
    // acting as a remote lock mechanism
    let counter = Arc::new(atomic::AtomicUsize::new(1));

    let lock1 = DistrLock::new(counter.clone());
    let lock1_handle = Arc::new(Mutex::new(Box::new(lock1)));

    let counter_thread = counter.clone();
    let lock1_handle_thread = lock1_handle.clone();

    // simulate another machine use distribute lock
    let thandle = thread::spawn(move || {
        let mut lock2 = DistrLock::new(counter_thread);

        println!("lock 1");
        lock2.acquire().expect("debug thread1 1"); // distribute lock acquired

        thread::sleep_ms(10_000);

        println!("lock2 release");
        lock2.release().expect("debug thread1 2");
    });

    // second thread try cancelling lock1
    let thandle = thread::spawn(move || {
        thread::sleep_ms(2000); // lock1 mutex lock acquired in main thread wait for distribute lock

        println!("lock 3");
        lock1_handle_thread
            .lock() // wait for lock1 mutex release, dead lock because lock1 wait for distribute lock2 to release
            .expect("debug thread2 1")
            .cancel() // test cancel for distribute lock
            .expect("debug thread2 2");

        println!("lock 4"); // can not reach until lock 2 release,  means can't cancel lock1 until lock1 get the distribute lock
    });

    thread::sleep_ms(1000);

    println!("lock 2");
    lock1_handle
        .lock() // mutex lock acquired
        .expect("debug main 1")
        .acquire() // wait for distribute lock in_lock_2 to release
        .expect("debug main 1.1");

    lock1_handle
        .lock()
        .expect("debug main 2")
        .release()
        .expect("debug main 2.1");

    thandle.join().expect("debug main 3");
}

struct DistrLock {
    counter: Arc<atomic::AtomicUsize>,
    canceled: bool,
}

impl DistrLock {
    pub fn new(counter: Arc<atomic::AtomicUsize>) -> DistrLock {
        let lock = DistrLock {
            counter: counter,
            canceled: false,
        };

        lock
    }

    pub fn acquire(&mut self) -> Result<bool, ()> {
        loop {
            if self.canceled {
                return Ok(false);
            }

            if self.counter.load(Ordering::SeqCst) < 1 {
                thread::sleep_ms(500);
                continue;
            } else {
                // self.counter.store(self.counter.load(Ordering::Relaxed) - 1, Ordering::Relaxed);
                self.counter
                    .store(self.counter.load(Ordering::SeqCst) - 1, Ordering::SeqCst);
                return Ok(true);
            }
        }
    }

    pub fn release(&mut self) -> Result<(), ()> {
        // self.counter.store(self.counter.load(Ordering::Relaxed) + 1, Ordering::Relaxed);
        self.counter
            .store(self.counter.load(Ordering::SeqCst) + 1, Ordering::SeqCst);
        Ok(())
    }

    pub fn cancel(&mut self) -> Result<(), ()> {
        self.canceled = true;
        Ok(())
    }
}

There would not be any problem implementing this in Python or C++ because I would not need a mutex to protect ZkDistrLock, I could just call cancel in a second thread.

But in Rust, if I lose the Mutex, the compiler will complain that the use of lock1 is not thread safe.

Is there some way I can lose the Mutex or is there a different idiom that I can use?

My original implementation in the playground. The playground includes the distributed lock implementation but not all code here, can not compile

Yandros · August 21, 2019, 5:28pm

Both your acquire and release decrement the counter, this is very unlikely to be what you want;
```
pub fn cancel(&mut self) -> Result<(), ()> {
    self.canceled = true;
    Ok(())
}
```
This is also very unlikely to be what you want: your are setting a cancel: bool flag through a &mut, this by design means it cannot affect another concurrent thread (&mut means unique access).
```
self.counter.store(
    self.counter.load(Ordering::Relaxed) - 1,
    Ordering::Relaxed,
);
```
- Do not use Ordering::Relaxed; unsafe code may rely on the exclusive access granted by locks to perform mutation, and Ordering::Relaxed make it so the compiler may reorder the mutation out of the exclusive / critical section, thus leading to data races and Undefined behavior. If in doubt, use Ordering::SeqCst, or ::crossbeam::atomic::AtomicCell<usize> which uses this ordering under the hood.
- This is racy (not UB-level racy, since there is no unsafe, but logic-level racy). The point of using atomics is to do this kind of operations in one CPU intruction:
```
self.counter.fetch_sub(1, Ordering::SeqCst);
// or, with AtomicCell<usize>:
self.counter.fetch_sub(1);
```
I haven't really understood what you were trying to test in your main, but here is a cleaned version of your code with only the bare minimums. You will notice that even Arc wasn't needed, thanks to ::crossbeam scoped threads

shepmaster · August 21, 2019, 6:59pm

This was cross-posted to Stack Overflow.

ianzhang1988 · August 22, 2019, 4:14am

Indeed, should be increment in release.

I had update my code in main, hope it make sense. Generally, Machine 2 acquired a distribute lock. Machine 1 main thread want the same distribute lock, then Machine 1 second thread try cancel the operation, but can`t get the mutex until Machine 2 release the distribute lock.

Don`t know ::crossbeam yet, need some reading.

ianzhang1988 · August 22, 2019, 7:44am

Yandros:

pub fn cancel(&mut self) -> Result<(), ()> {
    self.canceled = true;
    Ok(())
}
This is also very unlikely to be what you want: your are setting a cancel: bool flag through a &mut , this by design means it cannot affect another concurrent thread ( &mut means unique access).

I'm not deeply understand &mut self, but code here, did work. Could you share some insight

Yandros · August 22, 2019, 10:59am

Something like this?

<ThreadId(2)> attempting to aquire the lock:
<ThreadId(2)> lock aquired, holding it for a while...
<ThreadId(3)> attempting to aquire the lock:
<ThreadId(3)> lock is not free, waiting...
<ThreadId(3)> lock is not free, waiting...
<ThreadId(3)> lock is not free, waiting...
<ThreadId(3)> lock is not free, waiting...
<ThreadId(3)> lock is not free, waiting...
<ThreadId(3)> lock is not free, waiting...
<ThreadId(3)> lock is not free, waiting...
<ThreadId(3)> lock is not free, waiting...
<ThreadId(3)> lock is not free, waiting...
<ThreadId(3)> lock is not free, waiting...
<ThreadId(3)> lock is not free, waiting...
<ThreadId(3)> lock is not free, waiting...
<ThreadId(3)> lock is not free, waiting...
<ThreadId(3)> lock is not free, waiting...
<ThreadId(3)> lock is not free, waiting...
<ThreadId(3)> lock is not free, waiting...
<ThreadId(3)> lock is not free, waiting...
<ThreadId(3)> lock is not free, waiting...
<ThreadId(2)> Done. Releasing it...
<ThreadId(3)> lock aquired, canceling it...
<ThreadId(2)> Attempting to reacquire the lock:
<ThreadId(2)> lock was canceled, as expected.

?

Playground

Yes, I should have clarified:

you are either building something to be directly shared (e.g., among multiple threads), in which case you need shared access to self, i.e., a &self.
Or you are building something to be wrapped in a Mutex (or a RwLock). Then you can have &mut (i.e., exclusive access) thanks to the Mutex creating critical sections when the lock is aquired. Using atomics, for instance, lets you mutate through a shared reference, thus leading to a usage without a wrapping Mutex (case 1.). But then you had that naked bool, mutated through a &mut, thus hinting at some private state not shared with others.

So you either decide to go "full 2.", and use a Mutex to wrap you struct; in which case you no longer need the atomic, and you could use a simple usize. But then you are no longer really creating a lock, since you would be relying on std's lock. Hence the suggestion to also use an atomic for the canceled state, which can now be shared and mutated across threads.

It is a crate featuring very useful primitives for multi-threading. The two features I have used in my example are:

AtomicCell<usize> / AtomicCell<bool>: these are the same as AtomicUsize and AtomicBool, except for the fact that they do not require an explicit Ordering: they default to using Ordering::SeqCst, which is the only sane default (usage of any other Ordering is, imho, one of the most difficult and subtle topics in computer science).

thread::scope this is a game / life changer, and is a very good candidate (one of the very few!) to be added to the ::std library: It allows spawning threads that borrow locals (in technical terms, the input closure's environment does not need to be 'static). In other words, no need to use statics or to wrap stuff in Arcs!

let some_local = String::from("Hello");
let hello: &str = &*some_local;

::crossbeam::thread::scope(|scope|{//---------------------------------

    let some_other_local = String::from("this one cannot be borrowed");

    // spawning a thread is done here,
    // between the two enclosing braces, by using:
    scope.spawn(|_| {
        // ... thread logic
        println!("{} from a first thread!", hello);
    });

    // you can spawn multiple threads
    scope.spawn(|_| {
        println!("{} from a second thread!", hello);
    });

    // you can spawn a thread that spawns another thread:
    scope.spawn(|scope| {
        println!("{} from a third thread!", hello);
        scope.spawn(|_| {
            println!("{} from a fourth thread!", hello);
        });
    });

})//-----all the spawned threads are automagically `.join()`ed--------
    // time to handle the case where some `.join()` failed
    .expect("Some thread panicked")
;

So the idea is that some_local / hello can be borrowed safely, since all the scope.spawn()ed threads are guaranteed to have ended (been .join()ed) by the time the thread::scope(...) call returns

they cannot, on the other hand, borrow some_other_local since it is created after the thread::scope( // ----- line and will thus be dropped before the safety .join() point.
- one thread can, however, take ownership of some_other_local by using a move |_| closure
(also note that the fourth thread can perfectly outlive the third)

ianzhang1988 · August 22, 2019, 12:30pm

Wonderful reply, Thanks.
Though it's some what different than what I have in my mind. But I Got the idea here, I shall try to transform those attribute into AtomicCell type. so I can lose Mutex, then I would be able to mutate those attribute between thread.

system · November 20, 2019, 12:36pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Thread::park_timeout deadlock caused by reentring call to thread mutex? help	7	862	January 12, 2023
Fix a deadlock in a simple channel help	6	153	February 27, 2024
Deadlock: Is it a bug, or is it intentional?	6	4921	January 12, 2023
Stuck trying to pass closure between threads help	2	1126	January 12, 2023
Mutex starvation help	2	780	May 11, 2023

Cancelling a distributed lock inside a Mutex from another thread causes a deadlock

Related Topics