Unable to identify why this deadlock is happening

Most minimal example I could make https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=76c280d06791b537998620f3cca2db7a

Dependencies are crossbeam = "0.7.3" and parking_lot = "0.11.0".
It's better to run it in a new Cargo project rather than on the playground, because the stdout formatting is messed up there.

The basic idea of this program is to simulate a simple p2p network. The first client already has the file (`client.file` is a vec filled with 1s). A new client with no file (`client.file` is a vec filled with 0s) joins every second. Right after a new client is created, two threads are spawned for it. The first thread (lines 62-77) checks which parts of the client's file are 0 and puts a request in the queue for each one. The second thread (lines 80-97) takes requests off the queue and changes those 0s into 1s.
The thread at the start (lines 24-47) checks progress and prints the output by looping over a vec containing all the clients every 10 ms.

If you run this, you can see the percentage for each client keep increasing up to a point where it stops. This point varies from run to run. The intended behavior is for all clients to reach 100%, but that has only happened a few times. What could be causing the deadlock, and how do I stop it?

What is MutexGuard::unlock_fair? That's not a method from the standard library.

MutexGuard::unlock_fair is a method from parking_lot.

Unlocks this mutex using a fair unlock protocol.

From https://docs.rs/lock_api/0.4.0/lock_api/trait.RawMutexFair.html#tymethod.unlock_fair:

Fair unlocking means that a lock is handed directly over to the next waiting thread if there is one, without giving other threads the opportunity to "steal" the lock in the meantime. This is typically slower than unfair unlocking, but may be necessary in certain circumstances.

if req_qc.is_empty() {
    continue;
}
let req = req_qc.pop().unwrap();

Match on req_qc.pop() and continue in the Err case, instead. You might be silently killing your threads by accident.


I just did this

let req = match req_qc.pop() {
    Ok(req) => req,
    Err(_) => continue,
};

But I'm still getting the deadlock.

I'll try to take a look at home later, if nobody else has an idea. It's basically impossible to debug a deadlock on the playground due to timeouts. I also don't see any obvious deadlock. 🙁


I commented to this effect on Stack Overflow, but I think the problem is in the `unlock_fair` calls. You have dependencies between threads that aren't explicit, so you're trying to "balance" the scheduling algorithm by using a fair mutex, but that just kicks the can down the road.

More specifically, I suspect you have a livelock where N workers compete for kN jobs, but each time a worker gets a job it can't do anything with it, so it just pushes the job back on the queue. Since the number of jobs is a multiple of the number of workers, the fair mutex algorithm means that the next time a job comes up in the queue it always goes to the same worker, which still can't do anything with it and yields to the next worker. Whenever this happens for all kN jobs at the same time, you get livelock.

Fair mutexes can make algorithms more performant when used judiciously, but if you're relying on the fairness of the mutex to make your algorithm correct you're playing a risky game. Parallel algorithms should usually be robust against stuff like a thread occasionally unlocking and immediately re-locking a mutex.


This question has been answered on Stack Overflow.