Concurrency bug - implementing Raft

Hello! I am trying to implement Raft in rust. I am referring to this implementation for parts that I am not sure of and my implementation is very similar to it. But there is a deadlock in my code that I am not able to resolve. So I figured it would be a good idea to have another pair of eyes to look for this concurrency bug. If someone is interested, please have a look. My current implementation: https://github.com/mdlfop/fun-with-raft.
PS: It's a hobby project and I am a novice. Any suggestions are welcome.
Thanks.

I gave a brief look, maybe I found the cause of the problem.

As you can see, you are calling handle_request_vote and handle_append_entry. If you look at the implementation of the two functions:

You can see that you lock the mutex at lines 250 and 273. However, request_vote and append_entry are called from:

You can see that you lock the mutex before calling append_entry, which, I think, is causing the deadlock. Try to work with these and see if they are the real issue.

3 Likes

Thanks a lot, @dodomorandi. I managed to solve the problem (I think) by releasing the lock before making the request_vote or append_entry grpc call. And then trying to reacquire the lock right after the gRpc call and before other steps (by expicitly maiknig use of drop) Is there any other way to do this? I know that I can create a new block when I lock the mutex.

let rs = enclosed.0.lock().await;
rs.debug(format!("sending RequestVote [{:?}]to {}", args, peer_id));
let remote = rs.peer_connections.get(&peer_id).unwrap().clone();
drop(rs);
let reply = { 
       let mut client = RpcClient::connect(remote).await.unwrap();
       client.request_vote(args).await.unwrap().into_inner()
};
let mut rs = enclosed.0.lock().await;
rs.debug(format!("received RequestVote reply: [{:?}]", reply));
if rs.state != ServerState::Candidate {
        rs.debug(format!("while waiting for the reply, state = [{}]", rs.state));
        return;
}

Also, I was aware that it's wrong to hold locks when you are blocking it (for example making the rpc call in my code), but I thought that as I was using tokio::sync::Mutex instead of the std one, it was somehow okay. Maybe I have to read about it some more.
And are there any tools or techniques in rust that can help me in detecting such errors fast? I have heard of loom tool but I don't know if it can be used in my case?
Once again, thanks a lot! :slight_smile:

You can try something I am going to tell you, but I am not totally sure it work 100% of the time in case of tokio::sync::Mutex.
You can create your own mutex module, with Mutex and MutexGuard structs that wraps the relative type. Then you can create a thread_local RefCell<Option<Backtrace>> from backtrace crate (or something similar) that is set with Some(Backtrace::new()) on Mutex::lock and with None on MutexGuard::drop. If when calling Mutex::lock a backtrace is already set, you panic indicating that the mutex was already locked in the same thread, giving some useful information where the first lock occurred.

I am not totally sure this approach will work for async Mutexes. For instance, I am not sure if the resulting future could be Send, and if the executor is allowed to move the coroutine between threads across suspension points -- probably not, but I am not 100% sure.

Said that, the best thing you could do is try to refactor the code to not use Mutexes at all and opt for different synchronization models like channels. It can be not so easy, mainly because it involves mental overhead for switching to a totally different approach. On the other hand, I already saw that, unfortunately, Mutexes do not scale well, and for bigger project they can be a PITA.
Try to approach the problem using different synchronization primitives. Remember that when you have a Mutex around, you are not really doing parallelism. If you are able to use channels or similar, you will probably obtain better performance.

1 Like