I have a unit test, run via #[tokio::test(flavor = "multi_thread", worker_threads = 2)], that deadlocks. I have tried to strip it down to a minimal reproducer but have been unable to. The Mutex is from Tokio, and I'm on the latest version of the library. I have been able to track the "offending" code down to the following function:
    pub async fn is_freed(&self) -> bool {
        let state_clone = self.state.clone();
        trace!("is_free: {:?} {:?}", Arc::as_ptr(&self.state), self);
        let ret = tokio::spawn(async move {
            trace!("BEFORE");
            let ret = {
                let lock = state_clone.flags.lock().await;
                lock.is_freed()
            };
            trace!("AFTER: {}", ret);
            ret
        }).await;
        trace!("is_free returning: {:?}", ret);
        ret.unwrap()
    }
The code never gets past the .await on the JoinHandle. The first question someone will have is why I am spawning a task at all; the answer is that it demonstrates more clearly that my .await never finishes. The same happens if I remove the spawn call and await the is_freed function directly. When calling the above, I only see the following in the console (i.e., I never see the "is_free returning" message):
What could cause an .await to not return? Nowhere else in my code do I cancel tasks, and the unit test never finishes. I do have another task that is in a pretty tight loop trying to grab the same lock; however, it uses try_lock() and keeps looping if it cannot acquire the lock. This seems like a deadlock, and I know that if I call .lock() on the same Mutex twice from the same task, Tokio will deadlock; however, by the time AFTER: false is logged I should have dropped the MutexGuard, so I don't think that is what is happening.
It sounds like you are blocking the thread. This can prevent other tasks from being executed. You should fix the code that is spinning on the lock and have it use an .await instead.
You might want to add a yield_now() call above the flags.lock() call to be completely sure.
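To illustrate why a yield point matters in a retry loop, here is a minimal std-only sketch. It is not Tokio's implementation: `YieldNow`, `run_round_robin`, and `demo` are hypothetical names, the `Cell<bool>` merely stands in for a held lock, and the toy executor simply re-polls every task in turn instead of tracking wakeups. In real Tokio code the equivalent would be `tokio::task::yield_now().await` inside the loop or, better, replacing the `try_lock()` loop with `lock().await`.

```rust
use std::cell::Cell;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hand-rolled yield point (hypothetical): Pending on the first poll, Ready on the second.
struct YieldNow(bool);

impl Future for YieldNow {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            return Poll::Ready(());
        }
        self.0 = true;
        cx.waker().wake_by_ref(); // ask to be polled again (a no-op with the toy waker below)
        Poll::Pending
    }
}

/// Waker that does nothing; the toy executor re-polls everything anyway.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

type Task = Pin<Box<dyn Future<Output = ()>>>;

/// Toy single-threaded executor: polls each unfinished task in turn.
/// A task only gives up the thread when its poll returns.
fn run_round_robin(mut tasks: Vec<Task>) {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    while !tasks.is_empty() {
        tasks.retain_mut(|t| t.as_mut().poll(&mut cx).is_pending());
    }
}

/// Returns how many times the spinner retried before the "lock" was free.
fn demo() -> u32 {
    let locked = Rc::new(Cell::new(true)); // stands in for a lock held elsewhere
    let spins = Rc::new(Cell::new(0u32));

    let (l, s) = (locked.clone(), spins.clone());
    let spinner: Task = Box::pin(async move {
        while l.get() {
            // Comparable to a failed try_lock(): count the retry, then yield.
            s.set(s.get() + 1);
            // Without this await the loop never returns from poll,
            // so the releaser below would never get to run.
            YieldNow(false).await;
        }
    });

    let l = locked.clone();
    let releaser: Task = Box::pin(async move {
        l.set(false); // "drops the guard"
    });

    run_round_robin(vec![spinner, releaser]);
    spins.get()
}

fn main() {
    println!("spinner retried {} time(s)", demo());
}
```

If the `YieldNow(false).await` line is removed, the spinner's first poll never returns and the program hangs, which is the single-threaded version of a spinning task starving everything else scheduled on its worker.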
Anyway, I recommend you give tokio-console a try to debug this. Are there any tasks whose busy duration keeps going up while the poll count is constant?
There are only 2 tasks shown. For the top one, both the busy duration and the poll count continue to increase, and its Location corresponds to the "spinning task". The second task appears to have been polled only once:
If I did call lock().await on the same Mutex from the same task, causing a deadlock, how could I go about detecting that? How can I go about preventing it?
I continued to trace through the code, and the "main task" was creating a stream which had a bad poll_next implementation that was not properly polling the future. I eliminated this Stream implementation by using stream::unfold to create the stream, and everything seems to be working as expected. Thanks @alice for the pointers towards Tokio console. While it is seemingly a powerful tool, without your clue/help of "Are there any tasks whose busy duration keeps going up while the poll count is constant", I wouldn't really know what I was looking at.
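For anyone landing here later: the original Stream implementation isn't shown in this thread, so the following is only a sketch of the general contract, written as a plain Future wrapper using std alone (Stream itself lives in the futures crate). The same rule applies to poll_next: when you return Pending, the inner future must have been polled with the caller's Context so that its waker gets registered. All names here (`Delayed`, `Doubled`, `block_on`, `run`) are hypothetical.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};
use std::time::Duration;

/// State shared between the leaf future and its helper thread.
#[derive(Default)]
struct Shared {
    value: Option<u32>,
    waker: Option<Waker>,
}

/// Hypothetical leaf future: a helper thread supplies the value a bit later.
struct Delayed(Arc<Mutex<Shared>>);

impl Delayed {
    fn new(value: u32) -> Self {
        let shared = Arc::new(Mutex::new(Shared::default()));
        let for_thread = shared.clone();
        thread::spawn(move || {
            thread::sleep(Duration::from_millis(20));
            let mut s = for_thread.lock().unwrap();
            s.value = Some(value);
            // Wake whoever polled us last, if anyone did.
            if let Some(w) = s.waker.take() {
                w.wake();
            }
        });
        Delayed(shared)
    }
}

impl Future for Delayed {
    type Output = u32;
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<u32> {
        let mut s = self.0.lock().unwrap();
        if let Some(v) = s.value {
            return Poll::Ready(v);
        }
        // Not ready: store the waker so the helper thread can wake the task.
        s.waker = Some(cx.waker().clone());
        Poll::Pending
    }
}

/// Wrapper that delegates correctly: it polls the inner future with the
/// caller's Context and only returns Pending because the inner future did.
struct Doubled {
    inner: Pin<Box<dyn Future<Output = u32>>>,
}

impl Future for Doubled {
    type Output = u32;
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<u32> {
        match self.inner.as_mut().poll(cx) {
            Poll::Ready(v) => Poll::Ready(v * 2),
            // Returning Pending *without* the poll above would register no
            // waker, and the task would never be scheduled again.
            Poll::Pending => Poll::Pending,
        }
    }
}

/// Waker that unparks the thread running block_on.
struct ThreadWaker(Thread);
impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

/// Minimal executor: poll, and park the thread until woken.
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = Box::pin(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(v) => return v,
            Poll::Pending => thread::park(),
        }
    }
}

fn run() -> u32 {
    block_on(Doubled { inner: Box::pin(Delayed::new(21)) })
}

fn main() {
    println!("result: {}", run());
}
```

Helpers such as stream::unfold sidestep this class of bug because the polling is generated from an async block rather than written by hand, which matches what fixed the problem above.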