Tokio and block_in_place cause program to deadlock

Hello, everyone. I have been stuck on this one for a few days. Hopefully, someone can help me with the issue I'm facing; I have inherited the project, so it could be that I'm missing some obvious bugs.

I want to immediately apologize for the long post; I wanted to add as much context as possible.
I think I have a bug similar to this one, but I'm not 100% sure.

Some context

The system I've been working on uses the Tokio runtime. Unfortunately, we have to implement sync trait(s) because an external crate requires them, and we have no control over that.
The kicker is that the work that has to be executed in these traits is async - I have to fetch some data using JSON-RPC over IPC. Here is how I have approached it:

use std::future::Future;

#[derive(Clone, Debug)]
struct FutureRunner {
    runtime_handle: tokio::runtime::Handle,
}

impl FutureRunner {
    /// Creates a new instance of FutureRunner.
    /// IMPORTANT: MUST be called from within the Tokio context, otherwise it will panic.
    fn new() -> Self {
        Self {
            runtime_handle: tokio::runtime::Handle::current(),
        }
    }

    /// Runs the given future to completion from a sync context.
    fn run<F, R>(&self, f: F) -> R
    where
        F: Future<Output = R>,
    {
        // Tell the runtime this worker thread is about to block, then drive
        // the future to completion on this thread.
        tokio::task::block_in_place(move || self.runtime_handle.block_on(f))
    }
}
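
To make the usage concrete, here is a rough sketch of how this ends up wired into one of the sync traits (the trait, client, and method names here are invented stand-ins, not the real ones):

trait StateProvider {
    fn fetch_state(&self, key: u64) -> Option<Vec<u8>>;
}

/// Stand-in for the real JSON-RPC-over-IPC client.
#[derive(Clone, Debug)]
struct IpcClient;

impl IpcClient {
    async fn get_state(&self, _key: u64) -> Option<Vec<u8>> {
        // The real code awaits an RPC response here.
        None
    }
}

#[derive(Clone, Debug)]
struct RemoteStateProvider {
    runner: FutureRunner,
    client: IpcClient,
}

impl StateProvider for RemoteStateProvider {
    fn fetch_state(&self, key: u64) -> Option<Vec<u8>> {
        // The sync-to-async bridge: blocks this thread until the RPC future resolves.
        self.runner.run(self.client.get_state(key))
    }
}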

AFAIK this is the correct way to handle async-sync-async bridging. Most of the calls are pretty short, usually a couple of hundred microseconds, but there are a lot of them - I'm ballparking here, but we are talking in the range of a couple hundred up to a couple thousand per millisecond.

Here is the issue I've observed:

Deadlock

I have a worker task that will do the following:

  1. tokio::spawn a cleaner job, which
    1. listens on a bounded mpsc::Receiver
    2. on an event, does cleanup work
  2. Start the worker loop, which
    1. listens on a bounded mpsc::Receiver channel
    2. on an event, does some processing

The issue is that both of these try to take the same parking_lot::Mutex when performing some work.
This hasn't been a problem so far because the work done under the lock was relatively short compute work.
However, now, during cleanup, some of these internal calls use the method described above to perform RPC-over-IPC calls instead of just reading from memory.
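
To make the shape of the problem concrete, here is a heavily simplified sketch of the setup as I understand it (all names are invented, reusing the FutureRunner and the IpcClient stub from above; the real code does a lot more):

use std::sync::Arc;
use tokio::sync::mpsc;

async fn run_pipeline(
    runner: FutureRunner,
    shared: Arc<parking_lot::Mutex<Vec<u64>>>,
    mut clean_rx: mpsc::Receiver<u64>,
    mut work_rx: mpsc::Receiver<u64>,
) {
    // 1. Cleaner task: on an event it takes the lock and, since the recent
    //    change, performs RPC over IPC through the sync bridge while the
    //    guard is still held.
    let cleaner_shared = Arc::clone(&shared);
    tokio::spawn(async move {
        while let Some(key) = clean_rx.recv().await {
            let mut guard = cleaner_shared.lock();
            let _state = runner.run(IpcClient.get_state(key)); // blocks with the lock held
            guard.retain(|k| *k != key);
        }
    });

    // 2. Worker loop: needs the same lock for its processing.
    while let Some(key) = work_rx.recv().await {
        let mut guard = shared.lock(); // a runtime worker thread can block here
        guard.push(key);
        // ... processing ...
    }
}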

Here is the tracing output at the point at which the app freezes (note here the lock is already held by the cleaner):

provider::remote_state_provider: Future::start
provider::remote_state_provider: Future::block_in_place
provider::remote_state_provider: Future::block_on
order_input: Going to process order pool commands and take the lock

All of the provider::remote_state_provider:* lines are printed by the cleaner.
After ...take the lock is printed, the whole program halts (even though it has enough threads at hand, both worker and blocking ones).
Note: this doesn't always happen, but it does happen often.

My guess

Here is my understanding of how Tokio works (I could be completely off base here; not an expert, not even close):

  1. Only one worker thread at a time can drive I/O:
    1. This was mentioned by Darksonn here
    2. And here we see it in the code when Tokio parks the thread (link)
      1. To follow the rabbit hole: step 1, step 2, step 3, step 4 (or, for the non-timeout version, we end up here) - this calls shared.driver.try_lock
  2. As far as I understand from the docs, even if my function calls block_in_place, which calls block_on - all of which happens on a blocking thread (i.e. a thread pool separate from the one in charge of the runtime) - the async tasks it spawns will still be scheduled on the main runtime (small example below):
    1. According to the docs, "Any tasks or timers which the future spawns internally will be executed on the runtime."
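
For example, this is how I read that (minimal standalone sketch, assuming tokio with the rt-multi-thread and time features): block_on is called from a plain thread, but the spawned task and its timer still run on the runtime's worker threads.

use std::time::Duration;

fn main() {
    let rt = tokio::runtime::Runtime::new().unwrap(); // multi-thread runtime
    let handle = rt.handle().clone();

    let result = std::thread::spawn(move || {
        // block_on drives the outer future on this plain thread, but the
        // spawned task (and its sleep timer) are executed by the runtime.
        handle.block_on(async {
            let task = tokio::spawn(async {
                tokio::time::sleep(Duration::from_millis(10)).await;
                42
            });
            task.await.unwrap()
        })
    })
    .join()
    .unwrap();

    assert_eq!(result, 42);
}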

Given the two facts above, here is what I think is happening:

  1. The cleaner task takes the mutex lock
    1. It runs the block_in_place and block_on code, which produces some tasks that are scheduled onto the Tokio runtime
  2. The thread executing the worker loop at some point takes the I/O driver lock, then reaches the locked mutex and stops
    1. Since it is stopped, it cannot drive the I/O anymore
    2. The cleaner thread cannot proceed because it is blocked in block_in_place and block_on (which wait for the I/O tasks to finish - which is not happening)
    3. So the cleaner thread cannot release the mutex
    4. And the worker thread doesn't drive the I/O
    5. Deadlock

I have "fixed" it by changing the way worker tread takes lock to try_lock_for(Duration::from_millis(10)). Ofc. this is far from a proper fix, but before doing it properly, I wanted to understand what was happening here.

I think this wouldn't have happened if:

  1. The worker had been spawned using std::thread::spawn
  2. Tokio's mutex had been used, since it has to be awaited (rough sketch below)
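
For option 2, I mean something along these lines (simplified sketch, assuming the same data were guarded by tokio::sync::Mutex instead):

use std::sync::Arc;
use tokio::sync::mpsc;

// With tokio's async mutex the worker loop awaits the lock instead of blocking
// the thread, so the runtime can keep driving I/O (and the cleaner's RPC)
// while the worker waits.
async fn worker_loop(
    shared: Arc<tokio::sync::Mutex<Vec<u64>>>,
    mut work_rx: mpsc::Receiver<u64>,
) {
    while let Some(key) = work_rx.recv().await {
        let mut guard = shared.lock().await; // suspends instead of blocking
        guard.push(key);
        // ... processing ...
    }
}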

Is my analysis correct? If not, can someone please explain to me what's happening?

P.S.
The code is open source, so if need be I can share links to the specific lines of code.

Yes. The block_in_place is a complication, not essential; the fundamental problem here is that it is incorrect to use a blocking mutex from async code unless all uses of the mutex are “relatively short compute work”, because it can create deadlocks by violating the async runtime’s expectations that Futures will suspend whenever they cannot make progress.
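
For example (a minimal sketch, not your actual code; the sleep stands in for the RPC): a blocking mutex stays acceptable when everything done while holding it is short compute work and the guard never lives across a wait.

use std::sync::Arc;
use std::time::Duration;

async fn cleanup(shared: Arc<parking_lot::Mutex<Vec<u64>>>, key: u64) {
    // Fine: short critical section, guard dropped before any waiting happens.
    let needs_rpc = {
        let guard = shared.lock();
        guard.contains(&key)
    };

    if needs_rpc {
        // The slow, waiting part happens without the lock held.
        tokio::time::sleep(Duration::from_millis(1)).await; // stand-in for the RPC
    }

    // Re-take the lock briefly to apply the result.
    shared.lock().retain(|k| *k != key);
}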

You should also be very careful with having a mutex held for a long time even if its implementation is an async fn lock() variety — think about its interaction with other blocking/waiting and make sure you don’t get another deadlock even at the async level.


Thank you very much for replying.
Regarding mutex locking, I think (I'd need to benchmark it) that initially this was relatively short compute work, but now that I've introduced IPC calls into it, it's not anymore.

Performance is irrelevant to the presence of deadlocks,[1] so you do not have to benchmark. You have to think of those as two completely different cases. The important part is that the code executed while holding the lock might then wait on something else (RPC IO, possibly other things). That's the key difference between the previous situation where using a blocking mutex was acceptable, and the current situation where it is not.


  1. though it can affect how likely a deadlock is to occur in practice
