The delicate dance of the sync->async bridge

I've followed the fairly common pattern, in one app, of walling off the async portion. That portion is the use of a gRPC service, meaning Tonic, which is async-only.

The service that wraps this API uses a helper like this:

    fn run<F, R>(&self, f: F) -> R
    where
        F: Future<Output = R>,
    {
        tokio::task::block_in_place(move || tokio::runtime::Handle::current().block_on(f))
    }

to make calls to that API. (It turns out that this service is also in an async context but ignore that.)

This is - fine. Sort of.

It's not really fine because block_on always carries the risk of deadlock. (run() above is used in the API calls to first obtain a connection, which involves a Tokio Mutex lock.)

If there were no guard, the server being inundated with requests could result in all of the Tokio Runtime threads waiting in block_on calls - and we're dead.

I explored using a oneshot channel as an alternative to block_on. It works, but it is really messy and verbose.

(The fact that you can't create a Runtime within one is really a frustrating limitation - a transient single-thread Runtime would be a great way to handle this.)

I settled on using a Semaphore to gate the number of active gRPC calls. If the maximum number of possible concurrent block_on calls is kept below the number of Runtime threads, we can't get a deadlock.

This is not a bad idea, except I then discovered that Rust's std lib Semaphore is deprecated.

Before I hunt down an alternative (non-async) semaphore implementation - any other ideas?
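For concreteness, the kind of non-async gate I have in mind can be hand-rolled from std's Mutex and Condvar. A minimal sketch (the type and method names here are mine, not from any crate):

```rust
use std::sync::{Condvar, Mutex};

/// A tiny counting semaphore built on std primitives - a sketch for
/// gating concurrent block_on calls, not a production implementation.
pub struct BlockOnGate {
    permits: Mutex<usize>,
    cvar: Condvar,
}

impl BlockOnGate {
    pub fn new(permits: usize) -> Self {
        Self {
            permits: Mutex::new(permits),
            cvar: Condvar::new(),
        }
    }

    /// Block until a permit is available, then take it.
    pub fn acquire(&self) {
        let mut n = self.permits.lock().unwrap();
        while *n == 0 {
            n = self.cvar.wait(n).unwrap();
        }
        *n -= 1;
    }

    /// Return a permit and wake one waiter.
    pub fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.cvar.notify_one();
    }

    /// Currently available permits (for inspection).
    pub fn available(&self) -> usize {
        *self.permits.lock().unwrap()
    }
}
```

The idea would be to wrap each run() call in acquire()/release(), with the permit count kept strictly below the runtime's worker-thread count.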

I would say that you don't have a “bridge” between sync and async; you have a flooded zone full of hazards.

You should avoid block_in_place(), and instead design so that each place where async code is called is either:

  • known to be an async context, in which case you .await, or
  • known to be a blocking context — that is, you have no async code above it on the stack — in which case, you call block_on().

Trying to serve both cases with one function will lead to messes, and block_in_place() is hazardous because it doesn't compose with other async operators like select!().

In most cases, any given thread should be either dedicated to async (as part of Tokio's worker pool, or sitting in a top-level block_on()) or dedicated to blocking operations (possibly including an occasional block_on() for async-only operations). Then you have a well-defined boundary to “bridge” over in specific cases.

That's not workable. Being in an async context isn't the same as using async functions, and of course you can only .await in async blocks.

What I was trying to prevent was completely viral async.

We have something like this:

A->B->C->D

Four layers. A is the entrypoint to the server - gRPC services. D is the layer that calls the third-party gRPC API. B & C are existing, general, non-async services.

Calls to D are in a Runtime context because A is async (Tonic). But B and C have no async code.

But I see your point. I am being pushed to the conclusion that corralling async is futile - and B & C and the rest of those layers (which is a huge amount of code) should be rewritten as async.

This is disappointing.

Aha, the typical async (A) -> sync (B & C) -> async (D) call chain. This is indeed painful in Rust (and Python[1]). I can hardly be convinced that rewriting B & C as async is the way; it just feels wrong. I ended up using the oneshot channel approach, even though it's verbose.

P.S. Have you tried pollster? It might suit the transient single-thread-runtime need, but it's not tokio.


  1. I wrote almost the same fn run in Python, and it's awful. But at least it works (most of the time). ↩︎

I'm not saying that you have to rewrite everything as async. I'm saying that broadly (there may be specific exceptions), you want to run your blocking code and async code on separate threads.

If A is async and B and C are blocking, then you can insert a spawn_blocking() when you call B from A. This is better-behaved than block_in_place().

Note that you are in a runtime context because there is a Tokio handle available in a thread-local variable, not because there is an async function "up the stack". From the point of view of D, it's not much different than if you created a new Tokio runtime and called enter on it to set up that thread local, and the fact that there's an async function "up the stack" just makes the situation worse!

I went down that path once and something took me off it. Can't recall what. I'll give it another try.

I know (the first part).

How does an async function up the stack make the situation "worse"?

(Any way to remove that threadlocal var? :grinning_face_with_smiling_eyes:)

Tokio maintainer here. When it comes to calling async -> sync -> async, it's not possible to do better than block_in_place. Trying to circumvent the protections that Tokio comes with will lead to deadlocks that can be really difficult to debug (or worse). In general, the answer is "just don't do it".

To understand why, first a bit about how async works under normal circumstances. The way that async/await works is that it makes it possible to run millions of async tasks on the same thread. The idea is that every time you run an .await, Tokio can swap the currently running async task for another async task. This means that creating async tasks can be much much cheaper than real OS threads, meaning that you can easily run millions of tasks at the same time for cheap. However, if any task spends a long time without reaching an .await, then this means that no other async task gets to run. This is called "blocking the thread", and I have a blog post about that topic that you may find helpful to read.

Anyway, back to the topic of async -> sync -> async. The problem is that due to the design of async/await, it's impossible for Tokio to swap one task for another when the current task is inside a synchronous region, even if it enters an asynchronous region internally. Now, Tokio does have the utility block_in_place as an escape hatch to make it sort of work, but it still doesn't allow for switching. When you use the function, it just tells Tokio to spawn a new worker thread and run other tasks on the new thread, since the current thread is unavailable for other tasks.

If you can, it's much better to place calls into synchronous code in spawn_blocking. This way, they do not take up one of your Tokio runtime threads.

One way to think about it is that you have three different kinds of functions, not two. They are: async, blocking, and quick. What determines whether a non-async function is blocking or quick is essentially whether it's guaranteed to complete running within a very short amount of time. You can perform the following calls:

  • Async -> async. Use an await.
  • Async -> quick. Normal function call.
  • Async -> blocking. Use spawn_blocking.
  • Blocking -> blocking. Normal function call.
  • Blocking -> quick. Normal function call.
  • Blocking -> async. Use block_on.
  • Quick -> quick. Normal function call.
  • Quick -> blocking. Not possible.
  • Quick -> async. Not possible.

So the problem is that when you perform a normal function call from async, you must call into a quick function. If you call into a blocking function, you have a bug in your code, and you may be forced to use block_in_place with all of its disadvantages, or it may just be impossible to do what you want. In these scenarios, the real fix is to call async -> blocking with spawn_blocking instead of using a normal function call.

Thanks for chiming in, Alice.

I am familiar with basically everything in your post.

So your own recommendation for the use-case I outlined is this:

which I was already using, and which introduces the possibility of deadlock, which was the entire motivation for this post.

I think I am best off rewriting those Layer D async fn signatures to return their result via a oneshot channel. Yes, the API changes, but I am not seeing a better way.

You are missing a step: since your use case is async -> blocking -> async, you also need to consider the async -> blocking step. The suggestion for that is this:

Which you are currently not following, and that's the root cause of your deadlocks.

No, there is no possibility of deadlock if used correctly.

It needs to be one of the block_on functions defined by Tokio (most likely you want Handle::block_on). As long as you use one of those, deadlock isn't possible because these functions detect incorrect usage and panic in any scenario that could lead to a deadlock.

As @SkiFire13 comments, you are missing the async -> sync part, which must be performed using spawn_blocking. If you do that, then the block_on call doesn't happen on a runtime thread, and there's no issue with exceeding the number of runtime threads.

You're right. My bad. I got spawn_blocking and block_in_place confused there.

I've been using this:

    pub fn run<F, R>(f: F) -> R
    where
        F: Future<Output = R>,
    {
        task::block_in_place(move || runtime::Handle::current().block_on(f))
    }

I've never experienced deadlocks - but have been told it is possible.

Thank you so much. I think I let myself be led astray by our smartest co-worker (ChatGPT).

I'll investigate using spawn_blocking, but, for the record, is this completely safe (from deadlock)?

    pub fn run<F, R>(f: F) -> R
    where
        F: Future<Output = R>,
    {
        task::block_in_place(move || runtime::Handle::current().block_on(f))
    }

There is nothing intrinsically wrong with that function by itself. But if you call it (and thus block_in_place) from async code that you don't fully control, and that may contain select!, timeouts, or anything else concurrent within a task, then you could cause a deadlock, because block_in_place prevents the intended control flow from happening.

The problem with this is that I wanted to keep all of this encapsulated in layer D, the API wrapper calling the external, async API.

If this is done, the call from C->D (sync to async) will need some bridge also.

EDIT: Which is the run() util I'm already using. Apologies for missing the fact that it's running it from spawn_blocking that makes it safe.

Exploring it further, the problem with a oneshot channel isn't verbosity; it's the fact that the spawned future has to be 'static, which means none of the existing code compiles.

I wrote this:

    pub fn block_on_via_channel<F, R>(fut: F) -> R
    where
        F: Future<Output = R> + Send + 'static,
        R: Send + 'static,
    {
        let (tx, rx) = oneshot::channel();

        runtime::Handle::current().spawn(async move {
            let result = fut.await;
            let _ = tx.send(result);
        });

        rx.blocking_recv()
            .expect("Async task panicked or runtime shut down")
    }

but can't call it with self methods, etc.

Going that route would entail a massive refactoring.

pollster: It seems it's intended for projects not using Tokio.

Unfortunately I wasn't able to follow all of that.

I think it's the case that no deadlock can result as long as the Tokio worker threads can't all end up in block_on at once. Is that the case? If so, a semaphore approach, limiting the number of concurrent API calls (all of the async functions that need this mechanism are part of a service wrapping one API), would work. Do you concur?

I've only ever used tokio::runtime::Handle::block_on, like this:

    pub fn run<F, R>(f: F) -> R
    where
        F: Future<Output = R>,
    {
        task::block_in_place(move || runtime::Handle::current().block_on(f))
    }

Either I'm still missing some nuance or you're saying something different than some other commenters here. If this cannot deadlock, then my initial information was wrong and I need do nothing more.

After more experimentation, this has turned out to be the best solution - almost the only general solution for calling sync->async that doesn't require a lot of refactoring, as it doesn't impose 'static lifetime requirements, since there's no spawning.

Thinking this through again, I'm sure this can deadlock:

  • f awaits another future in the same runtime (they will, sometimes)
  • that future can't be scheduled if all worker threads are busy (in block_in_place)

I don't think there's any way around a significant refactoring of the application to avoid this. Either I go to a spawn/channel method, imposing 'static lifetimes everywhere, or I adopt Alice's suggestion of corralling off way up at the A layer, using spawn_blocking.

EDIT: Again I apologize for missing the point that running within a spawn_blocking context makes block_on safe.