HELP: gRPC Server Stops Responding

I have a gRPC server built with Tonic that is exhibiting the terrible behavior of getting into a state where it is running but not handling any requests.

By "handling" I mean that the Tonic service methods are not even called when the server is hit. The first line of each such method is a trace (info!) that is never printed. Client requests time out, and the server shows no sign of being hit. (Yet it is bound to the IP, and tcpdump on the VM shows the incoming request.)

There is no memory leak or high CPU usage: when this occurs, RAM is <5 MB as always, as shown by htop, and CPU is at 0%.

There are no errors preceding the state - nothing logged (every error condition has a log, using tracing), and nothing in the stdout/stderr redirect.

I am worried this is some bug in Tonic.

The server is constructed and run like this:

fn main() -> Result<(), Box<dyn Error>> {
    let runtime = tokio::runtime::Builder::new_multi_thread()
        .enable_all()
        .build()
        .expect("Failed to construct Tokio Runtime");

    ...

    match runtime.block_on(
        Server::builder()
            .add_service(service1::new())
            .add_service(service2::new())
            .serve(address),
    ) {
        Ok(_) => Ok(()),
        Err(e) => {
            error!(".serve() error: {}", e);
            Err(Box::new(e))
        }
    }
}

The only other thing possibly noteworthy about the app is that it bridges non-async code with this utility:

    pub fn run<F, R>(f: F) -> R
    where
        F: Future<Output = R>,
    {
        task::block_in_place(move || runtime::Handle::current().block_on(f))
    }

This is always called from the context of the Runtime created above (this is the only Runtime created in the app).

Any ideas on things to check, etc.? I realize this is not a complete description of the app - I do not have a minimal reproducible sample - but I think it is all the pertinent information.

Is there anything in general that can cause behavior like this from tonic/tokio?

Is it OK to call runtime::Handle::current().block_on from within an async fn?

I found this thread warning of calling "block_on" from an async context, but it seems it is referring to futures::executor::block_on.

I see no such warning in the Tokio docs.

I'm just going over anything in the system that is slightly suspect.

I think the answer is no. I think this may well be the problem! There was a single instance of this (block_on in an async fn) but that bit is called at the head of handling every request...

Symptoms make me suspect a deadlock. Possibly within Tonic, possibly in your application code.

Can you crank up the logging to TRACE or DEBUG and see whether any issues show up that way?

It might also be possible to point tokio-console at it.


Ah, just saw your follow-up: that block_on looks like the prime suspect. (I'm suspicious of block_in_place too, since it contains the word block!)

Have you read "Async: What is blocking?" by Alice Ryhl?

Tokio is a multi-threaded runtime, so blocking is generally dangerous. But read that article to learn more. I think I'm going to reread it now :)


I had enabled debug-level trace and also suspect this single block_on causing deadlock was the culprit... will update the thread.


Yeah, that was definitely it.

Code inspection demonstrated it, but testing proved it: I've got an integration tester that submits 1 million requests from 1,000 clients. No issues.

I feel pretty stupid. I think deadlocks in general have been the most challenging thing about Rust so far for me, coming from a pure functional Scala background.
