Stuck waiting on a JoinHandle

I'm using tokio for async I/O to write the byte streams of gRPC requests. Since my entry point is synchronous, I'm writing a blocking wrapper to bridge the sync and async worlds. I followed the second pattern in the tokio guide on bridging with sync code, spawning each async I/O operation as a separate task. The following is a sample code snippet:

impl NonBlockingClient {
    pub fn new() -> Result<NonBlockingClient> {
        let rt = tokio::runtime::Builder::new_multi_thread()
            .enable_all()
            .build()?;
        let bs_client = rt.block_on(create_bs_client())?;
        Ok(Self {
            rt: rt,
            bs_client: bs_client,
            handles: vec![],
        })
    }

    /// writes a large file
    pub fn write_file(&mut self, digest: &Digest, path: &Path) -> Result<()> {
        let mut bs_client = self.bs_client.clone();
        let digest = digest.clone();
        let path = path.to_path_buf();

        let handle = self
            .rt
            .spawn(async move { bs_write_file(&mut bs_client, &digest, path).await });
        self.handles.push(handle);
        Ok(())
    }

    pub fn wait(&mut self) -> Result<()> {
        for handle in self.handles.iter_mut() {
            println!("wait on {:?}", handle);
            let res = self.rt.block_on(handle);
            if res.is_err() {
                println!("failed to upload {}", res.unwrap_err());
            }
        }

        Ok(())
    }
}

pub(crate) async fn bs_write_file(
    client: &mut BsClient,
    digest: &Digest,
    path: PathBuf,
) -> Result<()> {
    println!("write_file: {:?} path: {:?}", digest, path);
    let f = File::open(path).await?;
    let stream = WriteRequestStream::new(f, digest);

    client
        .write(stream)
        .await
        .map(|_v| ())
        .map_err(|e| anyhow::Error::msg(format!("failed to write file blob {}", e)))
}

I learned from multiple sources (e.g. this one) that join_all is problematic, so I implemented the join over all handles myself.

However, the code randomly (but with high probability) gets stuck waiting on a JoinHandle.

It would be great if someone could spare some advice on how to debug such an issue. Or is there anything off in what I'm doing?

On its own the code seems fine, as long as you don’t ever use the NonBlockingClient from within the runtime itself.

By the way, you may be interested in anyhow::Context which allows you to write

    .context("failed to write file blob")

as shorthand for your current map_err.

Your code seems fine. When the task exits, your block_on(handle) call should return. Are you sure the tasks really are successfully exiting?

One concern I have is that by using iter_mut, the handles remain in the handles vector after they have been fully awaited. This is somewhat dangerous since awaiting them after they have completed will lead to a panic. (But it will not cause it to get stuck.)

If you are looking for something to help you debug it, well, the first thing I would try is just to add some println statements at the end of bs_write_file to verify that your task really is returning from its function. If that isn't enough, you should try out the tokio-console tool.

Good to know. Will try that.

Thanks for the suggestions. More testing shows that I now sometimes also get Error: Too many open files (os error 24). Since I'm dealing with a large number of files, this error makes more sense to me, but I'm not sure whether it is related to the earlier hang. Since this new error now happens more often (with the same test data), I'm going to try limiting the number of files open concurrently and see if that helps.