Hash files in parallel using asynchronous filesystem operations

Hi,
I have a Vec<PathBuf> that I need to convert to a Vec<[u8; 32]> containing the hash of each file. Since hashing is CPU-intensive, I'm using rayon to iterate over the paths in parallel. However, my project uses tokio, so I don't want to hash with plain blocking calls, as that would block the current thread.

Currently I do the following:

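use rayon::prelude::*;
use std::path::PathBuf;

let paths: Vec<PathBuf> = get_paths();
let hashes = paths
    .into_par_iter()
    .map(|file_path| -> Result<(PathBuf, [u8; 32])> {
        // Drive the async hash to completion on the current rayon worker thread.
        futures::executor::block_on(async {
            let digest = hash(tokio::fs::File::open(&file_path).await?).await?;
            // Return the path together with its hash, matching the declared type.
            Ok((file_path, digest))
        })
    });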

where

pub async fn hash<T>(reader: T) -> crate::Result<[u8; 32]>
where
    T: tokio::io::AsyncRead + std::marker::Unpin,

The hash function repeatedly reads a chunk of the file into a buffer and feeds that buffer to the hasher, so the file can be hashed without reading it all into memory at once.
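For readers who want to see the shape of that loop: the body of hash isn't shown in the post, so the following is only a rough sketch of the same read-into-buffer, feed-the-hasher pattern. It uses blocking std::io::Read instead of tokio::io::AsyncRead, and a toy 64-bit FNV-1a hasher as a stand-in for whatever 32-byte hash is really used. The names Fnv1a and hash_reader are made up for illustration.

```rust
use std::io::{self, Read};

/// Stand-in incremental hasher (64-bit FNV-1a). This is a hypothetical
/// placeholder; the real code would use a 32-byte cryptographic hash.
pub struct Fnv1a(u64);

impl Fnv1a {
    pub fn new() -> Self {
        Fnv1a(0xcbf2_9ce4_8422_2325) // FNV-1a 64-bit offset basis
    }

    /// Feed one chunk of input into the running hash state.
    pub fn update(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= u64::from(b);
            self.0 = self.0.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
        }
    }

    pub fn finish(&self) -> u64 {
        self.0
    }
}

/// Hash everything `reader` yields, holding at most `buf_size` bytes
/// of it in memory at any time.
pub fn hash_reader<R: Read>(mut reader: R, buf_size: usize) -> io::Result<u64> {
    assert!(buf_size > 0, "buffer must not be empty");
    let mut hasher = Fnv1a::new();
    let mut buf = vec![0u8; buf_size];
    loop {
        let n = match reader.read(&mut buf) {
            Ok(0) => break, // end of input
            Ok(n) => n,
            // Retry if the read was interrupted by a signal.
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        // Only the first `n` bytes of the buffer are valid for this read.
        hasher.update(&buf[..n]);
    }
    Ok(hasher.finish())
}
```

Since the hasher state is updated chunk by chunk, the result doesn't depend on the buffer size; passing a std::fs::File as the reader hashes a file of any size with a fixed amount of memory.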

However, I'm very inexperienced when it comes to asynchronous code and parallelism. Is using futures::executor::block_on sound in this case? Should I use tokio::task::block_in_place and just use blocking filesystem operations?

Maybe you could use the Stream API?

I probably wouldn't do it like that. Instead, I would call tokio::task::spawn_blocking once per file, and inside that blocking task first read the file with std::fs::read and then hash its contents.

Have you read this article?

This is what I ended up doing. According to my rudimentary benchmarks, it's as fast as using rayon parallel iteration. I was just confused about whether it could be done with asynchronous file operations.

I have read that article - it says that for CPU-bound computations, spawn_blocking is suboptimal, but also, you shouldn't use synchronous I/O with rayon. Since hashing is CPU-intensive and uses I/O, I wasn't sure whether to use tokio or rayon.

For something like this, it's fine to just put it in spawn_blocking. Hashing isn't that expensive.

Tokio's async file IO just calls spawn_blocking internally, so there's no advantage to using it here.
