Hashing a bunch of files in parallel

Hey, I made a simple hashdeep-like program that creates a list of hashes of all the files in a directory, which can be audited later to see if any of those files changed. It's mostly a wrapper around the crypto crate.

The source is available here: GitHub - jonnyso/hashgoblin: Simple hashdeep like. I would appreciate any feedback, but I was mainly curious whether it would be possible to make the hashing faster by doing it in parallel.

I don't know a whole lot about hashing in general and am just blindly following crypto's documentation for now :sweat_smile:, but eventually I intend to use this to hash a LOT of files, and some very big ones too. Here's how I'm currently doing it:

Currently I'm using this function to hash multiple files in parallel, but is it possible to also do it for a single file?

use digest::DynDigest;
use std::fs::File;
use std::io::{self, BufRead, BufReader};
use std::path::Path;
use std::sync::atomic::{AtomicBool, Ordering};

// `Hashed` and `encode` are defined elsewhere in the crate.
fn hash_file(path: &Path, hasher: &mut dyn DynDigest, cancel: &AtomicBool) -> io::Result<Hashed> {
    let mut reader = BufReader::new(File::open(path)?);
    loop {
        // Bail out between chunks if another thread requested cancellation.
        if cancel.load(Ordering::Acquire) {
            return Ok(Hashed::Canceled);
        }
        // Stream the file through the hasher one buffer at a time.
        let data = reader.fill_buf()?;
        if data.is_empty() {
            break;
        }
        let length = data.len();
        hasher.update(data);
        reader.consume(length);
    }
    Ok(Hashed::Value(encode(hasher.finalize_reset())))
}
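
For reference, here is roughly the shape of how I fan that function out over a list of files. This is not the exact code from the repo, just a sketch: Sha256 stands in for whatever hasher is configured, and `Hashed` / `hash_file` are from the snippet above.

use sha2::{Digest, Sha256};
use std::path::PathBuf;
use std::sync::atomic::AtomicBool;
use std::{io, thread};

// Illustrative driver only: one scoped thread per file. With a huge
// number of files the thread count should be bounded instead.
fn hash_all(paths: &[PathBuf]) -> Vec<io::Result<Hashed>> {
    let cancel = AtomicBool::new(false);
    thread::scope(|scope| {
        let handles: Vec<_> = paths
            .iter()
            .map(|path| {
                let cancel = &cancel;
                scope.spawn(move || {
                    let mut hasher = Sha256::new();
                    hash_file(path, &mut hasher, cancel)
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}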

An SSD can handle more concurrent reads than you have CPU cores. Hashing algorithms generally can't make use of thread parallelism, but they already use parallel instructions (AVX & co).

In short, read as many files as possible* concurrently and dump their contents into the hash function as fast as possible (link to my old work, which I want to revise at some point to allow any hash size).

*Within reason: be a nice neighbor and just spawn as many threads as you have CPUs — roughly as in the sketch below.
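
A rough sketch of what I mean: workers pull indexes off a shared atomic counter, so the number of threads stays at the number of CPUs no matter how many files there are. The actual hashing is stubbed out here.

use std::path::PathBuf;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

fn hash_many(paths: &[PathBuf]) {
    let next = AtomicUsize::new(0);
    let workers = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    thread::scope(|scope| {
        for _ in 0..workers {
            scope.spawn(|| loop {
                // Claim the next file; stop when the list is exhausted.
                let i = next.fetch_add(1, Ordering::Relaxed);
                let Some(path) = paths.get(i) else { break };
                // Read and feed the hasher here (e.g. the hash_file
                // function from the first post).
                let _ = std::fs::read(path);
            });
        }
    });
}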


Most hash functions cannot be parallelized internally because of their specification. BLAKE3 is a more recent hash function that can use parallelism internally. The Rust implementation, with the rayon feature enabled, provides an update_rayon method that hashes in parallel, but it only pays off with a very large buffer (at least 128 KiB), which means BufReader is not enough and you must do your own buffering. There is also the update_mmap_rayon method, which uses memory mapping and only expects the file path; that is what I would use if I were you. The documentation advises against using it on a spinning hard drive, however.
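
The mmap + rayon path looks roughly like this (a sketch only; it needs the blake3 crate with the mmap and rayon features enabled in Cargo.toml):

use std::io;
use std::path::Path;

// Hash one file with BLAKE3, letting the crate handle the memory
// mapping and spread the work over rayon's thread pool internally.
fn blake3_file(path: &Path) -> io::Result<String> {
    let mut hasher = blake3::Hasher::new();
    hasher.update_mmap_rayon(path)?;
    Ok(hasher.finalize().to_hex().to_string())
}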

But first, you should make sure that the CPU is the bottleneck and not the storage. If the storage is the bottleneck, more parallelism will make things worse.
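
A crude way to check is to time the read and the hash separately on a representative file; if the read dominates, the disk is the bottleneck and extra threads won't help. A sketch, using sha2 just as an example hasher:

use sha2::{Digest, Sha256};
use std::path::Path;
use std::time::Instant;

fn profile(path: &Path) -> std::io::Result<()> {
    // Time the I/O on its own by pulling the whole file into memory
    // first (assumes the test file fits in RAM).
    let start = Instant::now();
    let data = std::fs::read(path)?;
    let read_time = start.elapsed();

    // Time the hash over the in-memory bytes; black_box keeps the
    // optimizer from discarding the unused digest.
    let start = Instant::now();
    std::hint::black_box(Sha256::digest(&data));
    let hash_time = start.elapsed();

    println!("read: {read_time:?}, hash: {hash_time:?}");
    Ok(())
}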


Oh! I hadn't even thought about that. This thing will need to run on spinning hard drives. I'll look into it, thanks!