Fastest way to create a hash of the contents of a file

I am a beginner trying to write a Rust program that compares two folders (possibly on different machines; think of a file backup/Dropbox-like program) using hashes.
For this I want a reasonably (ideally: very) fast hashing algorithm that produces a unique ID from a file on disk.

I have looked at seahash as well as xxhash-rust, which both advertise speeds of 5+ GB/s, but in my benchmarks I only see around 500 MB/s (that includes read time as well, so it's hard to compare the values directly).
When I test my hard drive (an NVMe SSD), I get around 2-3 GB/s read speed (I use WSL 2 under Windows 11), so I am wondering whether something is wrong with my code.

MWE

A minimal working example:

First, create a random file of size 3 GB: head -c 3G </dev/urandom > example.bin

This is my code so far (note: for the xxhash-rust crate, we need cargo add xxhash-rust --features xxh3,const_xxh3):

use std::time::Instant;
use std::io::Read;

use seahash;
use xxhash_rust;


fn hash_file_seahash(filepath: String) -> u64 {
    let start = Instant::now();
    
    let mut file = std::fs::File::open(filepath).unwrap();
    let mut buffer = Vec::new();
    file.read_to_end(&mut buffer).unwrap();

    let res = seahash::hash(&buffer);
    let num_bytes = buffer.len();
    
    let dur = start.elapsed().as_secs_f64();
    let num_bytes = (num_bytes as f64) / 1_048_576f64;
    
    println!("Time hash_file_seahash(): in {:.2}s at {:.2} MB/s",
             dur, num_bytes as f64 / dur);
    res
}

fn hash_file_xxhash(filepath: String) -> u64 {
    let start = Instant::now();
    
    let mut file = std::fs::File::open(filepath).unwrap();
    let mut buffer = Vec::new();
    file.read_to_end(&mut buffer).unwrap();

    let res = xxhash_rust::xxh3::xxh3_64(&buffer);
    let num_bytes = buffer.len();
    
    let dur = start.elapsed().as_secs_f64();
    let num_bytes = (num_bytes as f64) / 1_048_576f64;
    
    println!("Time hash_file_xxhash(): in {:.2}s at {:.2} MB/s",
             dur, num_bytes as f64 / dur);
    res
}


fn main() {
    let filepath = String::from("example.bin");
    println!("File: {} ({:.2} MB)",
             filepath, std::fs::metadata(&filepath).unwrap().len() as f64 / 1_048_576f64);
    
    let seahash = hash_file_seahash(filepath.clone());
    println!("seahash: {}", seahash);
    println!("");

    let xxhash = hash_file_xxhash(filepath.clone());
    println!("xxhash: {}", xxhash);
}

Then I build and run the program with cargo build --release && target/release/problem
and get the following output:

File: example.bin (3072.00 MB)
Time hash_file_seahash(): in 6.11s at 502.45 MB/s
seahash: 7840645262702756540

Time hash_file_xxhash(): in 5.56s at 552.35 MB/s
xxhash: 4352335203902048791

Question

How can I make this faster? When I measure just the reading of the bytes and leave out the hashing, I still only see around 750 MB/s (I would expect ~3 GB/s from my SSD).
I have tried reading into the buffer with std::fs::read instead of read_to_end(), but that didn't really help.

Are you compiling with -Ctarget-feature=+avx2? Otherwise xxhash won't use AVX2 instructions for better performance.

The fastest way to read a file is usually to memory map it.

use memmap2::Mmap;

let file = std::fs::File::open(filepath).unwrap();
// instead of reading into a Vec:
// let mut buffer = Vec::new();
// file.read_to_end(&mut buffer).unwrap();
let buffer = unsafe { Mmap::map(&file).unwrap() };

There are a bunch of options for Mmap on Unix OSes that may make this faster.

You can also try incrementally hashing the file, which is closer to your version in compatibility but avoids having to fit the entire file in RAM.

The 5GB/s benchmark likely doesn't include loading the file, though, so trying to get 5GB/s including reading the file may be impossible.

Here are those approaches in code:

use std::hash::Hasher;
use std::io::{BufRead, BufReader, Seek};
use std::path::Path;
use std::time::Instant;

use memmap2::Mmap;
use seahash;
use xxhash_rust;

fn hash_file_mmap<H>(filepath: &Path, name: &str) -> u64
where
    H: Hasher + Default,
{
    let start = Instant::now();

    let file = std::fs::File::open(filepath).unwrap();
    // let mut buffer = Vec::new();
    // file.read_to_end(&mut buffer).unwrap();
    let buffer = unsafe { Mmap::map(&file).unwrap() };

    let mut hasher = H::default();
    hasher.write(&buffer);
    let res = hasher.finish();
    let num_bytes = buffer.len();

    let dur = start.elapsed().as_secs_f64();
    let num_bytes = (num_bytes as f64) / 1_048_576f64;

    println!(
        "{}: in {:.2}s at {:.2} MB/s",
        name,
        dur,
        num_bytes as f64 / dur
    );
    res
}

fn hash_file_buf<H>(filepath: &Path, name: &str) -> u64
where
    H: Hasher + Default,
{
    let start = Instant::now();

    let file = std::fs::File::open(filepath).unwrap();
    let mut file = BufReader::new(file);

    let mut hasher = H::default();
    loop {
        let buf = file.fill_buf().unwrap();
        let buf_len = buf.len();
        if buf_len == 0 {
            break;
        }
        hasher.write(buf);
        file.consume(buf_len);
    }
    let res = hasher.finish();
    let num_bytes = file.stream_position().unwrap();

    let dur = start.elapsed().as_secs_f64();
    let num_bytes = (num_bytes as f64) / 1_048_576f64;

    println!(
        "{}: in {:.2}s at {:.2} MB/s",
        name,
        dur,
        num_bytes as f64 / dur
    );
    res
}

fn hash_file_seahash_mmap(filepath: &Path) -> u64 {
    hash_file_mmap::<seahash::SeaHasher>(filepath, "seahash mmap")
}

fn hash_file_xxhash_mmap(filepath: &Path) -> u64 {
    hash_file_mmap::<xxhash_rust::xxh3::Xxh3>(filepath, "xxhash mmap")
}

fn hash_file_seahash_buf(filepath: &Path) -> u64 {
    hash_file_buf::<seahash::SeaHasher>(filepath, "seahash buffered")
}

fn hash_file_xxhash_buf(filepath: &Path) -> u64 {
    hash_file_buf::<xxhash_rust::xxh3::Xxh3>(filepath, "xxhash buffered")
}

fn main() {
    let filepath = "example.bin";
    println!(
        "File: {} ({:.2} MB)",
        filepath,
        std::fs::metadata(filepath).unwrap().len() as f64 / 1_048_576f64
    );

    let seahash = hash_file_seahash_mmap(filepath.as_ref());
    let seahash2 = hash_file_seahash_buf(filepath.as_ref());
    assert_eq!(seahash, seahash2);
    println!("seahash: {}", seahash);
    println!();

    let xxhash = hash_file_xxhash_mmap(filepath.as_ref());
    let xxhash2 = hash_file_xxhash_buf(filepath.as_ref());
    assert_eq!(xxhash, xxhash2);
    println!("xxhash: {}", xxhash);
}

Not always. In fact, quite often not. @BurntSushi has an old blog post that, among other things, discusses the pros and cons of mmap in ripgrep: ripgrep is faster than {grep, ag, git grep, ucg, pt, sift} - Andrew Gallant's Blog

The TL;DR (as I understand it) is that mmap is a win for a few large files but a loss on many small files.

Depending on your expected workload, it may be beneficial to compare files block-by-block instead of hashing the entire thing immediately: As soon as you find a difference, there is no need to read the rest of the file (as you know that they are different). Also, if the goal is to synchronize things, you can choose to transfer only the changed blocks instead of the entire file.
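
A minimal sketch of that per-block idea (the 1 MiB block size is an arbitrary choice, and xxh3_64 stands in for whichever hash you pick): each side computes a list of block hashes, comparison can stop at the first mismatch, and only mismatching blocks need to be transferred.

use std::fs::File;
use std::io::Read;

fn block_hashes(path: &str) -> std::io::Result<Vec<u64>> {
    let mut file = File::open(path)?;
    let mut buf = vec![0u8; 1 << 20]; // 1 MiB blocks
    let mut hashes = Vec::new();
    loop {
        // Fill the block completely unless we hit EOF, so block
        // boundaries are stable no matter how the OS splits the reads.
        let mut n = 0;
        while n < buf.len() {
            let m = file.read(&mut buf[n..])?;
            if m == 0 {
                break;
            }
            n += m;
        }
        if n == 0 {
            break; // clean EOF on a block boundary
        }
        hashes.push(xxhash_rust::xxh3::xxh3_64(&buf[..n]));
    }
    Ok(hashes)
}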

Thanks for all the answers so far. I will look into each of them with more care and report back.

@drewtato, when I run your code with the defaults on the 3 GB binary file, I only get around 40 MB/s.

@bjorn3 you mean like this? RUSTFLAGS='-C target-feature=+avx2' cargo build --release && target/release/problem?

Currently you are underutilizing your CPU by waiting for all the IO up front, during which the CPU is doing no work. You could run the hash function on blocks of the input in a separate thread; then previously read blocks can be hashed while you're waiting for more input from the drive. I would recommend processing blocks on this thread from an mpsc channel (i.e. a queue) so that your choice of block size doesn't affect utilization, only syscall overhead.
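
A minimal sketch of that pipeline, assuming xxhash-rust's streaming Xxh3 hasher and an arbitrary 1 MiB block size:

use std::io::Read;
use std::sync::mpsc;
use std::thread;

fn hash_file_pipelined(filepath: &str) -> std::io::Result<u64> {
    let mut file = std::fs::File::open(filepath)?;
    let (tx, rx) = mpsc::channel::<Vec<u8>>();

    // Reader thread: pulls blocks off the drive and queues them.
    let reader = thread::spawn(move || -> std::io::Result<()> {
        loop {
            let mut block = vec![0u8; 1 << 20];
            let n = file.read(&mut block)?;
            if n == 0 {
                break; // EOF; dropping tx closes the channel
            }
            block.truncate(n);
            if tx.send(block).is_err() {
                break; // receiver hung up
            }
        }
        Ok(())
    });

    // Hash blocks as they arrive, overlapping with the reads.
    let mut hasher = xxhash_rust::xxh3::Xxh3::new();
    for block in rx {
        hasher.update(&block);
    }
    reader.join().unwrap()?;
    Ok(hasher.digest())
}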

Yes
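
An equivalent, persistent alternative is to put the flag in a .cargo/config.toml in the project directory instead of setting RUSTFLAGS on every invocation:

# .cargo/config.toml
[build]
rustflags = ["-C", "target-feature=+avx2"]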

In my experience (yadf by me, fclones by @pkolaczk), as long as you don't use a cryptographic hashing algorithm, the particular algorithm used won't matter that much in this context.

Piotr has done much more work than me on this topic, so I hope he can give more insight.

It depends on your hardware, but you probably need to do many parallel reads rather than one sequential read. If you hash four files at the same time, what's your read performance like?
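
For example, a quick way to test that with plain std threads (a sketch; the path list is a placeholder):

use std::thread;

fn hash_files_parallel(paths: Vec<String>) -> Vec<(String, u64)> {
    let handles: Vec<_> = paths
        .into_iter()
        .map(|path| {
            thread::spawn(move || {
                // One sequential read + hash per thread; the reads
                // themselves run in parallel across the threads.
                let data = std::fs::read(&path).unwrap();
                (path, xxhash_rust::xxh3::xxh3_64(&data))
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}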

Sorry, I forgot to mention that the buffered one is unlikely to be faster unless you are low on RAM.

I tried these out (just the seahash ones for now) and they're pretty consistent. It does about 2.5 GB/s with mmap and with the buffered reader, and 3 GB/s with the Vec. I also tried seahash::hash vs seahash::SeaHasher to make sure, and they were mostly the same. This is on native Windows though, where I've found mmap isn't very fast, and that likely applies to WSL 2. Native Linux would be faster.

Turning on AVX2 makes it slower. Probably also a Windows thing somehow.

If I don't count reading the file, the Vec one does 9 GB/s.

I'm guessing most of your slowdown is from WSL 2. Try native Windows or Linux. Also, for WSL 2, make sure your files are on the Linux filesystem, not the Windows filesystem.

Lastly, you don't need to do cargo build --release && target/release/problem. It's the same as cargo run --release except for some environment variables.

Windows/WSL IO is actually stupid fast; it's the FS you need to be careful about. (And that's mostly about minimizing syscalls by operating on file handles you open once, rather than on paths.)
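
For example (a trivial illustration of the handle-vs-path point): querying metadata through a handle you already hold avoids a second path resolution:

let file = std::fs::File::open("example.bin").unwrap();
// Handle-based: no extra path lookup.
let len = file.metadata().unwrap().len();
// versus std::fs::metadata("example.bin"), which resolves the path again.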

So far I was running the code from /mnt/c/Users/..../ under WSL 2 and was getting the mentioned 300 MB/s; now I copied the target and the binary file to ~/.../ and I am getting 8-9 GB/s on mmap with xxhash.
So that seems to solve part of the issue.

@jessa0 I will look into mpsc channels and threads for multiple files. Thanks for the hint.
@erelde thanks for mentioning the other repositories, I'll have a look at the codebases and see what I can learn from them.

That's great! Also good to know: code compiles much faster on the Linux FS.

Did you try with BufReader?

That was here: