I am a beginner trying to write a Rust program that compares two folders (possibly on different machines, think a file backup/Dropbox-like program) using hashes.
For this I want a reasonably (ideally: very) fast hashing algorithm that produces a unique ID for a file on disk.
I have looked at seahash as well as xxhash_rust, which both claim to be very fast (5+ GB/s), but in my benchmarks I only see around 500 MB/s (that includes the read time as well, so it's hard to compare the values directly).
When I benchmark my hard drive (an NVMe SSD), I get around 2-3 GB/s read speed (I use WSL 2 under Windows 11), so I am wondering whether something is wrong with my code.
MWE
A minimal working example:
First, create a random file of 3 GB: head -c 3G </dev/urandom > example.bin
Note that for xxh3 we need to add the crate with cargo add xxhash-rust --features xxh3,const_xxh3.
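The dependency section of my Cargo.toml then ends up looking roughly like this (the exact version numbers are whatever cargo add picked at the time, so treat them as placeholders):

[dependencies]
seahash = "4"
xxhash-rust = { version = "0.8", features = ["xxh3", "const_xxh3"] }

This is my code so far: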
use std::time::Instant;
use std::io::Read;
use seahash;
use xxhash_rust;

fn hash_file_seahash(filepath: String) -> u64 {
    let start = Instant::now();
    let mut file = std::fs::File::open(filepath).unwrap();
    // Read the whole file into memory, then hash the buffer.
    let mut buffer = Vec::new();
    let _ = file.read_to_end(&mut buffer);
    let res = seahash::hash(&buffer);
    let num_bytes = buffer.len();
    let dur = start.elapsed().as_secs_f64();
    let num_bytes = (num_bytes as f64) / 1_048_576f64;
    println!("Time hash_file_seahash(): in {:.2}s at {:.2} MB/s",
        dur, num_bytes / dur);
    res
}

fn hash_file_xxhash(filepath: String) -> u64 {
    let start = Instant::now();
    let mut file = std::fs::File::open(filepath).unwrap();
    // Same as above, but with the one-shot xxh3 hash.
    let mut buffer = Vec::new();
    let _ = file.read_to_end(&mut buffer);
    let res = xxhash_rust::xxh3::xxh3_64(&buffer);
    let num_bytes = buffer.len();
    let dur = start.elapsed().as_secs_f64();
    let num_bytes = (num_bytes as f64) / 1_048_576f64;
    println!("Time hash_file_xxhash(): in {:.2}s at {:.2} MB/s",
        dur, num_bytes / dur);
    res
}

fn main() {
    let filepath = String::from("example.bin");
    println!("File: {} ({:.2} MB)",
        filepath, std::fs::metadata(&filepath).unwrap().len() as f64 / 1_048_576f64);

    let seahash = hash_file_seahash(filepath.clone());
    println!("seahash: {}", seahash);
    println!();

    let xxhash = hash_file_xxhash(filepath.clone());
    println!("xxhash: {}", xxhash);
}
Then I build and run the program with cargo build --release && target/release/problem and get the following output:
File: example.bin (3072.00 MB)
Time hash_file_seahash(): in 6.11s at 502.45 MB/s
seahash: 7840645262702756540
Time hash_file_xxhash(): in 5.56s at 552.35 MB/s
xxhash: 4352335203902048791
Question
How can I make this faster? When I measure only the reading of the bytes and leave out the hashing, I still see just around 750 MB/s (I would expect ~3 GB/s from my SSD).
I have also tried reading the file with std::fs::read instead of read_to_end(), but that didn't really help much.
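For reference, the read-only measurement using std::fs::read looked roughly like this (a sketch of the approach, not the exact code I timed; the read_only helper name is just for illustration):

use std::time::Instant;

// Sketch: time only the file read, with no hashing at all.
fn read_only(filepath: &str) -> usize {
    let start = Instant::now();
    // std::fs::read allocates a Vec sized from the file metadata
    // and fills it (internally via read_to_end).
    let buffer = std::fs::read(filepath).unwrap();
    let dur = start.elapsed().as_secs_f64();
    let mb = buffer.len() as f64 / 1_048_576f64;
    println!("read only: {:.2}s at {:.2} MB/s", dur, mb / dur);
    buffer.len()
}

Measuring roughly this way is how I got the ~750 MB/s figure mentioned above.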