The following example reads a file in a streaming manner, rather than loading it entirely into memory, and computes its SHA-256 checksum.
use std::{error::Error, fs::File, io::Read};

use sha2::{Digest, Sha256};

fn main() -> Result<(), Box<dyn Error>> {
    let mut reader = File::open("Cargo.toml")?;
    let mut hasher = Sha256::new();
    let mut buffer = [0u8; 8192];
    loop {
        let n = reader.read(&mut buffer)?;
        if n == 0 {
            break;
        }
        hasher.update(&buffer[..n]);
    }
    let checksum = hasher.finalize();
    println!("{:x}", checksum);
    Ok(())
}
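As an aside, std::io::copy drives the same read loop internally and writes into any Write sink; as far as I can tell, sha2's hashers also implement std::io::Write (with the default std feature), so the whole loop could collapse into io::copy(&mut reader, &mut hasher). A std-only sketch of the pattern, with a Vec standing in for the hasher:

```rust
use std::io::{self, Read};

fn main() -> io::Result<()> {
    // Stand-in for File::open("Cargo.toml")?; any Read works the same way.
    let mut reader: &[u8] = b"hello world";

    // Stand-in for the hasher: any Write sink. sha2's Sha256 also
    // implements Write, so the same io::copy line would feed a hasher.
    let mut sink: Vec<u8> = Vec::new();

    // io::copy runs the read/write loop internally with its own buffer.
    let n = io::copy(&mut reader, &mut sink)?;

    assert_eq!(n, 11);
    assert_eq!(sink, b"hello world");
    println!("copied {n} bytes");
    Ok(())
}
```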
Should I use std::io::BufReader instead, like this?
let file = File::open(path)?;
let mut reader = BufReader::new(file);
The buffer in the example above is 8 KiB.
I'm working on Ubuntu, so BufReader::new(file) also uses 8 KiB for its internal buffer: it uses DEFAULT_BUF_SIZE, which std defines as 8 * 1024 on most platforms.
I'm concerned that using two buffers of the same size might introduce unnecessary overhead.
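To see whether the two buffers actually stack, I wrote a quick std-only test (TraceReader is my own throwaway type) that records the read sizes the inner reader sees. If I read the current std implementation correctly, when BufReader's buffer is empty and the caller's buffer is at least as large as its capacity, it bypasses its internal buffer and reads straight into the caller's, so the double copy mostly shouldn't happen for large reads; note this is an implementation detail, not a documented guarantee:

```rust
use std::io::{BufReader, Read, Result};

/// Wraps a reader and records the size of every read() request that
/// actually reaches it, so we can see what BufReader does.
struct TraceReader<R> {
    inner: R,
    calls: Vec<usize>,
}

impl<R: Read> Read for TraceReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> Result<usize> {
        self.calls.push(buf.len());
        self.inner.read(buf)
    }
}

fn main() -> Result<()> {
    let data = vec![0u8; 64];
    let trace = TraceReader { inner: &data[..], calls: Vec::new() };

    // Tiny internal buffer (8 bytes) to make the effect obvious.
    let mut reader = BufReader::with_capacity(8, trace);

    // Our own buffer is larger than BufReader's internal one, so each
    // read should bypass the internal buffer and hit the inner reader
    // directly with the full 16-byte request.
    let mut buf = [0u8; 16];
    while reader.read(&mut buf)? > 0 {}

    let trace = reader.into_inner();
    println!("request sizes seen by the inner reader: {:?}", trace.calls);
    // Expect 16-byte requests, not 8-byte ones.
    assert!(trace.calls.iter().all(|&n| n == 16));
    Ok(())
}
```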
I think modern operating systems cache loaded files in RAM, and I don't know the correct way to write benchmarks that measure I/O performance without being influenced by such caching mechanisms.
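The best I've come up with so far is to warm the cache deliberately and then only compare warm-cache timings against each other. A rough sketch, not a rigorous benchmark (bench_input.tmp is just a hypothetical scratch file):

```rust
use std::fs::File;
use std::io::{Read, Write};
use std::time::{Duration, Instant};

/// Times one full pass over the file and returns (elapsed, bytes read).
fn time_read(path: &str, buf_size: usize) -> (Duration, u64) {
    let mut buffer = vec![0u8; buf_size];
    let mut file = File::open(path).unwrap();
    let mut total = 0u64;
    let start = Instant::now();
    loop {
        let n = file.read(&mut buffer).unwrap();
        if n == 0 {
            break;
        }
        total += n as u64;
    }
    (start.elapsed(), total)
}

fn main() -> std::io::Result<()> {
    let path = "bench_input.tmp";
    File::create(path)?.write_all(&vec![0u8; 4 * 1024 * 1024])?;

    // Warm-up pass: pull the file into the page cache so every timed
    // pass below measures the same warm-cache scenario, rather than
    // mixing one cold read with several warm ones. (Timing the cold
    // case would need the cache dropped, e.g. via /proc/sys/vm/drop_caches
    // on Linux, which requires root.)
    time_read(path, 8192);

    for &size in &[4096, 8192, 65536] {
        let (elapsed, total) = time_read(path, size);
        println!("{size:>6} B buffer: {total} bytes in {elapsed:?}");
    }
    std::fs::remove_file(path)
}
```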
This comes from testing in general, and most recently from my work on ripgrep.
4 KiB and 64 KiB are common cluster sizes for mass storage. A 64 KiB buffer allows the operating system to transfer an entire cluster in one call. The operating system does not need to split a cluster or perform intermediate buffering.
I assume using a cluster sized buffer coupled with something like FILE_FLAG_NO_BUFFERING allows the operating system to do DMA straight into our buffer. That would certainly speed up reads.
As far as buffer size goes, just to add to what's already been said, it depends a lot on the typical file size you're going to be working with and the type of disk. 4 KiB is a typical default for general-purpose disks, while 64 KiB or even 128 KiB is sometimes recommended for media storing very large files.
Also, don't forget that, as you wrote it, your buffer is stored on the stack; that's harmless at 8 KiB, but worth keeping in mind if you go much larger.
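For bigger buffers, a heap allocation sidesteps any stack-size concern. A minimal sketch, with an in-memory slice standing in for the file:

```rust
use std::io::Read;

fn main() -> std::io::Result<()> {
    // 8 KiB on the stack is fine, but a 1 MiB array would eat into
    // smaller thread stacks; vec! puts the buffer on the heap instead.
    let mut buffer = vec![0u8; 64 * 1024];

    // Stand-in for a File; any Read source works the same way.
    let data = vec![1u8; 200_000];
    let mut reader: &[u8] = &data;

    let mut total = 0usize;
    loop {
        let n = reader.read(&mut buffer)?;
        if n == 0 {
            break;
        }
        total += n;
    }
    assert_eq!(total, 200_000);
    println!("read {total} bytes through a heap-allocated 64 KiB buffer");
    Ok(())
}
```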