Should I use io::BufReader when streaming a file?

The following example reads a file in a streaming manner, rather than loading it entirely into memory, and computes its SHA-256 checksum.

use std::{error::Error, fs::File, io::Read};

use sha2::{Digest, Sha256};

fn main() -> Result<(), Box<dyn Error>> {
    let mut reader = File::open("Cargo.toml")?;
    let mut hasher = Sha256::new();
    let mut buffer = [0u8; 8192];

    loop {
        let n = reader.read(&mut buffer)?;
        if n == 0 {
            break;
        }
        hasher.update(&buffer[..n]);
    }

    let checksum = hasher.finalize();
    println!("{:x}", checksum);

    Ok(())
}

Should I use std::io::BufReader like this?

let file = File::open(path)?;
let mut reader = BufReader::new(file);

The buffer in the example above is 8 KiB.
I'm working on Ubuntu, so BufReader::new(file) also uses the same size for its inner buffer, since it uses DEFAULT_BUF_SIZE, which is defined as follows:

pub const DEFAULT_BUF_SIZE: usize = if cfg!(target_os = "espidf") { 512 } else { 8 * 1024 };

I'm concerned that using two buffers of the same size might introduce unnecessary overhead.

I think modern operating systems cache loaded files in RAM, and I don't know the correct way to write benchmarks that measure I/O performance without being influenced by such caching mechanisms.
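One common mitigation is to benchmark under deliberately warm-cache conditions: do one untimed pass first so the page cache state is the same for every configuration you compare. Here is a minimal, stdlib-only sketch along those lines (the temp file name and sizes are made up for illustration; the real code would hash with sha2 instead of just counting bytes):

```rust
use std::fs::File;
use std::io::{Read, Write};
use std::time::Instant;

fn read_all(path: &std::path::Path, buffer: &mut [u8]) -> std::io::Result<usize> {
    let mut file = File::open(path)?;
    let mut total = 0;
    loop {
        let n = file.read(buffer)?;
        if n == 0 {
            return Ok(total);
        }
        total += n;
    }
}

fn main() -> std::io::Result<()> {
    // Hypothetical test file: 8 MiB of zeros written to a temp path.
    let path = std::env::temp_dir().join("bufsize_bench.bin");
    File::create(&path)?.write_all(&vec![0u8; 8 * 1024 * 1024])?;

    for &size in &[4 * 1024, 8 * 1024, 64 * 1024] {
        let mut buffer = vec![0u8; size];
        // One untimed pass warms the page cache, so the timed pass
        // below measures every buffer size under the same conditions.
        read_all(&path, &mut buffer)?;
        let start = Instant::now();
        let total = read_all(&path, &mut buffer)?;
        println!("{size:>6} B buffer: read {total} bytes in {:?}", start.elapsed());
    }
    std::fs::remove_file(&path)?;
    Ok(())
}
```

This only isolates syscall and copy overhead, not cold-storage throughput; measuring true cold reads needs cache-dropping or O_DIRECT-style tricks that are OS-specific.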

This is redundant: you are already using a buffer, so essentially you have implemented your own "unrolled" BufReader.

BufReader is mainly used to reduce IO system call overhead when the data access pattern is in small pieces, such as parsing or deserialization.
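For example (a sketch with a made-up temp file): parsing line by line issues many tiny reads, and BufReader serves most of them from its internal buffer instead of making a syscall for each one.

```rust
use std::fs::File;
use std::io::{BufRead, BufReader, Write};

fn main() -> std::io::Result<()> {
    // Hypothetical input file with many short lines.
    let path = std::env::temp_dir().join("lines.txt");
    File::create(&path)?.write_all(b"alpha\nbeta\ngamma\n")?;

    // Each `lines()` iteration reads only a handful of bytes. Without
    // BufReader, every one of those small reads would be a syscall;
    // with it, most are served from the 8 KiB internal buffer.
    let reader = BufReader::new(File::open(&path)?);
    let count = reader.lines().count();
    println!("{count} lines");

    std::fs::remove_file(&path)?;
    Ok(())
}
```

In your checksum loop the access pattern is already large sequential reads, so there is nothing for BufReader to amortize.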


Instead of using a buffer of an arbitrary fixed size, you can also directly stream from the reader to the writer:

use std::{error::Error, fs::File, io::copy};

use sha2::{Digest, Sha256};

fn main() -> Result<(), Box<dyn Error>> {
    let mut reader = File::open("Cargo.toml")?;
    let mut hasher = Sha256::new();
    copy(&mut reader, &mut hasher)?;
    let checksum = hasher.finalize();
    println!("{:x}", checksum);
    Ok(())
}
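This works because the hasher can act as the write side of io::copy; the same pattern streams into any io::Write sink. A stdlib-only sketch with a hypothetical byte-counting sink in place of the hasher:

```rust
use std::io::{self, Write};

// Hypothetical sink: counts bytes instead of hashing them.
struct ByteCounter(u64);

impl Write for ByteCounter {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.0 += buf.len() as u64;
        Ok(buf.len())
    }
    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

fn main() -> io::Result<()> {
    let mut reader: &[u8] = b"hello, world"; // any Read works here
    let mut counter = ByteCounter(0);
    io::copy(&mut reader, &mut counter)?;
    println!("{} bytes", counter.0); // prints "12 bytes"
    Ok(())
}
```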

64 KiB is a better choice.

This bare assertion warrants an explanation.


Based on testing in general, and most recently on work with ripgrep.

4 KiB and 64 KiB are common cluster sizes for mass storage. A 64 KiB buffer allows the operating system to transfer an entire cluster in one call. The operating system does not need to split a cluster or perform intermediate buffering.
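Applied to the original loop, switching to 64 KiB is a one-line change. A sketch (made-up temp file standing in for the real data; the hashing step is left as a comment):

```rust
use std::fs::File;
use std::io::{Read, Write};

fn main() -> std::io::Result<()> {
    // Hypothetical input file standing in for the real data.
    let path = std::env::temp_dir().join("cluster_read.bin");
    File::create(&path)?.write_all(&vec![7u8; 100_000])?;

    let mut reader = File::open(&path)?;
    // 64 KiB matches a common cluster size, so each read() can map to
    // a whole-cluster transfer with no intermediate splitting.
    let mut buffer = [0u8; 64 * 1024];
    let mut total = 0;
    loop {
        let n = reader.read(&mut buffer)?;
        if n == 0 {
            break;
        }
        total += n; // feed &buffer[..n] to the hasher here
    }
    println!("read {total} bytes");
    std::fs::remove_file(&path)?;
    Ok(())
}
```

If you do want BufReader for other reasons, BufReader::with_capacity(64 * 1024, file) gives the same control over its inner buffer size.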

I assume using a cluster sized buffer coupled with something like FILE_FLAG_NO_BUFFERING allows the operating system to do DMA straight into our buffer. That would certainly speed up reads.


As buffer size goes, just to add to what's already been said, it depends a lot on the typical file size you're going to use and the type of disk. 4 KiB is a typical default for general-purpose disks, while 64 KiB or even 128 KiB is sometimes recommended for media storing very large files.

Also, don't forget that, as you wrote it, your buffer is stored on the stack.
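If you move to larger buffers, allocating on the heap sidesteps any stack-size concern. A minimal sketch:

```rust
fn main() {
    // Heap-allocated buffer: the Vec's storage lives on the heap, so
    // even a multi-megabyte buffer adds nothing to the stack frame.
    let mut buffer = vec![0u8; 64 * 1024];
    // &mut buffer[..] coerces to &mut [u8], so reader.read(&mut buffer)
    // works exactly as it does with the stack array.
    buffer[0] = 1;
    assert_eq!(buffer.len(), 64 * 1024);
}
```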
