Testing IO and formatting performance

Using @vitalyd's suggestion of a straight replacement results in...

test tests::convert_file_test ... bench: 301,735,429 ns/iter (+/- 33,628,944)

But his suggestion to separate the file IO from the processing puts things in a different light:

use std::fs::File;
use std::io::{Error, Read, Write};
use std::path::Path;

fn convert_file(filename: &Path) -> Result<(),Error> {
    let output_file = filename.with_extension("csv");
    let mut input = File::open(&filename)?;
    let mut buf = Vec::with_capacity(100_000);
    input.read_to_end(&mut buf)?;
    let mut outfile = File::create(output_file)?;
    outfile.write_all(&buf)?;
    Ok(())
}

which benches at:
test tests::convert_file_test ... bench: 229,081,335 ns/iter (+/- 23,722,839)

And, for comparison, the C/C++ version:

λ .\StrReplaceCC.exe new.txt
Starting with std::string
CPP Runtime in mSecs: 0
Starting with C stdlib functions
C Runtime in mSecs: 0
Total Runtimes
CPP with std::string: 0.000000
C with stdlib: 0.000000
Press return to continue

The runtime is too small to measure at millisecond resolution.

So that suggests I was focusing on the wrong thing. The Rust code is as good as, if not better than, the C; the file IO is where the issue lies. This is really useful for me, because if the task is IO bound I can present it in a positive light: I get the safety and robustness at zero cost.

I would be curious to know the reason for the file IO discrepancy, but that is perhaps a topic for another thread.

I just tried this locally:

// Uses BufWriter for writing
test tests::read_file_buf     ... bench: 144,907,250 ns/iter (+/- 50,086,575)
// Same as above but overrides buf size to 1MB
test tests::read_file_buf_1MB ... bench: 139,750,095 ns/iter (+/- 5,630,681)
// Uses my original suggestion of extend()'ing the output vec
test tests::read_file_extend  ... bench: 183,862,886 ns/iter (+/- 4,007,337)
// Uses replacement loop with input and output Vec buffers
test tests::read_file_replace ... bench: 112,261,435 ns/iter (+/- 2,968,478)
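
For context, here is a minimal sketch of the kind of replacement loop the last variant uses. The exact code wasn't posted, so the function name, the buffer sizing, and the details of the b',' to b",\n" substitution are assumptions:

use std::fs::File;
use std::io::{Error, Read, Write};
use std::path::Path;

// Sketch: read the whole input into one Vec, then copy it into an
// output Vec, expanding every b',' into b",\n" along the way.
fn convert_file_replace(filename: &Path) -> Result<(),Error> {
    let mut input = File::open(filename)?;
    let mut inbuf = Vec::new();
    input.read_to_end(&mut inbuf)?;

    // Reserve a bit more than the input size to absorb the added newlines.
    let mut outbuf = Vec::with_capacity(inbuf.len() + inbuf.len() / 16);
    for &byte in &inbuf {
        if byte == b',' {
            outbuf.extend_from_slice(b",\n");
        } else {
            outbuf.push(byte);
        }
    }

    File::create(filename.with_extension("csv"))?.write_all(&outbuf)
}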

The problem, of course, is that including I/O here is going to add substantial noise.

Yeah, this can be any number of things: buffer sizes, which syscalls are used, whether any of these writes may trigger a foreground page cache writeback (on Linux, say), any noisy I/O neighbors running at the same time, etc. Benchmarking I/O is its own kind of beast :slight_smile:.
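
One way to cut some of that noise is to hoist the read out of the measured loop and benchmark only the in-memory processing. A minimal sketch (process_only is a hypothetical bench name; it assumes the test::Bencher harness plus std::fs::File and std::io::Read in scope):

#[bench]
fn process_only(b: &mut Bencher) {
    // Read once, outside the measured loop, so only the in-memory
    // comma -> ",\n" processing is timed, not the I/O.
    let mut inbuf = Vec::new();
    File::open("new.txt").unwrap().read_to_end(&mut inbuf).unwrap();

    b.iter(|| {
        let mut outbuf = Vec::with_capacity(inbuf.len() + inbuf.len() / 16);
        for email in inbuf.split(|&ch| ch == b',') {
            outbuf.extend_from_slice(email);
            outbuf.extend_from_slice(b",\n");
        }
        // Return the buffer so the optimizer can't discard the work.
        outbuf
    });
}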


I got a ~10% performance boost (170ms -> 150ms) by replacing the initial file read with an mmap call, via the memmap crate for portability:

#![feature(test)]

extern crate memmap;

use memmap::Mmap;
use std::path::Path;
use std::fs::File;
use std::io::{Error, Write};
use std::env;

fn convert_file(filename: &Path) -> Result<(),Error> {
    // Load the input data
    let input_file = File::open(&filename)?;
    let input = unsafe { Mmap::map(&input_file) }?;

    // Generate and write the output data
    let mut output_file = File::create(filename.with_extension("csv"))?;
    let capacity = ((input.len() as f32) * 1.05) as usize;
    let mut outbuf = Vec::with_capacity(capacity);
    for email in input.split(|&ch| ch == b',') {
        outbuf.extend(email);
        outbuf.extend(b",\n");
    }
    output_file.write_all(&outbuf)?;

    Ok(())
}

fn main() {
    let args: Vec<String> = env::args().collect();
    if args.len() < 2 {
        println!("No filename given");
        return;
    }
    convert_file(Path::new(&args[1])).unwrap();
}

#[cfg(test)]
mod tests {
    use super::*;
    extern crate test;
    use self::test::Bencher;
    #[bench]
    fn read_file(b: &mut Bencher) {
        b.iter(|| {
            let _ = convert_file(Path::new(&"new.txt"));
            0
        });
    }
}

Multiple caveats apply, however:

  • As you can see, memory-mapping a file is considered unsafe by the memmap crate. The documentation is not very clear about why, but I suspect it is because it is very easy to violate Rust's memory safety guarantees if the underlying file is modified while it is mapped (by another process, or by mapping it multiple times).
  • The performance characteristics of memory-mapped files can vary a lot from one system to another.

EDIT: Also, whether a newline is inserted after the comma makes a big difference in my measurements (150ms with a newline, 130ms without), so you really want to include it in order to be fair to the C/C++ versions.

Speaking of which, I managed to shave off some extra cycles by rolling my own BufWriter ^^'

This version, using a regular BufWriter, is ~10% slower (back at 170ms) than the "generate full output and write" version above for any buffer capacity. I suspect the issue is that BufWriter introduces too many conditionals into the tight inner loop...

use std::io::BufWriter; // in addition to the earlier imports

fn convert_file(filename: &Path) -> Result<(),Error> {
    // Load the input data
    let input_file = File::open(&filename)?;
    let input = unsafe { Mmap::map(&input_file) }?;

    // Generate and write the output data
    let output_file = File::create(filename.with_extension("csv"))?;
    let mut output = BufWriter::with_capacity(400_000, output_file);
    for email in input.split(|&ch| ch == b',') {
        output.write_all(email)?;
        output.write_all(b",\n")?;
    }
    output.flush()?;

    Ok(())
}

...but this version, where I essentially roll my own BufWriter, can be up to ~10% faster (~137ms) than the "write everything at once" version:

fn convert_file(filename: &Path) -> Result<(),Error> {
    // Load the input data
    let input_file = File::open(&filename)?;
    let input = unsafe { Mmap::map(&input_file) }?;

    // Generate and write the output data
    const CAPACITY: usize = 400_000;
    const MAX_LEN: usize = CAPACITY - 1024;
    let mut output_file = File::create(filename.with_extension("csv"))?;
    let mut outbuf = Vec::with_capacity(CAPACITY);
    for email in input.split(|&ch| ch == b',') {
        outbuf.extend(email);
        outbuf.extend(b",\n");
        if outbuf.len() >= MAX_LEN {
            output_file.write_all(&outbuf)?;
            outbuf.clear();
        }
    }
    output_file.write_all(&outbuf)?;

    Ok(())
}

EDIT 2: One thing that might be worth pondering is that processing 67MB of input data and writing it back to mass storage in 137ms amounts on average to ~500 MB/s in each direction (67 MB / 0.137 s ≈ 490 MB/s), or ~1 GB/s of total data traffic.

My local SSD isn't that fast (more like ~250MB/s reads and ~120MB/s writes according to a quick dd bench), so the only reason I can reach this speed is that the Linux kernel is caching files in RAM across benchmark runs behind my back. Making this faster would likely entail tuning for the implementation specifics of the Linux disk cache in a fashion that may not be directly portable to other operating systems.
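
For cold-cache numbers, one option on Linux is to drop the page cache between runs. A minimal sketch, assuming root privileges (writing "3" to /proc/sys/vm/drop_caches is the standard kernel interface for dropping the page cache plus dentries and inodes):

use std::fs::OpenOptions;
use std::io::Write;

// Linux-only: asks the kernel to drop clean cached pages, dentries and
// inodes. Requires root, and you should `sync` first so that dirty
// pages are flushed before the cache is dropped.
fn drop_page_cache() -> std::io::Result<()> {
    OpenOptions::new()
        .write(true)
        .open("/proc/sys/vm/drop_caches")?
        .write_all(b"3")
}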

Which raises the question: is this "everything is in cache" scenario a realistic benchmark for your target workload? Or have we already reached the point where a real-world job would be mostly limited by HDD or SSD I/O?


Just as a matter of interest, have you turned on LTO and single codegen unit for your benchmark and release builds?

[profile.release]
lto = true
codegen-units = 1

[profile.bench]
lto = true
codegen-units = 1

Also (and this may not make any difference, but may be worth checking) have you tried explicitly inlining your functions with #[inline]?
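
For example, the hot inner step could be factored out and annotated; push_email is a hypothetical helper for illustration, not something from the code above:

// Hypothetical helper: #[inline] lets the compiler inline this across
// codegen units (it may well do so anyway once LTO is enabled).
#[inline]
fn push_email(outbuf: &mut Vec<u8>, email: &[u8]) {
    outbuf.extend_from_slice(email);
    outbuf.extend_from_slice(b",\n");
}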