I got a ~10% performance boost (170 ms -> 150 ms) by replacing the initial file read with an mmap call, going through the memmap crate for portability:
#![feature(test)]

extern crate memmap;

use memmap::Mmap;
use std::path::Path;
use std::fs::File;
use std::io::{Error, Write};
use std::env;

fn convert_file(filename: &Path) -> Result<(), Error> {
    // Load the input data
    let input_file = File::open(&filename)?;
    let input = unsafe { Mmap::map(&input_file) }?;

    // Generate and write the output data
    let mut output_file = File::create(filename.with_extension("csv"))?;
    let capacity = ((input.len() as f32) * 1.05) as usize;
    let mut outbuf = Vec::with_capacity(capacity);
    for email in input.split(|&ch| ch == b',') {
        outbuf.extend(email);
        outbuf.extend(b",\n");
    }
    output_file.write_all(&outbuf)?;
    Ok(())
}

fn main() {
    let args: Vec<String> = env::args().collect();
    if args.len() < 2 {
        println!("No filename given");
        return;
    }
    convert_file(Path::new(&args[1])).unwrap();
}

#[cfg(test)]
mod tests {
    use super::*;
    extern crate test;
    use self::test::Bencher;

    #[bench]
    fn read_file(b: &mut Bencher) {
        b.iter(|| {
            let _ = convert_file(Path::new("new.txt"));
            0
        });
    }
}
Multiple caveats apply, however:
- As you can see, memory-mapping a file is considered unsafe by the memmap crate. The documentation is not very clear about why, but I suspect it is because it is very easy to violate Rust's memory safety guarantees when a given file is mapped multiple times (see the sketch right after this list).
- The performance characteristics of memory-mapped files can vary a lot from one system to another.
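To make the first caveat concrete, here is a minimal sketch of the kind of aliasing that unsafe is guarding against. The scratch file "demo.txt" and the whole scenario are made up for illustration; the point is just that two mappings of the same file share the same memory, so one of them can silently mutate bytes that the other exposes behind a plain &[u8]:

extern crate memmap;

use memmap::{Mmap, MmapMut};
use std::fs::OpenOptions;
use std::io::Error;

fn main() -> Result<(), Error> {
    // Hypothetical scratch file, just for the demonstration
    std::fs::write("demo.txt", b"hello")?;
    let file = OpenOptions::new().read(true).write(true).open("demo.txt")?;

    // Read-only view: from Rust's point of view, this &[u8] is immutable...
    let ro = unsafe { Mmap::map(&file) }?;
    let view: &[u8] = &ro;

    // ...but a second, mutable mapping of the same file aliases the same pages.
    let mut rw = unsafe { MmapMut::map_mut(&file) }?;
    rw[0] = b'H';

    // `view` may now observe the modified byte, violating the usual guarantee
    // that data behind a shared reference never changes under your feet.
    println!("{:?}", &view[..5]);
    Ok(())
}

An external process writing to the file while it is mapped causes the same problem, which is why the crate cannot make this operation safe on its own.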
EDIT: Also, whether a newline is inserted after each comma makes a big difference in my measurements (150 ms with the newline, 130 ms without), so you really want to include it in order to stay fair to the C and C++ versions.
Speaking of which, I managed to shave off some extra cycles by rolling my own BufWriter ^^'
This version, which uses a regular BufWriter, is ~10% slower (back at 170 ms) than the "generate the full output, then write it" version above, regardless of the buffer capacity I pick. I suspect the issue is that BufWriter introduces too many conditionals in the tight inner loop...
use std::io::BufWriter; // in addition to the imports from the first version

fn convert_file(filename: &Path) -> Result<(), Error> {
    // Load the input data
    let input_file = File::open(&filename)?;
    let input = unsafe { Mmap::map(&input_file) }?;

    // Generate and write the output data through a 400 kB BufWriter
    let output_file = File::create(filename.with_extension("csv"))?;
    let mut output = BufWriter::with_capacity(400_000, output_file);
    for email in input.split(|&ch| ch == b',') {
        output.write_all(email)?;
        output.write_all(b",\n")?;
    }
    output.flush()?;
    Ok(())
}
...but this version, where I essentially roll my own BufWriter, can be up to ~10% faster (~137ms) than the "write everything at once" version:
fn convert_file(filename: &Path) -> Result<(), Error> {
    // Load the input data
    let input_file = File::open(&filename)?;
    let input = unsafe { Mmap::map(&input_file) }?;

    // Generate and write the output data, flushing manually. MAX_LEN leaves
    // ~1 KiB of headroom so that appending one more record never makes the
    // Vec reallocate (assuming individual emails are shorter than that).
    const CAPACITY: usize = 400_000;
    const MAX_LEN: usize = CAPACITY - 1024;
    let mut output_file = File::create(filename.with_extension("csv"))?;
    let mut outbuf = Vec::with_capacity(CAPACITY);
    for email in input.split(|&ch| ch == b',') {
        outbuf.extend(email);
        outbuf.extend(b",\n");
        if outbuf.len() >= MAX_LEN {
            output_file.write_all(&outbuf)?;
            outbuf.clear();
        }
    }
    output_file.write_all(&outbuf)?;
    Ok(())
}
EDIT 2: One thing that might be worth pondering is that reading 67 MB of input data and writing the result back to mass storage in 137 ms amounts to an average storage throughput of ~500 MB/s in each direction, i.e. an average total data traffic of ~1 GB/s.
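For reference, a quick back-of-the-envelope check of those figures (the sizes are approximate, and I'm reusing the 5% output-size growth estimate from the first version above):

fn main() {
    let read_mb = 67.0;            // input size in MB (approximate)
    let written_mb = 67.0 * 1.05;  // output ≈ input plus one '\n' per record
    let seconds = 0.137;           // measured wall-clock time

    println!("read:  ~{:.0} MB/s", read_mb / seconds);                // ~490 MB/s
    println!("write: ~{:.0} MB/s", written_mb / seconds);             // ~510 MB/s
    println!("total: ~{:.0} MB/s", (read_mb + written_mb) / seconds); // ~1000 MB/s
}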
My local SSD isn't that fast (more like ~250 MB/s reads and ~120 MB/s writes according to a quick dd bench), so the only reason I can reach this speed is that the Linux kernel is caching files in RAM across benchmark runs behind my back. Making this faster would likely mean tuning for the implementation specifics of the Linux disk cache, in a fashion that may not be directly portable to other operating systems.
Which raises the question: is this "everything is in cache" scenario a realistic benchmark for your target workload? Or have we already reached the point where a real-world job would be mostly limited by HDD or SSD I/O?