Hi there! I wrote this little program that uses BufReader wrapped around File to count the lines in a file. (The file is the GeoNames US placename database, which is ~270 MB and has ~2.2 million lines.)
use std::fs::File;
use std::io::{BufRead, BufReader};

const DATA_FILE: &str = "dataset/US.txt";

fn main() {
    let file = match File::open(DATA_FILE) {
        // io::Error implements Display, so it can be printed directly
        // (Error::description is deprecated).
        Err(why) => panic!("couldn't open {}: {}", DATA_FILE, why),
        Ok(file) => file,
    };
    let buffered_file = BufReader::new(file);
    let mut lines = 0;
    // Each item is an io::Result<String>; only the count matters here.
    for _line in buffered_file.lines() {
        lines += 1;
    }
    println!("{} lines", lines);
}
It's really, really slow:
$ time ./target/debug/line-counter
2205986 lines
real 0m52.612s
user 0m52.516s
sys 0m0.058s
As a baseline, I wrote a naïve Python implementation, and that's 100x faster:
lines = 0
with open('dataset/US.txt') as myfile:
    for line in myfile:
        lines += 1
print("{} lines".format(lines))
$ time python linecounter.py
2205986 lines
real 0m0.432s
user 0m0.387s
sys 0m0.044s
Is there a bug in BufReader, or am I using it wrong here?
As an experiment, I tried a slightly un-rustic approach:
// Needs `use std::io::Read;` and `file` declared as `let mut file`.
let mut lines = 0;
let mut buf = [0u8; 4096 * 32];
loop {
    let num_bytes = match file.read(&mut buf) {
        Ok(n) => n,
        Err(_) => break, // read errors are silently treated as end of input
    };
    if num_bytes == 0 {
        break; // end of file
    }
    // Count newline bytes in the part of the buffer that was filled.
    for &byte in &buf[..num_bytes] {
        if byte == b'\n' {
            lines += 1;
        }
    }
}
This brings the runtime down to about 0.5 seconds (so roughly twice as fast, probably because it doesn't do any of the UTF-8 handling that String requires). But it's still several times slower than wc -l, which on my machine consistently runs in less than 0.15 seconds.
One major factor here is definitely that optimizations are not turned on: the binary timed above lives in target/debug, so it is an unoptimized build. Build with cargo build --release and time the binary in target/release instead.
However, even with optimizations the Python version is still faster by about a factor of 4 on my system. According to perf, Rust spends about half the time in str::from_utf8() and the other half in main() (presumably in the inlined iteration and counting). So UTF-8 handling is definitely a factor here, and the lines iterator could likely also use some love.
Python apparently pretty much just calls into memchr(), so one could almost argue we are really comparing Rust to C here, though Python does add some overhead.
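For comparison, here is a rough sketch of the same idea in Rust using only the standard library: scan the reader's internal buffer for newline bytes, with no String and no UTF-8 validation. The name count_newlines is made up for this example, and the in-memory Cursor is only there so the snippet runs anywhere; for the real file you would pass BufReader::new(File::open(...)?) instead. Note that this counts '\n' bytes (like wc -l), so a final line without a trailing newline is not counted.

```rust
use std::io::{self, BufRead, Cursor};

/// Hypothetical helper: counts `\n` bytes by scanning the reader's
/// internal buffer directly, without building a `String`.
fn count_newlines<R: BufRead>(mut reader: R) -> io::Result<usize> {
    let mut count = 0;
    loop {
        let consumed = {
            // Borrow whatever the reader currently has buffered.
            let chunk = reader.fill_buf()?;
            if chunk.is_empty() {
                break; // end of input
            }
            count += chunk.iter().filter(|&&b| b == b'\n').count();
            chunk.len()
        };
        // Tell the reader this chunk is done so the next call refills it.
        reader.consume(consumed);
    }
    Ok(count)
}

fn main() -> io::Result<()> {
    // In-memory stand-in for the data file (an assumption for this demo).
    let data = Cursor::new("alpha\nbeta\ngamma\n");
    println!("{} lines", count_newlines(data)?);
    Ok(())
}
```

A dedicated memchr-style search would probably be faster still than the byte-by-byte filter, but this keeps the example dependency-free.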
First of all, yes, doing benchmarks without optimization is meaningless. It's wrong.
As a side note, .lines() allocates a new String for every line, so the cost is not just UTF-8 handling but also one allocation per line. Don't use .lines() unless you need those Strings!
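One way to avoid the per-line String is to read into a single byte buffer that is reused across iterations with BufRead::read_until. This is only a sketch: count_lines is a name invented for the example, and the Cursor stands in for BufReader::new(File::open(...)?) so the snippet is runnable as-is. Like .lines(), it counts a final line that has no trailing newline.

```rust
use std::io::{self, BufRead, Cursor};

/// Hypothetical helper: counts lines while reusing one `Vec<u8>`,
/// so there is no per-line allocation and no UTF-8 validation.
fn count_lines<R: BufRead>(mut reader: R) -> io::Result<usize> {
    let mut buf = Vec::new();
    let mut lines = 0;
    // read_until appends bytes up to and including the delimiter
    // and returns Ok(0) once the input is exhausted.
    while reader.read_until(b'\n', &mut buf)? != 0 {
        lines += 1;
        buf.clear(); // keep the capacity, drop the contents
    }
    Ok(lines)
}

fn main() -> io::Result<()> {
    // In-memory stand-in for the data file (an assumption for this demo).
    let data = Cursor::new("alpha\nbeta\ngamma");
    println!("{} lines", count_lines(data)?);
    Ok(())
}
```

If you do need text rather than bytes, read_line with a reused String works the same way, at the cost of UTF-8 validation on each line.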