I've been looking to improve Rust's performance on the Shootout benchmarks, and I'm currently looking at the reverse-complement test. It essentially boils down to reading a large amount of data from stdin and iterating over it backwards. The first thing I've noticed is that Rust is getting creamed, in significant part just because of the time it takes to read in the data. Compare the following programs:
C (cut down from the shootout):
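#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    size_t buflen = 1024, end = 0;
    ssize_t len;
    char *buf = malloc(buflen);
    int in = fileno(stdin);
    /* Read until EOF (0) or error (-1), doubling the buffer each time it fills. */
    while ((len = read(in, buf + end, buflen - end)) > 0) {
        end += (size_t)len;
        /* A short read is treated as end of input (fine for a regular file). */
        if (end < buflen) break;
        buf = realloc(buf, buflen *= 2);
    }
    printf("%zu\n", end);
    return 0;
}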
Rust:
use std::io::Read;

fn main() {
    let mut stdin = std::io::stdin();
    // Start small and let read_to_end grow the vector as needed.
    let mut data = Vec::with_capacity(1024);
    stdin.read_to_end(&mut data).unwrap();
    println!("{}", data.len());
}
With both compiled with optimisations on and run over a 250MB file, the C code runs in 0.16s and the Rust code in 0.5s. If I increase the buffer in the Rust program to 300MB to avoid any reallocations, it still takes 0.36s.
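For reference, the pre-sized variant is just the same program with a larger initial capacity. A minimal sketch, assuming "300MB" means 300 * 1024 * 1024 bytes; the helper name `read_all_presized` is made up for illustration:

```rust
use std::io::{self, Read};

// Hypothetical helper: read everything from `reader` into a Vec whose
// capacity is reserved up front, so read_to_end should not need to
// reallocate as long as the input fits.
fn read_all_presized<R: Read>(mut reader: R, capacity: usize) -> io::Result<Vec<u8>> {
    let mut data = Vec::with_capacity(capacity);
    reader.read_to_end(&mut data)?;
    Ok(data)
}

fn main() -> io::Result<()> {
    // Assumption: "300MB" is taken to mean 300 * 1024 * 1024 bytes.
    let data = read_all_presized(io::stdin(), 300 * 1024 * 1024)?;
    println!("{}", data.len());
    Ok(())
}
```

Taking any `Read` rather than stdin directly makes it easy to run the same code over a file or an in-memory cursor.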
If, on the other hand, I initialise a small buffer early on, and then just overwrite its contents (excuse the poor error handling)...
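use std::io::Read;

fn main() {
    let mut stdin = std::io::stdin();
    // Fixed-size buffer that is overwritten on every read.
    let mut data = [0u8; 100000];
    let mut len = 0;
    // Stops on EOF (n == 0) or, silently, on the first read error.
    while let Ok(n) = stdin.read(&mut data) {
        if n == 0 {
            break;
        }
        len += n;
    }
    println!("{}", len);
}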
...it takes just 0.08s to read the file.
My conclusion from this is that the work needed to zero-initialise the buffer is hurting us. Indeed, if I run the following program, it takes 0.6s to run, without even reading from stdin!
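use std::iter;

fn main() {
    // No stdin involved: just allocate 250MB and fill it with zeros.
    let mut data = Vec::with_capacity(1024 * 1024 * 250);
    // 0u8 makes this a byte buffer; a bare 0 would be inferred as i32.
    data.extend(iter::repeat(0u8).take(1024 * 1024 * 250));
    println!("{}", data.len());
}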
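To separate the cost of the fill from process startup and everything else, the fill can also be timed from inside the program. A rough sketch using `std::time::Instant`; the helper name `timed_zero_fill` is made up for illustration, and the numbers it prints will depend on the machine and compiler version:

```rust
use std::iter;
use std::time::{Duration, Instant};

// Hypothetical helper: build a zero-filled byte buffer with the same
// extend-based approach as above, and report how long the fill took.
fn timed_zero_fill(len: usize) -> (Vec<u8>, Duration) {
    let start = Instant::now();
    let mut data: Vec<u8> = Vec::with_capacity(len);
    data.extend(iter::repeat(0u8).take(len));
    (data, start.elapsed())
}

fn main() {
    // Same size as the program above: 250MB of zero bytes.
    let (data, elapsed) = timed_zero_fill(250 * 1024 * 1024);
    println!("{} bytes zeroed in {:?}", data.len(), elapsed);
}
```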
So, all this said, is there a quick way to read a large buffer from stdin (or from a file)? Reusing the same buffer repeatedly as in my third code block isn't practical here, because I need to work backwards from the end of the file.
If there isn't such a way, is something likely to get added to the stdlib to do so? Or possibly to allocate a large zeroed buffer very fast? It would seem like something that would be a substantial benefit in this case.
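To make the "work backwards" requirement concrete, here is a simplified sketch of the kind of pass that needs the whole input in memory. The names `complement` and `reverse_complement` are illustrative, the table covers only the four plain bases, and the input is assumed to be a bare sequence with no header lines, so this is not the benchmark's actual specification:

```rust
use std::io::{self, Read};

// Illustrative complement table: only A, C, G and T are mapped,
// every other byte is passed through unchanged (a simplification).
fn complement(base: u8) -> u8 {
    match base {
        b'A' => b'T',
        b'T' => b'A',
        b'C' => b'G',
        b'G' => b'C',
        other => other,
    }
}

// Walk the sequence from its last byte to its first, complementing each base.
fn reverse_complement(seq: &[u8]) -> Vec<u8> {
    seq.iter().rev().map(|&b| complement(b)).collect()
}

fn main() -> io::Result<()> {
    // The last input byte is needed first, so everything is read up front.
    let mut data = Vec::new();
    io::stdin().read_to_end(&mut data)?;
    // Assumption: newlines are only line wrapping and can be dropped.
    let seq: Vec<u8> = data.into_iter().filter(|&b| b != b'\n').collect();
    let out = reverse_complement(&seq);
    println!("{}", String::from_utf8_lossy(&out));
    Ok(())
}
```

Because the first output byte depends on the last input byte, a small buffer that gets overwritten on every read can't be used here.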