First, did you compile it in release mode? That can have a big impact on performance: cargo build --release.
Once you've done that, if it's still very slow, here are some steps that could tune this code further.
Buffering output
The clearest way to improve the performance of this code, to me, is in how you handle writing. By calling println! in every loop iteration, you perform a separate write call for each line that you print. Just as you've wrapped your reads in a BufReader, it makes sense to wrap your writes in a BufWriter. The BufWriter collects output in an in-memory buffer and writes it out only when that buffer fills up, plus one final time when it is dropped (at the end of your main function, in this case).
Here I have done that, creating a BufWriter::new(io::stdout()) and then calling the writeln! macro instead of the println! macro.
    use std::io;
    use std::io::BufReader;
    use std::io::BufRead;
    use std::io::BufWriter;
    use std::io::Write;
    use std::fs::File;

    fn main() {
        let f = File::open("data/SRR062634.filt.fastq").unwrap();
        let file = BufReader::new(&f);
        let mut writer = BufWriter::new(io::stdout());
        for (num, line) in file.lines().enumerate() {
            let l = line.unwrap();
            if num % 4 == 0 {
                let chars: String = l.chars().skip(1).collect();
                writeln!(writer, ">{}", chars).unwrap();
            }
            if num % 4 == 1 {
                writeln!(writer, "{}", l).unwrap();
            }
        }
    }
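One caveat about relying on drop for the last write: if that final flush fails, the error is silently discarded. Here is a minimal sketch of flushing explicitly instead. The write_lines helper and its sample input are made up for illustration and are not part of your program.

```rust
use std::io::{self, BufWriter, Write};

// Hypothetical helper for illustration: writes each line through a BufWriter
// and flushes explicitly so that a failed final write is reported.
fn write_lines<W: Write>(target: W, lines: &[&str]) -> io::Result<()> {
    let mut writer = BufWriter::new(target);
    for line in lines {
        writeln!(writer, "{}", line)?;
    }
    // Dropping the BufWriter would also flush, but it would ignore any error.
    writer.flush()
}

fn main() -> io::Result<()> {
    write_lines(io::stdout(), &["first", "second"])
}
```

Taking any Write target rather than stdout directly also makes the helper easy to exercise against an in-memory Vec<u8>.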
Remove an unnecessary allocation
Another thing that is likely to be a performance problem, especially if your lines are very long, is this section here:
    let chars: String = l.chars().skip(1).collect();
    writeln!(writer, ">{}", chars).unwrap();
By creating a new String, you copy all of the chars from the original string (except the first one) into a new buffer. This is a new allocation for every header line in the file, and depending on how long those lines are, possibly a large one. It's also not necessary at all.
Because you know the first character is always an @, you know that the first character always takes up exactly 1 byte. So you can simply slice the string, starting from the second byte:
    writeln!(writer, ">{}", &l[1..]).unwrap();
If your first character could be an arbitrary Unicode character, slicing at byte 1 could land in the middle of a multi-byte character and panic, but there are other ways to get the byte index just past the first char.
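As one possible way to handle that more general case, the sketch below drops the first character whatever its UTF-8 width is, still without allocating. The skip_first_char name and the sample strings are made up for illustration.

```rust
// Hypothetical helper for illustration: returns the string without its first
// character, regardless of how many bytes that character occupies.
fn skip_first_char(s: &str) -> &str {
    let mut chars = s.chars();
    chars.next(); // consume the first char, if there is one
    chars.as_str() // the rest of the original string, borrowed, not copied
}

fn main() {
    // Made-up inputs: a one-byte '@' prefix and a two-byte 'é' prefix.
    println!("{}", skip_first_char("@read1"));
    println!("{}", skip_first_char("éread1"));
}
```

Because the result is a slice of the original string, this keeps the benefit of the change above: no new allocation per line.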
Writing bytes instead of using format strings
In both the writeln! and println! macros, you pass a format string, and the macro then performs string interpolation. You don't need to go through the string interpolation, because what you're printing is already stringified data. You can convert your string to bytes and write those bytes to the writer directly instead.
    use std::io;
    use std::io::BufReader;
    use std::io::BufRead;
    use std::io::BufWriter;
    use std::io::Write;
    use std::fs::File;

    fn main() {
        let f = File::open("data/SRR062634.filt.fastq").unwrap();
        let file = BufReader::new(&f);
        let mut writer = BufWriter::new(io::stdout());
        for (num, line) in file.lines().enumerate() {
            let l = line.unwrap();
            // write_all is used rather than write, because a single write
            // call is allowed to write only part of the buffer.
            if num % 4 == 0 {
                writer.write_all(b">").unwrap();
                writer.write_all(l[1..].as_bytes()).unwrap();
                writer.write_all(b"\n").unwrap();
            }
            if num % 4 == 1 {
                writer.write_all(l.as_bytes()).unwrap();
                writer.write_all(b"\n").unwrap();
            }
        }
    }
If you do this, notice that you have to write the newline character yourself. (By the way, note that b"" is a byte string literal, which means it can contain only ASCII characters and escapes, not arbitrary Unicode characters.) This is much less ergonomic than just using writeln!, and the performance advantage isn't necessarily huge, so I recommend making this change only if performance is still a problem after the other changes.
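If you want to check that a rewrite like this still produces the same output, one option is to pull the loop into a function that is generic over its input and output, so it can run on in-memory data. This is only a sketch: the fastq_to_fasta name and the two-record sample are made up, and it assumes every header line is non-empty and starts with @.

```rust
use std::io::{self, BufRead, Write};

// Hypothetical helper for illustration: the same loop as above, but generic
// over reader and writer so it can be run on in-memory data.
// Assumes every header line is non-empty and starts with '@'.
fn fastq_to_fasta<R: BufRead, W: Write>(input: R, mut output: W) -> io::Result<()> {
    for (num, line) in input.lines().enumerate() {
        let l = line?;
        match num % 4 {
            // Header line: replace the leading '@' with '>'.
            0 => {
                output.write_all(b">")?;
                output.write_all(l[1..].as_bytes())?;
                output.write_all(b"\n")?;
            }
            // Sequence line: copy it through unchanged.
            1 => {
                output.write_all(l.as_bytes())?;
                output.write_all(b"\n")?;
            }
            // Separator and quality lines are dropped.
            _ => {}
        }
    }
    output.flush()
}

fn main() -> io::Result<()> {
    // Made-up two-record input, not taken from the real file.
    let sample = b"@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n";
    fastq_to_fasta(&sample[..], io::stdout())
}
```

In the real program you would pass the BufReader and BufWriter from the code above as the two arguments.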