I am working on my first Rust program, a basic bioinformatics tool that counts k-mers (substrings of length k of biological sequence data) across the sequences in a FASTA file.
I want to provide readable output that gives the k-mer, the reverse complement of the k-mer, and the number of times the k-mer appears in the FASTA file. I've been outputting to a .tsv file by simply running my program from the command line and redirecting stdout with >.
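To make the intended output concrete, here is a minimal std-only sketch of what one row should contain. It is not the code from my project: `reverse_complement`, `count_kmers` and `tsv_row` are hypothetical helpers written just for this illustration (the real program uses the bio crate's `revcomp` and a DashMap instead).

```rust
use std::collections::HashMap;

/// Reverse complement of a DNA sequence.
/// Hypothetical stand-in for bio's `revcomp`, for illustration only.
fn reverse_complement(seq: &[u8]) -> Vec<u8> {
    seq.iter()
        .rev()
        .map(|&b| match b {
            b'A' => b'T',
            b'C' => b'G',
            b'G' => b'C',
            b'T' => b'A',
            other => other, // leave 'N' and anything else unchanged
        })
        .collect()
}

/// Count every k-mer in `seq` with a sliding window.
/// Note: `windows` panics if `k` is 0, so callers must pass k >= 1.
fn count_kmers(seq: &[u8], k: usize) -> HashMap<&[u8], usize> {
    let mut counts = HashMap::new();
    for kmer in seq.windows(k) {
        *counts.entry(kmer).or_insert(0) += 1;
    }
    counts
}

/// Format one output row: k-mer, reverse complement, count, separated by tabs.
fn tsv_row(kmer: &[u8], count: usize) -> String {
    format!(
        "{}\t{}\t{}",
        String::from_utf8_lossy(kmer),
        String::from_utf8_lossy(&reverse_complement(kmer)),
        count
    )
}

fn main() {
    // Row order is unspecified because HashMap iteration order is arbitrary.
    let counts = count_kmers(b"ACGTAC", 2);
    for (kmer, n) in &counts {
        println!("{}", tsv_row(kmer, *n));
    }
}
```

For the sequence ACGTAC and k = 2, the 2-mer AC occurs twice and its reverse complement is GT, so its row holds AC, GT and 2 separated by tabs.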
Outputting is by far the slowest part of the program. Can anyone give me pointers on how to write the output more efficiently?
Here is the part of my code that handles printing. The whole project is on GitHub:
use bio::{alphabets::dna::revcomp, io::fasta};
use dashmap::DashMap;
use rayon::{iter::ParallelBridge, prelude::*};
use std::{env, error::Error, fs::File, io::Write, str, time::Instant};
// ...
// fasta_hash is a DashMap<&[u8], Vec<u32>>
fasta_hash.into_iter().par_bridge().for_each(|(k, f)| {
    // Convert k-mer bytes to str
    let kmer = str::from_utf8(k).unwrap();
    // Don't write k-mers containing 'N'
    if !kmer.contains('N') {
        // Use bio (crate) revcomp to get k-mer reverse complement
        let rvc: Vec<u8> = revcomp(k);
        // Convert revcomp from bytes to str
        let rvc = str::from_utf8(&rvc).unwrap();
        // Lock stdout for this one line (every worker thread does this per k-mer)
        let mut lck = stdout_ref.lock();
        // Write (separated by tabs):
        //   k-mer
        //   reverse complement
        //   frequency across fasta file
        writeln!(&mut lck, "{}\t{}\t{}", kmer, rvc, f.len()).expect("Couldn't write output");
    }
});
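One direction I have been wondering about, shown only as an untested-in-my-project sketch: the code above takes the stdout lock once per k-mer, and as far as I understand Rust's stdout is line-buffered, so every `writeln!` may also be flushed on its own. Taking the lock once and wrapping it in `std::io::BufWriter` might avoid both costs. The helper `write_rows` and the `(String, String, usize)` row layout are hypothetical and not part of my program. The sketch is also sequential, so it assumes the rows were computed beforehand.

```rust
use std::io::{self, BufWriter, Write};

/// Write one TSV row per (k-mer, reverse complement, count) triple.
/// Hypothetical helper: generic over `Write`, so the same code can target
/// a locked stdout, a File, or an in-memory Vec<u8>.
fn write_rows<W: Write>(out: W, rows: &[(String, String, usize)]) -> io::Result<()> {
    // Collect many small writes into larger chunks before they reach `out`.
    let mut buf = BufWriter::new(out);
    for (kmer, rvc, count) in rows {
        writeln!(buf, "{}\t{}\t{}", kmer, rvc, count)?;
    }
    // Flush explicitly so write errors are reported instead of being lost on drop.
    buf.flush()
}

fn main() -> io::Result<()> {
    let rows: Vec<(String, String, usize)> = vec![
        ("AC".to_string(), "GT".to_string(), 2),
        ("CG".to_string(), "CG".to_string(), 1),
    ];
    // Take the stdout lock once for the whole run rather than once per row.
    let stdout = io::stdout();
    write_rows(stdout.lock(), &rows)
}
```

I don't know yet whether this is actually faster for my data, or how best to combine it with the parallel iteration, which is part of what I am asking.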