Writing to standard output > to file -- should it be so slow or am I missing something?

I am working on my first Rust program, a basic bioinformatics tool that counts k-mers (substrings of length k of biological sequence data) in a FASTA file.

I want to provide readable output that gives the k-mer, its reverse complement, and the number of times the k-mer appears in the FASTA file. I've been writing the output to a .tsv file by running my program from the command line and redirecting with >.

Outputting is by far the slowest part of the program. Can anyone give me pointers on how to write the output more efficiently?

Here is the part of my code that handles printing. The whole project is on GitHub:

use bio::{alphabets::dna::revcomp, io::fasta};
use dashmap::DashMap;
use rayon::{iter::ParallelBridge, prelude::*};
use std::{env, error::Error, fs::File, io::Write, str, time::Instant};
//  ...
//  fasta_hash is a DashMap<&[u8], Vec<u32>>
    fasta_hash.into_iter().par_bridge().for_each(|(k, f)| {
        //  Convert k-mer bytes to str
        let kmer = str::from_utf8(k).unwrap();
        //  Don't write k-mers containing 'N'
        if !kmer.contains('N') {
            //  Use bio (crate) revcomp to get k-mer reverse complement
            let rvc: Vec<u8> = revcomp(k as &[u8]);
            //  Convert revcomp from bytes to str
            let rvc = str::from_utf8(&rvc).unwrap();
          
            let mut lck = stdout_ref.lock();
            //  Write (separated by tabs):
            //        k-mer
            //        reverse complement
            //        frequency across fasta file
            writeln!(&mut lck, "{}\t{}\t{}", kmer, rvc, f.len()).expect("Couldn't write output");
        }
    });

One problem is that Rust's stdout is line-buffered even when it is redirected to a file. This leads to a potentially expensive flush call after every writeln!.

You can work around this by wrapping stdout in a BufWriter. (For your parallel code, this might mean sharing a Mutex<BufWriter<Stdout>> or something similar between threads; a StdoutLock is not Send, so it cannot be the inner writer there.) Or maybe you can change your program to write to a BufWriter<File> instead of stdout.
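
A minimal sketch of the BufWriter idea, using only the standard library. The function name `write_rows` and the `rows` slice are hypothetical stand-ins for the real k-mer data, not part of the original program:

```rust
use std::io::{self, BufWriter, Write};

/// Write one tab-separated row per record to any writer.
/// `rows` is a hypothetical stand-in for (k-mer, reverse complement, count).
fn write_rows<W: Write>(out: W, rows: &[(&str, &str, usize)]) -> io::Result<()> {
    // BufWriter collects the small writes in memory and passes them on in
    // large chunks, instead of flushing after every line.
    let mut buf = BufWriter::new(out);
    for (kmer, rvc, count) in rows {
        writeln!(buf, "{}\t{}\t{}", kmer, rvc, count)?;
    }
    // Flush explicitly so that a write error is reported rather than
    // silently dropped when the BufWriter goes out of scope.
    buf.flush()
}

fn main() -> io::Result<()> {
    let rows = [("ACGT", "ACGT", 3), ("AAAC", "GTTT", 1)];
    // Lock stdout once for the whole run rather than once per line.
    write_rows(io::stdout().lock(), &rows)
}
```

Because `write_rows` accepts any `Write` implementor, the same function would also work with a `File` if the program later writes to a path instead of relying on shell redirection.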

Another potential problem is that string formatting can be slow. Follow that link for some alternatives you can try, if your program is spending a lot of time formatting numbers into strings.
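
One possible way to cut down on formatting work, sketched here as an assumption rather than as what the linked material recommends: since the k-mer and its reverse complement are already bytes, they can be written with `write_all` directly, skipping both the UTF-8 conversion and the `{}` formatting for those fields. The helper name `write_row` is hypothetical:

```rust
use std::io::{self, BufWriter, Write};

/// Write one row using the raw sequence bytes.
/// Only the count goes through the formatting machinery.
fn write_row<W: Write>(out: &mut W, kmer: &[u8], rvc: &[u8], count: usize) -> io::Result<()> {
    out.write_all(kmer)?;
    out.write_all(b"\t")?;
    out.write_all(rvc)?;
    out.write_all(b"\t")?;
    writeln!(out, "{}", count)
}

fn main() -> io::Result<()> {
    // Buffer the output so the many small write_all calls stay cheap.
    let mut out = BufWriter::new(io::stdout().lock());
    write_row(&mut out, b"AAAC", b"GTTT", 1)?;
    out.flush()
}
```

Whether this is measurably faster depends on the data; it is worth profiling before and after.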

Thank you! Those links were really useful. My program is soooo much faster now!

For now, what I've done in the code below, following your suggestions, has sped things up no end and is working for my purposes.

I'm still learning about all the various ways I could've done this, so any further comments on this topic will be really welcome.


    let handle = &std::io::stdout();
    let mut buf = BufWriter::new(handle);
    //  fasta_hash is a DashMap<&[u8], Vec<u32>>
    fasta_hash.into_iter().for_each(|(k, f)| {   // got rid of doing things in parallel 
        //  Convert k-mer bytes to str           // still doing this for readable output
        let kmer = str::from_utf8(k).unwrap();    
        //  Don't write k-mers containing 'N'
        if !k.contains(&b'N') {
            //  Use bio (crate) revcomp to get k-mer reverse complement
            let rvc: Vec<u8> = revcomp(k as &[u8]);
            //  Convert revcomp from bytes to str
            let rvc: &str = str::from_utf8(&rvc).unwrap();
            //  Write (separated by tabs):
            //        k-mer
            //        reverse complement
            //        frequency across fasta file
            writeln!(buf, "{}\t{}\t{}", kmer, rvc, f.len()).expect("Unable to write data");
        }
    });
    buf.flush().unwrap();
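
If the parallel version is wanted back, the Mutex-wrapped BufWriter mentioned earlier in the thread could look roughly like the sketch below. It uses `std::thread::scope` in place of rayon to stay self-contained, and the function name `write_rows_parallel`, the `rows` slice, and the two-way chunking are all hypothetical choices for illustration:

```rust
use std::io::{self, BufWriter, Write};
use std::sync::Mutex;
use std::thread;

/// Format rows on several threads and write them through one shared buffer.
/// The order of rows in the output is not deterministic.
fn write_rows_parallel<W: Write + Send>(
    out: W,
    rows: &[(&str, &str, usize)],
) -> io::Result<()> {
    let shared = Mutex::new(BufWriter::new(out));
    // Split the input into (at most) two chunks, one per worker thread.
    let chunk_len = ((rows.len() + 1) / 2).max(1);
    thread::scope(|s| {
        for chunk in rows.chunks(chunk_len) {
            let shared = &shared;
            s.spawn(move || {
                for (kmer, rvc, count) in chunk {
                    // Build the line without holding the lock ...
                    let line = format!("{}\t{}\t{}\n", kmer, rvc, count);
                    // ... and hold it only for the buffered write.
                    shared
                        .lock()
                        .unwrap()
                        .write_all(line.as_bytes())
                        .expect("Unable to write data");
                }
            });
        }
    });
    // All workers have finished; take the writer back and flush it.
    let mut buf = shared.into_inner().unwrap();
    buf.flush()
}

fn main() -> io::Result<()> {
    let rows = [("ACGT", "ACGT", 3), ("AAAC", "GTTT", 1), ("CCCC", "GGGG", 2)];
    // Pass Stdout itself: a StdoutLock is not Send and cannot cross threads.
    write_rows_parallel(io::stdout(), &rows)
}
```

Since the per-row work here is small, the lock may well dominate, so the single-threaded buffered version above could still be the faster of the two.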
