Read large gzip file and gather lines starting with '@'

use std::{
    collections::HashSet,
    fs::File,
    io::{BufRead, BufReader},
    path::PathBuf,
};

use flate2::read::GzDecoder;
use rayon::{prelude::*, ThreadPoolBuilder};

fn _gather_seq_names_in_fastq(fastq_path: &PathBuf, threads: i32) -> HashSet<String> {
    // Read the fastq.gz file through a gzip decoder with a 10 MiB buffer.
    let file = File::open(fastq_path).unwrap();
    let read_capacity = 10 * 1024_usize.pow(2);
    let reader = BufReader::with_capacity(read_capacity, GzDecoder::new(file));

    let thread_pool = ThreadPoolBuilder::new()
        .num_threads(threads as usize)
        .build()
        .unwrap();

    println!("Inner ThreadPool, num_threads={threads}");
    thread_pool.install(|| {
        // Bridge the sequential line iterator into the rayon pool and keep
        // the first whitespace-separated token of every line starting with '@'.
        reader
            .lines()
            .par_bridge()
            .filter_map(|lr| {
                let l = lr.unwrap();
                if l.starts_with('@') {
                    Some(l.split(' ').next().unwrap().to_owned())
                } else {
                    None
                }
            })
            .collect()
    })
}

Hi.
I'm trying to write a program that reads two large gzip files, records specific lines (those starting with '@') into a HashSet per file, and checks that the two sets are identical.

Each gzip file is about 50 GB.

I wrote two versions of this: A is a single-threaded version, and B is the multithreaded version shown above.

I expected B to be faster than A, but it was slower (B: 22 min, A: 11 min).

In the future the input gzip files will be even larger than they are now, and I want to reduce the running time.

Could you take a look at my code? Thanks.

If your problem is I/O bound, then you aren't going to gain much from trying to parallelize the reading of a single file. Additional locking usually makes matters worse (as in your case). Try reading the two files in parallel instead.
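A minimal sketch of that approach, assuming a sequential per-file pass like your version A; the gather_seq_names helper and the file names are illustrative:

use std::{
    collections::HashSet,
    fs::File,
    io::{BufRead, BufReader},
    path::Path,
};

use flate2::read::GzDecoder;

// Sequential per-file pass, as in version A.
fn gather_seq_names(path: &Path) -> HashSet<String> {
    let file = File::open(path).unwrap();
    let reader = BufReader::with_capacity(10 * 1024 * 1024, GzDecoder::new(file));
    reader
        .lines()
        .filter_map(|lr| {
            let l = lr.unwrap();
            if l.starts_with('@') {
                Some(l.split(' ').next().unwrap().to_owned())
            } else {
                None
            }
        })
        .collect()
}

fn main() {
    // Decompress and scan both files concurrently; each closure owns its
    // own file handle, decoder, and buffer, so nothing is locked or shared.
    let (names_a, names_b) = rayon::join(
        || gather_seq_names(Path::new("a.fastq.gz")),
        || gather_seq_names(Path::new("b.fastq.gz")),
    );
    println!("identical: {}", names_a == names_b);
}

Since each thread does exactly what version A does for one file, the wall-clock time should approach that of the slower of the two files rather than their sum.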

Some sequencers compress fastq files with a specialized gzip encoding that contains multiple "streams". I don't know off the top of my head whether the streams can be addressed in a random-access manner, but if so, you may have more success seeking to pre-specified streams and reading those in parallel. You should still not share a single reader object for this: given that you are only reading the files, you can safely open each one twice and let the OS handle the synchronization (none of which should actually be necessary).
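If the streams do turn out to be individually addressable, the pattern might look roughly like this; the stream offsets are hypothetical placeholders that would have to come from an index of the file, and reads.fastq.gz is an assumed file name:

use std::{
    fs::File,
    io::{BufRead, BufReader, Seek, SeekFrom},
};

use flate2::read::GzDecoder;
use rayon::prelude::*;

fn main() {
    // Hypothetical byte offsets of gzip stream boundaries; real values
    // would have to come from an index of the file.
    let stream_offsets: Vec<u64> = vec![0, 1 << 30, 2 << 30];

    let counts: Vec<usize> = stream_offsets
        .par_iter()
        .map(|&offset| {
            // Each worker opens its own handle, so no reader is shared.
            let mut file = File::open("reads.fastq.gz").unwrap();
            file.seek(SeekFrom::Start(offset)).unwrap();
            // GzDecoder stops at the end of the current gzip stream,
            // which is exactly this worker's chunk.
            let reader = BufReader::new(GzDecoder::new(file));
            reader
                .lines()
                .map_while(Result::ok)
                .filter(|l| l.starts_with('@'))
                .count()
        })
        .collect();

    println!("header lines per stream: {counts:?}");
}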

By the way, I'm pretty sure you should be using an existing, well-optimized tool for this job. See if e.g. rgsam works for you.

