Beginner: need help optimizing nested parallel iterators

Hi, I'm pretty new to Rust and wanted to write a project to get my hands dirty with the language. I'm writing a bioinformatics tool that uses a multiple sequence alignment file to calculate a pairwise distance metric between the sequences. My entire repository is here for reference: https://github.com/theabhirath/pairsnp-rs (pairwise SNP distance matrices from multiple sequence alignments, written in Rust).

But I'm running into an issue when I profile this code: one of my functions is much slower than equivalent C++ code parallelized with MPI. I've added the code for this function below:

```rust
use rayon::prelude::*;
use roaring::RoaringBitmap;

fn calculate_pairwise_snp_distances(
    a_snps: &[RoaringBitmap],
    c_snps: &[RoaringBitmap],
    g_snps: &[RoaringBitmap],
    t_snps: &[RoaringBitmap],
    nseqs: usize,
    seq_length: u64,
) -> Vec<Vec<u64>> {
    (0..nseqs)
        .into_par_iter()
        .map(|i| {
            (i + 1..nseqs)
                .into_par_iter()
                .map(|j| {
                    // positions where sequences i and j share the same base
                    let mut res = &a_snps[i] & &a_snps[j];
                    res |= &c_snps[i] & &c_snps[j];
                    res |= &g_snps[i] & &g_snps[j];
                    res |= &t_snps[i] & &t_snps[j];
                    // every other position counts towards the SNP distance
                    seq_length - res.len()
                })
                .collect()
        })
        .collect()
}
```

I'm using roaring-rs for fast Roaring bitmaps and rayon for parallelization via into_par_iter, but when I profile this function I see that most of the time is spent waiting and extending the result vector. Is there a more efficient way to write this sort of parallel code in Rust? Any help optimizing the performance of this function would be appreciated!

The main thing: don't parallelize the inner iterator. If the outer iterator has enough elements to keep all threads busy, there's no advantage to parallelizing further.
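
For example, here's a minimal sketch of your function with only the outer loop parallel; the signature and body are the same as yours, only the inner into_par_iter() is dropped:

```rust
use rayon::prelude::*;
use roaring::RoaringBitmap;

fn calculate_pairwise_snp_distances(
    a_snps: &[RoaringBitmap],
    c_snps: &[RoaringBitmap],
    g_snps: &[RoaringBitmap],
    t_snps: &[RoaringBitmap],
    nseqs: usize,
    seq_length: u64,
) -> Vec<Vec<u64>> {
    (0..nseqs)
        .into_par_iter() // rows are distributed across threads...
        .map(|i| {
            (i + 1..nseqs) // ...but each row is computed serially
                .map(|j| {
                    let mut res = &a_snps[i] & &a_snps[j];
                    res |= &c_snps[i] & &c_snps[j];
                    res |= &g_snps[i] & &g_snps[j];
                    res |= &t_snps[i] & &t_snps[j];
                    seq_length - res.len()
                })
                .collect() // plain serial collect into Vec<u64>
        })
        .collect()
}
```

Each row's Vec<u64> is now built by an ordinary serial iterator, so there is no splittable machinery or cross-thread extending of the result vector inside a row.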

Even when the inner iteration never actually runs in parallel, it has still been compiled in a way that makes it possible to split it into smaller pieces and move them to another thread, and making that transfer possible has some overhead. (Most rayon parallelism works this way: each job is written in a splittable form, and whenever a worker thread finds itself idle, it looks for some already-running job and says “give me your second half that you haven’t gotten to yet”.)
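
To make that concrete, here is roughly that splittable shape written by hand with rayon::join, the primitive rayon's iterators are built on; the 1024 cutoff is an arbitrary number for illustration:

```rust
use rayon::join;

// A job written in a splittable way: either recursive half can be
// stolen by an idle worker thread, or both run on the current one.
fn par_sum(slice: &[u64]) -> u64 {
    if slice.len() <= 1024 {
        return slice.iter().sum(); // small enough: just run serially
    }
    let (lo, hi) = slice.split_at(slice.len() / 2);
    let (a, b) = join(|| par_sum(lo), || par_sum(hi));
    a + b
}
```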

In some cases, when the shape of the data is not predictable and the outer loop might have few items, it can make sense to keep both levels parallel but use .par_chunks() for the inner loop, so that you get the advantages of simple serial processing (better throughput per core) while still letting large data be parallelized.
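
A sketch of what that could look like for the inner loop here. Since par_chunks is a slice method, this assumes you first materialize the inner index range into a Vec, and the chunk size of 64 is just a placeholder to tune; row_distances is a hypothetical helper, not part of your code:

```rust
use rayon::prelude::*;
use roaring::RoaringBitmap;

// Hypothetical helper computing one row of the distance matrix.
fn row_distances(
    i: usize,
    a_snps: &[RoaringBitmap],
    c_snps: &[RoaringBitmap],
    g_snps: &[RoaringBitmap],
    t_snps: &[RoaringBitmap],
    nseqs: usize,
    seq_length: u64,
) -> Vec<u64> {
    let js: Vec<usize> = (i + 1..nseqs).collect();
    js.par_chunks(64) // whole chunks can be stolen by other threads...
        .flat_map_iter(|chunk| {
            // ...but within a chunk this is a plain serial iterator
            chunk.iter().map(move |&j| {
                let mut res = &a_snps[i] & &a_snps[j];
                res |= &c_snps[i] & &c_snps[j];
                res |= &g_snps[i] & &g_snps[j];
                res |= &t_snps[i] & &t_snps[j];
                seq_length - res.len()
            })
        })
        .collect()
}
```

Each stolen unit then does a chunk's worth of comparisons serially, and collecting into a Vec still preserves the original order of the distances.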