Optimize CSV reading

Hi everyone,

I have the following code for parsing a tab-delimited file line by line, and I'm convinced it's not the fastest solution I can get:

use csv;
use std::path::Path;
use smartstring::{SmartString, LazyCompact};

#[derive(Debug, serde::Deserialize)]
pub struct BimRecord {
    chr: i32,
    snp: SmartString<LazyCompact>,
    pos: f64,
    bp: u32,
    a1: SmartString<LazyCompact>,
    a2: SmartString<LazyCompact>,
}


fn parse_bim_records<P>(chrom: i32, bim_path: &P) -> csv::Result<(Vec<SmartString<LazyCompact>>, Vec<SmartString<LazyCompact>>, Vec<SmartString<LazyCompact>>)>
    where
        P: AsRef<Path>,
{
    let mut found = false;

    let mut rdr = csv::ReaderBuilder::new()
        .has_headers(false)
        .delimiter(b'\t')
        .quoting(false)
        .from_path(bim_path.as_ref())?;

    let mut snp_vec = Vec::new();
    let mut a1_vec = Vec::new();
    let mut a2_vec = Vec::new();

    for record in rdr.deserialize() {
        let record: BimRecord = record?;

        if record.chr == chrom {
            found = true;
            snp_vec.push(record.snp);
            a1_vec.push(record.a1);
            a2_vec.push(record.a2);
        } else if found {
            break;
        }
    }

    Ok((snp_vec, a1_vec, a2_vec))
}

Can someone please provide suggestions on how the above can be improved to run faster?

Thanks!

Did any of the suggestions in the Performance section of the csv crate's docs help?

In particular, you don't need to deserialize into a BimRecord struct for every row. You should be able to follow the "amortized allocations" suggestion and reuse the same StringRecord for every read, using indexing to get at the snp, a1, and a2 fields directly. Deserializing into a strongly typed value with serde is always going to be more expensive than indexing into a string record.
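
For example, a minimal sketch of that approach for the code above (same six-column layout, and the same assumption that rows for a chromosome are contiguous) could look like this:

use csv::StringRecord;
use smartstring::{LazyCompact, SmartString};
use std::path::Path;

fn parse_bim_records_amortized<P: AsRef<Path>>(
    chrom: i32,
    bim_path: &P,
) -> csv::Result<(Vec<SmartString<LazyCompact>>, Vec<SmartString<LazyCompact>>, Vec<SmartString<LazyCompact>>)> {
    let mut rdr = csv::ReaderBuilder::new()
        .has_headers(false)
        .delimiter(b'\t')
        .quoting(false)
        .from_path(bim_path.as_ref())?;

    let (mut snp_vec, mut a1_vec, mut a2_vec) = (Vec::new(), Vec::new(), Vec::new());
    let mut found = false;

    // One StringRecord is allocated up front and reused for every row,
    // instead of deserializing each row into a fresh BimRecord.
    let mut record = StringRecord::new();
    while rdr.read_record(&mut record)? {
        // Field 0 is chr; it is the only field that needs numeric parsing.
        let chr: i32 = match record.get(0).and_then(|s| s.parse().ok()) {
            Some(c) => c,
            None => continue, // skip malformed rows
        };
        if chr == chrom {
            found = true;
            // Fields 1, 4, 5 are snp, a1, a2 in the layout above.
            snp_vec.push(SmartString::from(record.get(1).unwrap_or("")));
            a1_vec.push(SmartString::from(record.get(4).unwrap_or("")));
            a2_vec.push(SmartString::from(record.get(5).unwrap_or("")));
        } else if found {
            break; // rows for the requested chromosome have ended
        }
    }

    Ok((snp_vec, a1_vec, a2_vec))
}

If that still isn't fast enough, ByteRecord and read_byte_record go one step further and skip UTF-8 validation as well.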

I do enjoy tweaking code to get the most out of a processor. But reading from disk is going to be one to three orders of magnitude slower than parsing. I suspect your time is better spent tuning the read buffer size (2 MiB is a good starting choice in my experience) and introducing async, overlapped, or threaded I/O.
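
To make the threaded-I/O idea concrete, here is one rough sketch (ChannelReader and spawn_file_reader are names made up for illustration): a background thread reads 2 MiB chunks from the file and hands them to the parser through a bounded channel, so disk reads overlap with parsing.

use std::io::{self, Read};
use std::sync::mpsc::{sync_channel, Receiver};
use std::thread;

// A Read impl that pulls byte chunks from a channel filled by a reader thread.
struct ChannelReader {
    rx: Receiver<io::Result<Vec<u8>>>,
    chunk: Vec<u8>,
    pos: usize,
}

impl Read for ChannelReader {
    fn read(&mut self, out: &mut [u8]) -> io::Result<usize> {
        // Refill from the channel once the current chunk is exhausted.
        while self.pos >= self.chunk.len() {
            match self.rx.recv() {
                Ok(Ok(chunk)) => {
                    self.chunk = chunk;
                    self.pos = 0;
                }
                Ok(Err(e)) => return Err(e),
                Err(_) => return Ok(0), // reader thread is done: EOF
            }
        }
        let n = out.len().min(self.chunk.len() - self.pos);
        out[..n].copy_from_slice(&self.chunk[self.pos..self.pos + n]);
        self.pos += n;
        Ok(n)
    }
}

fn spawn_file_reader(path: std::path::PathBuf) -> ChannelReader {
    // A small bound keeps a few chunks in flight and applies backpressure
    // if parsing falls behind the disk.
    let (tx, rx) = sync_channel::<io::Result<Vec<u8>>>(4);
    thread::spawn(move || {
        let mut file = match std::fs::File::open(path) {
            Ok(f) => f,
            Err(e) => {
                let _ = tx.send(Err(e));
                return;
            }
        };
        loop {
            let mut chunk = vec![0u8; 2 * 1024 * 1024]; // 2 MiB read size
            match file.read(&mut chunk) {
                Ok(0) => break, // end of file
                Ok(n) => {
                    chunk.truncate(n);
                    if tx.send(Ok(chunk)).is_err() {
                        break; // the parsing side hung up
                    }
                }
                Err(e) => {
                    let _ = tx.send(Err(e));
                    break;
                }
            }
        }
    });
    ChannelReader { rx, chunk: Vec::new(), pos: 0 }
}

The parsing side then uses from_reader(spawn_file_reader(bim_path.as_ref().to_path_buf())) instead of from_path. Whether this actually wins anything depends on the disk and the file size, so measure before keeping the extra complexity.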

It looks like the default read buffer for csv::Reader is rather small. Increasing it just to the disk cluster size (64 KiB is common) should make a significant impact.
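
For what it's worth, csv's ReaderBuilder exposes that knob as buffer_capacity, so trying a bigger value is a one-line change in the reader setup above (64 KiB here is just the cluster-size guess from this post):

let mut rdr = csv::ReaderBuilder::new()
    .has_headers(false)
    .delimiter(b'\t')
    .quoting(false)
    .buffer_capacity(64 * 1024) // size of the reader's internal buffer, in bytes
    .from_path(bim_path.as_ref())?;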

Strange the read buffer size wasn't mentioned.


I didn't mention it because:

  1. It usually doesn't matter much.
  2. The default is going to be good enough for most cases. (64 KB may indeed be a better choice, and the default should probably be switched to it.)
  3. If you're reading from anything other than an HDD, CSV parsing will be your bottleneck, not I/O. (simdcsv is perhaps the one exception I know of.)

Yeah. That has not been my experience. As you probably suspect.

Well, except when reading from a network connection, which also benefits from a tuned buffer size and overlapped I/O. :wink:

In any case, hopefully @lln will make progress and follow up with what worked and what didn't.


Yes, the suggestions are helpful. Of the performance suggestions, amortizing allocations is the one I'm working on incorporating next.

