Optimize CSV reading

Hi everyone,

I have the following code for parsing a tab-delimited file line by line, and I'm convinced it's not the fastest solution I can get:

use csv;
use std::path::Path;
use smartstring::{SmartString, LazyCompact};

#[derive(Debug, serde::Deserialize)]
pub struct BimRecord {
    chr: i32,
    snp: SmartString<LazyCompact>,
    pos: f64,
    bp: u32,
    a1: SmartString<LazyCompact>,
    a2: SmartString<LazyCompact>,
}


fn parse_bim_records<P>(chrom: i32, bim_path: &P) -> csv::Result<(Vec<SmartString<LazyCompact>>, Vec<SmartString<LazyCompact>>, Vec<SmartString<LazyCompact>>)>
    where
        P: AsRef<Path>,
{
    let mut found = false;

    let mut rdr = csv::ReaderBuilder::new()
        .has_headers(false)
        .delimiter(b'\t')
        .quoting(false)
        .from_path(bim_path.as_ref())?;

    let mut snp_vec = Vec::new();
    let mut a1_vec = Vec::new();
    let mut a2_vec = Vec::new();

    for record in rdr.deserialize() {
        let record: BimRecord = record?;

        if record.chr == chrom {
            found = true;
            snp_vec.push(record.snp);
            a1_vec.push(record.a1);
            a2_vec.push(record.a2);
        } else if found {
            break;
        }
    }

    Ok((snp_vec, a1_vec, a2_vec))
}

Can someone please provide suggestions on how the above can be improved to run faster?

Thanks!

Did any of the suggestions in the Performance section of the csv crate's docs help?

In particular, you don't need to deserialize into a BimRecord struct for every row. You should be able to follow the "amortized allocations" suggestion and reuse the same StringRecord for every read, using indexing to get at the snp, a1, and a2 fields directly. Deserializing into a strongly typed value with serde is always going to be more expensive than indexing into a string record.
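
For example, a minimal sketch of that approach for the code above (same six-column layout, and the same assumption that rows for a chromosome are contiguous) could look like this:

use csv::StringRecord;
use smartstring::{LazyCompact, SmartString};
use std::path::Path;

fn parse_bim_records_amortized<P: AsRef<Path>>(
    chrom: i32,
    bim_path: &P,
) -> csv::Result<(Vec<SmartString<LazyCompact>>, Vec<SmartString<LazyCompact>>, Vec<SmartString<LazyCompact>>)> {
    let mut rdr = csv::ReaderBuilder::new()
        .has_headers(false)
        .delimiter(b'\t')
        .quoting(false)
        .from_path(bim_path.as_ref())?;

    let (mut snp_vec, mut a1_vec, mut a2_vec) = (Vec::new(), Vec::new(), Vec::new());
    let mut found = false;

    // One StringRecord is allocated up front and reused for every row,
    // instead of deserializing each row into a fresh BimRecord.
    let mut record = StringRecord::new();
    while rdr.read_record(&mut record)? {
        // Field 0 is chr; it is the only field that needs numeric parsing.
        let chr: i32 = match record.get(0).and_then(|s| s.parse().ok()) {
            Some(c) => c,
            None => continue, // skip malformed rows
        };
        if chr == chrom {
            found = true;
            // Fields 1, 4, 5 are snp, a1, a2 in the layout above.
            snp_vec.push(SmartString::from(record.get(1).unwrap_or("")));
            a1_vec.push(SmartString::from(record.get(4).unwrap_or("")));
            a2_vec.push(SmartString::from(record.get(5).unwrap_or("")));
        } else if found {
            break; // rows for the requested chromosome have ended
        }
    }

    Ok((snp_vec, a1_vec, a2_vec))
}

If that still isn't fast enough, ByteRecord and read_byte_record go one step further and skip UTF-8 validation as well.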

I do enjoy tweaking code to get the most out of a processor. But reading from disk is going to be one to three orders of magnitude slower than parsing. I suspect your time is better spent tuning the read buffer size (2 MiB is a good starting choice in my experience) and introducing async, overlapped, or threaded I/O.
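
To make the threaded-I/O idea concrete, here is one rough sketch (ChannelReader and spawn_file_reader are names made up for illustration): a background thread reads 2 MiB chunks from the file and hands them to the parser through a bounded channel, so disk reads overlap with parsing.

use std::io::{self, Read};
use std::sync::mpsc::{sync_channel, Receiver};
use std::thread;

// A Read impl that pulls byte chunks from a channel filled by a reader thread.
struct ChannelReader {
    rx: Receiver<io::Result<Vec<u8>>>,
    chunk: Vec<u8>,
    pos: usize,
}

impl Read for ChannelReader {
    fn read(&mut self, out: &mut [u8]) -> io::Result<usize> {
        // Refill from the channel once the current chunk is exhausted.
        while self.pos >= self.chunk.len() {
            match self.rx.recv() {
                Ok(Ok(chunk)) => {
                    self.chunk = chunk;
                    self.pos = 0;
                }
                Ok(Err(e)) => return Err(e),
                Err(_) => return Ok(0), // reader thread is done: EOF
            }
        }
        let n = out.len().min(self.chunk.len() - self.pos);
        out[..n].copy_from_slice(&self.chunk[self.pos..self.pos + n]);
        self.pos += n;
        Ok(n)
    }
}

fn spawn_file_reader(path: std::path::PathBuf) -> ChannelReader {
    // A small bound keeps a few chunks in flight and applies backpressure
    // if parsing falls behind the disk.
    let (tx, rx) = sync_channel::<io::Result<Vec<u8>>>(4);
    thread::spawn(move || {
        let mut file = match std::fs::File::open(path) {
            Ok(f) => f,
            Err(e) => {
                let _ = tx.send(Err(e));
                return;
            }
        };
        loop {
            let mut chunk = vec![0u8; 2 * 1024 * 1024]; // 2 MiB read size
            match file.read(&mut chunk) {
                Ok(0) => break, // end of file
                Ok(n) => {
                    chunk.truncate(n);
                    if tx.send(Ok(chunk)).is_err() {
                        break; // the parsing side hung up
                    }
                }
                Err(e) => {
                    let _ = tx.send(Err(e));
                    break;
                }
            }
        }
    });
    ChannelReader { rx, chunk: Vec::new(), pos: 0 }
}

The parsing side then uses from_reader(spawn_file_reader(bim_path.as_ref().to_path_buf())) instead of from_path. Whether this actually wins anything depends on the disk and the file size, so measure before keeping the extra complexity.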

It looks like the default read buffer for csv::Reader is rather small. Increasing it just to the disk cluster size (64 KiB is common) should make a significant impact.
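
For what it's worth, csv's ReaderBuilder exposes that knob as buffer_capacity, so trying a bigger value is a one-line change in the reader setup above (64 KiB here is just the cluster-size guess from this post):

let mut rdr = csv::ReaderBuilder::new()
    .has_headers(false)
    .delimiter(b'\t')
    .quoting(false)
    .buffer_capacity(64 * 1024) // size of the reader's internal buffer, in bytes
    .from_path(bim_path.as_ref())?;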

Strange the read buffer size wasn't mentioned.


I didn't mention it because:

  1. It usually doesn't matter much.
  2. The default is going to be good enough for most cases. (64 KB may indeed be a better choice, and the default should probably be switched to it.)
  3. If you're reading from anything other than an HDD, CSV parsing will be your bottleneck, not I/O. (simdcsv is perhaps the one exception I know of.)

Yeah. That has not been my experience. As you probably suspect.

Well, except when reading from a network connection, which also benefits from a tuned buffer size and overlapped I/O. :wink:

In any case, hopefully @lln will make progress and follow up with what worked and what didn't.


Yes, the suggestions are helpful. Of the performance suggestions, amortizing allocations is the one I'm working on incorporating next.

