Parsing a tab-delimited file line-by-line

Hi everyone,

I am trying to parse a tab-delimited file (21,283,105 lines) line by line, apply a filter, and build a HashMap. My Rust solution (release mode) takes ~9.812s, while the Python version takes ~13.069s, and the small gap is making me doubt my Rust solution (I'm new to the language).

Here's the code
Rust

use csv;
use std::path::Path;
use std::collections::HashMap;

#[derive(Debug, serde::Deserialize)]
struct BimRecord {
    chr: i32,
    snp: String,
    pos: f64,
    bp: u32,
    a1: String, // alleles are String and not char because of INDELs
    a2: String,
}

fn read_bim<P>(chrom: i32, bim_path: P) -> csv::Result<Vec<BimRecord>>
    where
        P: AsRef<Path>,
{
    let records = csv::ReaderBuilder::new()
        .has_headers(false)
        .delimiter(b'\t')
        .from_path(bim_path.as_ref())?
        .deserialize()
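        // Err entries are kept (map_or defaults to true) so collect() can
        // surface them; Ok records are kept only on the requested chromosome.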
        .filter(|record: &csv::Result<BimRecord>| {
            record.as_ref().map_or(true, |r| r.chr == chrom)
        })
        .collect::<csv::Result<Vec<BimRecord>>>()?;

    Ok(records)
}


pub fn bim_hash<P>(chrom: i32, bim_path: P) -> HashMap<String, Vec<String>>
    where
        P: AsRef<Path>,
{
    let bim = read_bim(chrom, bim_path);
    let cont = bim.as_ref().unwrap().len();

    println!("The bim has {} variants", cont);

    let mut snp_vec = vec!["0".to_string(); cont];
    let mut a1_vec = vec!["0".to_string(); cont];
    let mut a2_vec = vec!["0".to_string(); cont];

    let mut num = 0;
    for rec in bim.unwrap(){
        snp_vec[num] = rec.snp;
        a1_vec[num] = rec.a1;
        a2_vec[num] = rec.a2;
        num += 1;
    }

    assert_eq!(snp_vec.len(), cont);
    assert_eq!(a1_vec.len(), cont);
    assert_eq!(a2_vec.len(), cont);

    let bim_hash = HashMap::from([
        ("SNP".to_string(), snp_vec),
        ("A1".to_string(), a1_vec),
        ("A2".to_string(), a2_vec),
    ]);

    bim_hash
}

fn main() {
    let bim = bim_hash(1, "/path/to//input.bim"); //1,653,512 lines are on chr 1
}

Python

def parse_bim(bim_file, chrom):
    bim_dict = {'SNP':[], 'A1':[], 'A2':[]}
    with open(bim_file) as ff:
        for line in ff:
            ll = line.strip().split()
            if int(ll[0]) == chrom:
                bim_dict['SNP'].append(ll[1])
                bim_dict['A1'].append(ll[4])
                bim_dict['A2'].append(ll[5])


    return bim_dict

bim = parse_bim("/path/to//input.bim", 1)

Here's what the file looks like:

22	rs9605903	0	17054720	C	T
22	rs5746647	0	17057138	G	T
22	rs5747999	0	17075353	C	A
22	rs2845380	0	17203103	A	G
22	rs2247281	0	17211075	G	A
22	rs2845346	0	17214252	C	T
22	rs2845347	0	17214669	C	T
22	rs1807512	0	17221495	C	T
22	rs5748593	0	17227461	T	C
22	rs9606468	0	17273728	C	T

Any help on how I can optimize my solution and write it better would be appreciated. Thanks!

Showing what typical lines look like would help.

Note that those initializations allocate 3 × cont Strings on the heap, only to replace them all in the for loop.

One simple change (that I think would be more like the append in the python) would then be to do something like this:

    let mut snp_vec = vec![];
    let mut a1_vec = vec![];
    let mut a2_vec = vec![];

    for rec in bim.unwrap(){
        snp_vec.push(rec.snp);
        a1_vec.push(rec.a1);
        a2_vec.push(rec.a2);
    }

Thanks! I've incorporated that. I was a bit hesitant to use push() because I wasn't sure which is faster in Rust: pre-allocating a vector of known length or growing one of unknown length. But using push() does seem to have made some difference.

I don't really know what an INDEL is, but if these are usually just a single base, then you could avoid a bunch of allocations for all these single-character strings.

For example, you could store it as

enum Allele {
    T,
    C,
    A,
    G,
    Indel(String)
}

and thus usually not need a string at all.
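The conversion could be as simple as this (just a sketch; parse_allele is a name I made up):

// Hypothetical helper: only indels pay for a String allocation.
fn parse_allele(s: &str) -> Allele {
    match s {
        "T" => Allele::T,
        "C" => Allele::C,
        "A" => Allele::A,
        "G" => Allele::G,
        other => Allele::Indel(other.to_string()),
    }
}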

Or even go further and store the two of them together, assuming they're usually the normal base pairs. But that'd be a bigger change.


You can modify @scottmcm's first solution to replace vec![] with Vec::with_capacity(cont) in the initialization to address that concern.
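That is:

    let mut snp_vec = Vec::with_capacity(cont);
    let mut a1_vec = Vec::with_capacity(cont);
    let mut a2_vec = Vec::with_capacity(cont);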


Are you using the pos field? It is declared f64, yet I don't see any floating-point values in the file. Also, processing the file as a byte array could be faster: only the indels need to end up as string data, and every String creation pays a UTF-8 validity check.
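For example, a rough sketch of the byte-array idea (the function name and shape are mine, untested on your data):

use std::fs::File;
use std::io::{BufRead, BufReader};

// Split each raw line on tab bytes; fields we only inspect never become
// Strings, and kept fields pay the UTF-8 conversion exactly once.
fn snp_ids_for_chrom(path: &str, chrom: &[u8]) -> std::io::Result<Vec<String>> {
    let mut reader = BufReader::new(File::open(path)?);
    let mut line = Vec::new();
    let mut snps = Vec::new();
    while reader.read_until(b'\n', &mut line)? > 0 {
        let mut fields = line.split(|&b| b == b'\t');
        if fields.next() == Some(chrom) {
            if let Some(snp) = fields.next() {
                snps.push(String::from_utf8_lossy(snp).into_owned());
            }
        }
        line.clear();
    }
    Ok(snps)
}

You'd call it as snp_ids_for_chrom("/path/to/input.bim", b"1").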


I would also check whether it isn't just reading the file that takes so long. Genomes have SNPs all over the place, so these files tend to be multi-GB. You are using a dedicated CSV reader in Rust, which is probably doing a lot of work under the hood in order to handle all sorts of general cases. In contrast, you are simply iterating over lines and splitting them on a tab character in Python. You should consider doing the same in Rust, using a buffered reader. (I'm not even sure if the CSV parser you are using provides buffering internally. If there is a setting for that, turn it on explicitly.)


I'm realizing now that the example data has pos as integers. It can be either an int or a float, and I wasn't sure how to represent that in a struct.

An INDEL in my case is just multiple bases, so something like ATG, AT, CTGTC, etc.

Yeah, the documentation says "The CSV reader does buffering internally". I will check how a plain buffered reader performs. Thanks!

I have tried the following

use std::fs::File;
use std::io::{prelude::*, BufReader};
use std::collections::HashMap;
use std::error::Error;


pub fn bim_hash(chrom: i32, bim_path: &str) -> Result<HashMap<String,Vec<String>>, Box<dyn Error>> {
    let file = File::open(bim_path)?;
    let reader = BufReader::new(file);

    let mut snp_vec = vec![];
    let mut a1_vec = vec![];
    let mut a2_vec = vec![];

    for line in reader.lines() {
        let x = line?;
        let ll = x.split('\t').collect::<Vec<&str>>();

        if ll[0].parse::<i32>().unwrap() == chrom {
            snp_vec.push(ll[1].to_string());
            a1_vec.push(ll[4].to_string());
            a2_vec.push(ll[5].to_string());
        }
    }

    let bim_hash = HashMap::from([
        ("SNP".to_string(), snp_vec),
        ("A1".to_string(), a1_vec),
        ("A2".to_string(), a2_vec),
    ]);


    Ok(bim_hash)

}

fn main() {
    let bim = bim_hash(1, "/path/tofile.bim");

}

and it takes almost the same time to run as the Python code, which is making me doubt my implementation.

You're probably running in debug mode; try cargo run --release.

You could avoid that temporary Vec allocation by using the next and nth iterator methods. Since they consume items from the iterator, you'd have to adjust the indices a bit:

let mut ll = x.split('\t');
let chr = ll.next();  // field 0: chromosome
let snp = ll.next();  // field 1: SNP id
let a1 = ll.nth(2);   // skips pos and bp, yields field 4
let a2 = ll.next();   // field 5

What about something like this:

use std::collections::HashMap;
use std::error::Error;
use std::fs::File;
use std::io::{BufRead, BufReader};

#[derive(Hash, Debug, Eq, PartialEq)]
pub enum Variants {
    A1,
    A2,
    SNP,
}

impl Variants {
    fn from_str(input: &str) -> Self {
        match input {
            "A1" => Self::A1,
            "A2" => Self::A2,
            "SNP" => Self::SNP,
            _ => panic!("error code"),
        }
    }
}

pub fn bim_hash(
    chrom: i32,
    bim_path: &str,
) -> Result<HashMap<Variants, Vec<Variants>>, Box<dyn Error>> {
    let file = File::open(bim_path)?;
    let reader = BufReader::new(file);

    let mut snp_vec = vec![];
    let mut a1_vec = vec![];
    let mut a2_vec = vec![];

    for line in reader.lines() {
        let x = line?;
        let ll = (x.as_str()).split("\t").collect::<Vec<&str>>();

        if ll[0].parse::<i32>().unwrap() == chrom {
            snp_vec.push(Variants::from_str(ll[1]));
            a1_vec.push(Variants::from_str(ll[4]));
            a2_vec.push(Variants::from_str(ll[5]));
        }
    }

    let bim_hash = HashMap::from([
        (Variants::SNP, snp_vec),
        (Variants::A1, a1_vec),
        (Variants::A2, a2_vec),
    ]);

    Ok(bim_hash)
}

I am actually running in release mode.

I will try using the next and nth methods, thanks!

I think this is about 2x faster. It mainly avoids creating String objects when they aren't needed. I bet you can speed it up a bit more if you can live with Vec<u8> instead of String.

use std::error::Error;
use std::fs::File;
use std::io::{BufRead, BufReader, ErrorKind};

// Byte offsets/lengths of each field within the current line buffer.
// (This struct wasn't shown above, so including it for completeness.)
#[derive(Default)]
struct BimRecord {
    chroffset: usize, chrlen: usize,
    snpoffset: usize, snplen: usize,
    posoffset: usize, poslen: usize,
    bpoffset: usize, bplen: usize,
    a1offset: usize, a1len: usize,
    a2offset: usize, a2len: usize,
}

// `Allele` is the enum suggested earlier in the thread.
pub fn bim_hash2(
    chrom: i32,
    bim_path: &str,
) -> Result<(Vec<String>, Vec<Allele>, Vec<Allele>), Box<dyn Error>> {
    let file = File::open(bim_path)?;
    let mut reader = BufReader::with_capacity(1024 * 128, file);
    let mut snp_vec: Vec<String> = Vec::with_capacity(500_000);
    let mut a1_vec: Vec<Allele> = Vec::with_capacity(500_000);
    let mut a2_vec: Vec<Allele> = Vec::with_capacity(500_000);
    let mut linebuf: Vec<u8> = Vec::with_capacity(256);
    loop {
        linebuf.clear();
        let num_bytes = reader.read_until(b'\n', &mut linebuf)?;
        if num_bytes == 0 {
            break;
        }
        // One pass over the raw bytes: record where each tab-separated
        // field starts and how long it is, with no allocation at all.
        let mut r = BimRecord::default();
        let mut fieldno = 0usize;
        let mut offset = 0usize;
        for uc in linebuf.iter() {
            offset += 1;
            if *uc == b'\t' {
                fieldno += 1;
                match fieldno {
                    1 => r.snpoffset = offset,
                    2 => r.posoffset = offset,
                    3 => r.bpoffset = offset,
                    4 => r.a1offset = offset,
                    5 => r.a2offset = offset,
                    _ => {}
                }
            } else if *uc != b'\n' {
                match fieldno {
                    0 => r.chrlen += 1,
                    1 => r.snplen += 1,
                    2 => r.poslen += 1,
                    3 => r.bplen += 1,
                    4 => r.a1len += 1,
                    5 => r.a2len += 1,
                    _ => {}
                }
            }
        }
        let chrslice = &linebuf[r.chroffset..r.chroffset + r.chrlen];
        let chr_num: i32 = String::from_utf8_lossy(chrslice).parse()?;
        if chr_num == chrom {
            let snpslice = &linebuf[r.snpoffset..r.snpoffset + r.snplen];
            let snp = String::from_utf8_lossy(snpslice).to_string();
            // Single bases map straight to enum variants; only indels
            // (length > 1) pay for a String allocation.
            let a1allele = match r.a1len {
                1 => match linebuf[r.a1offset] {
                    b'T' => Ok(Allele::T),
                    b'C' => Ok(Allele::C),
                    b'A' => Ok(Allele::A),
                    b'G' => Ok(Allele::G),
                    _ => Err(std::io::Error::new(ErrorKind::InvalidData, "bad allele")),
                },
                _ => {
                    let a1slice = &linebuf[r.a1offset..r.a1offset + r.a1len];
                    Ok(Allele::Indel(String::from_utf8_lossy(a1slice).to_string()))
                }
            }?;
            let a2allele = match r.a2len {
                1 => match linebuf[r.a2offset] {
                    b'T' => Ok(Allele::T),
                    b'C' => Ok(Allele::C),
                    b'A' => Ok(Allele::A),
                    b'G' => Ok(Allele::G),
                    _ => Err(std::io::Error::new(ErrorKind::InvalidData, "bad allele")),
                },
                _ => {
                    let a2slice = &linebuf[r.a2offset..r.a2offset + r.a2len];
                    Ok(Allele::Indel(String::from_utf8_lossy(a2slice).to_string()))
                }
            }?;
            snp_vec.push(snp);
            a1_vec.push(a1allele);
            a2_vec.push(a2allele);
        }
    }
    Ok((snp_vec, a1_vec, a2_vec))
}
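For reference, calling it would look something like this (the path is a placeholder):

fn main() -> Result<(), Box<dyn Error>> {
    let (snps, _a1s, _a2s) = bim_hash2(1, "/path/to/input.bim")?;
    println!("{} variants on chr 1", snps.len());
    Ok(())
}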