Parsing a tab-delimited file line-by-line

Hi everyone,

I am trying to parse a tab-delimited file (21,283,105 lines) line by line, apply a filter, and build a HashMap. My Rust solution (release mode) takes ~9.812s, while the Python version takes ~13.069s, and the small gap is making me doubt my Rust solution (I'm new to the language).

Here's the code
Rust

use csv;
use std::path::Path;
use std::collections::HashMap;

#[derive(Debug, serde::Deserialize)]
struct BimRecord {
    chr: i32,
    snp: String,
    pos: f64,
    bp: u32,
    a1: String, // alleles are String and not char because of INDELs
    a2: String,
}

fn read_bim<P>(chrom: i32, bim_path: P) -> csv::Result<Vec<BimRecord>>
    where
        P: AsRef<Path>,
{
    let records = csv::ReaderBuilder::new()
        .has_headers(false)
        .delimiter(b'\t')
        .from_path(bim_path.as_ref())?
        .deserialize()
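        // Err entries are kept (map_or defaults to true) so collect() can
        // surface them; Ok records are kept only on the requested chromosome.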
        .filter(|record: &csv::Result<BimRecord>| {
            record.as_ref().map_or(true, |r| r.chr == chrom)
        })
        .collect::<csv::Result<Vec<BimRecord>>>()?;

    Ok(records)
}


pub fn bim_hash<P>(chrom: i32, bim_path: P) -> HashMap<String, Vec<String>>
    where
        P: AsRef<Path>,
{
    let bim = read_bim(chrom, bim_path);
    let cont = bim.as_ref().unwrap().len();

    println!("The bim has {} variants", cont);

    let mut snp_vec = vec!["0".to_string(); cont];
    let mut a1_vec = vec!["0".to_string(); cont];
    let mut a2_vec = vec!["0".to_string(); cont];

    let mut num = 0;
    for rec in bim.unwrap(){
        snp_vec[num] = rec.snp;
        a1_vec[num] = rec.a1;
        a2_vec[num] = rec.a2;
        num += 1;
    }

    assert_eq!(snp_vec.len(), cont);
    assert_eq!(a1_vec.len(), cont);
    assert_eq!(a2_vec.len(), cont);

    let bim_hash = HashMap::from([
        ("SNP".to_string(), snp_vec),
        ("A1".to_string(), a1_vec),
        ("A2".to_string(), a2_vec),
    ]);

    bim_hash
}

fn main() {
    let bim = bim_hash(1, "/path/to//input.bim"); //1,653,512 lines are on chr 1
}

Python

def parse_bim(bim_file, chrom):
    bim_dict = {'SNP':[], 'A1':[], 'A2':[]}
    with open(bim_file) as ff:
        for line in ff:
            ll = line.strip().split()
            if int(ll[0]) == chrom:
                bim_dict['SNP'].append(ll[1])
                bim_dict['A1'].append(ll[4])
                bim_dict['A2'].append(ll[5])


    return bim_dict

bim = parse_bim("/path/to//input.bim", 1)

Here's what the file looks like:

22	rs9605903	0	17054720	C	T
22	rs5746647	0	17057138	G	T
22	rs5747999	0	17075353	C	A
22	rs2845380	0	17203103	A	G
22	rs2247281	0	17211075	G	A
22	rs2845346	0	17214252	C	T
22	rs2845347	0	17214669	C	T
22	rs1807512	0	17221495	C	T
22	rs5748593	0	17227461	T	C
22	rs9606468	0	17273728	C	T

Any help on how I can optimize my solution and write it better would be appreciated. Thanks!

Showing what typical lines look like would help.

Note that those initializations allocate 3 × cont Strings on the heap, only to replace them all in the for loop.

One simple change (that I think would be more like the append in the python) would then be to do something like this:

    let mut snp_vec = vec![];
    let mut a1_vec = vec![];
    let mut a2_vec = vec![];

    for rec in bim.unwrap(){
        snp_vec.push(rec.snp);
        a1_vec.push(rec.a1);
        a2_vec.push(rec.a2);
    }

Thanks! I've incorporated that. I was a bit hesitant to use push() because I wasn't sure which is faster in Rust: pre-allocating a vector of known length or growing one of unknown length. But using push() does seem to have made some difference.

I don't really know what an INDEL is, but if these are usually just a single base, then you could avoid a bunch of allocations for all these single-character strings.

For example, you could store it as

enum Allele {
    T,
    C,
    A,
    G,
    Indel(String)
}

and thus usually not need a string at all.
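The conversion could be as simple as this (just a sketch; parse_allele is a name I made up):

// Hypothetical helper: only indels pay for a String allocation.
fn parse_allele(s: &str) -> Allele {
    match s {
        "T" => Allele::T,
        "C" => Allele::C,
        "A" => Allele::A,
        "G" => Allele::G,
        other => Allele::Indel(other.to_string()),
    }
}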

Or even go further and store the two of them together, assuming they're usually the normal base pairs. But that'd be a bigger change.


You can modify @scottmcm's first solution to replace vec![] with Vec::with_capacity(cont) in the initialization to address that concern.
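That is:

    let mut snp_vec = Vec::with_capacity(cont);
    let mut a1_vec = Vec::with_capacity(cont);
    let mut a2_vec = Vec::with_capacity(cont);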


Are you using the pos field? It is declared f64, yet I don't see any floating-point values in the file. Also, processing the file as a byte array could be faster: only the indels need to end up as string data, and every String creation pays a UTF-8 validity check.
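For example, a rough sketch of the byte-array idea (the function name and shape are mine, untested on your data):

use std::fs::File;
use std::io::{BufRead, BufReader};

// Split each raw line on tab bytes; fields we only inspect never become
// Strings, and kept fields pay the UTF-8 conversion exactly once.
fn snp_ids_for_chrom(path: &str, chrom: &[u8]) -> std::io::Result<Vec<String>> {
    let mut reader = BufReader::new(File::open(path)?);
    let mut line = Vec::new();
    let mut snps = Vec::new();
    while reader.read_until(b'\n', &mut line)? > 0 {
        let mut fields = line.split(|&b| b == b'\t');
        if fields.next() == Some(chrom) {
            if let Some(snp) = fields.next() {
                snps.push(String::from_utf8_lossy(snp).into_owned());
            }
        }
        line.clear();
    }
    Ok(snps)
}

You'd call it as snp_ids_for_chrom("/path/to/input.bim", b"1").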


I would also check whether it isn't just reading the file that takes so long. Genomes have SNPs all over the place, so these files tend to be multi-GB. You are using a dedicated CSV reader in Rust, which is probably doing a lot of work under the hood in order to handle all sorts of general cases. In contrast, you are simply iterating over lines and splitting them on a tab character in Python. You should consider doing the same in Rust, using a buffered reader. (I'm not even sure if the CSV parser you are using provides buffering internally. If there is a setting for that, turn it on explicitly.)


I'm realizing now that the example data has pos as integers. It can be either an int or a float, and I wasn't sure how to represent that in a struct.

An INDEL in my case is just multiple bases, so something like ATG, AT, CTGTC, etc.

Yeah, the documentation says "The CSV reader does buffering internally". I will check how a plain buffered reader performs. Thanks!

I have tried the following

use std::fs::File;
use std::io::{prelude::*, BufReader};
use std::collections::HashMap;
use std::error::Error;


pub fn bim_hash(chrom: i32, bim_path: &str) -> Result<HashMap<String,Vec<String>>, Box<dyn Error>> {
    let file = File::open(bim_path)?;
    let reader = BufReader::new(file);

    let mut snp_vec = vec![];
    let mut a1_vec = vec![];
    let mut a2_vec = vec![];

    for line in reader.lines() {
        let x = line?;
        let ll = x.split('\t').collect::<Vec<&str>>();

        if ll[0].parse::<i32>().unwrap() == chrom {
            snp_vec.push(ll[1].to_string());
            a1_vec.push(ll[4].to_string());
            a2_vec.push(ll[5].to_string());
        }
    }

    let bim_hash = HashMap::from([
        ("SNP".to_string(), snp_vec),
        ("A1".to_string(), a1_vec),
        ("A2".to_string(), a2_vec),
    ]);


    Ok(bim_hash)

}

fn main() {
    let bim = bim_hash(1, "/path/tofile.bim");

}

and it takes almost the same time to run as the Python code, which is making me doubt my implementation.

You're probably running in debug mode; try cargo run --release.

You could avoid that temporary Vec allocation by using the next and nth iterator methods. Since they consume items from the iterator, you'd have to adjust the indices a bit:

let mut ll = x.split('\t');
let chr = ll.next();  // field 0: chromosome
let snp = ll.next();  // field 1: SNP id
let a1 = ll.nth(2);   // skips pos and bp, yields field 4
let a2 = ll.next();   // field 5

What about something like this:

use std::collections::HashMap;
use std::error::Error;
use std::fs::File;
use std::io::{BufRead, BufReader};

#[derive(Hash, Debug, Eq, PartialEq)]
pub enum Variants {
    A1,
    A2,
    SNP,
}

impl Variants {
    fn from_str(input: &str) -> Self {
        match input {
            "A1" => Self::A1,
            "A2" => Self::A2,
            "SNP" => Self::SNP,
            _ => panic!("error code"),
        }
    }
}

pub fn bim_hash(
    chrom: i32,
    bim_path: &str,
) -> Result<HashMap<Variants, Vec<Variants>>, Box<dyn Error>> {
    let file = File::open(bim_path)?;
    let reader = BufReader::new(file);

    let mut snp_vec = vec![];
    let mut a1_vec = vec![];
    let mut a2_vec = vec![];

    for line in reader.lines() {
        let x = line?;
        let ll = (x.as_str()).split("\t").collect::<Vec<&str>>();

        if ll[0].parse::<i32>().unwrap() == chrom {
            snp_vec.push(Variants::from_str(ll[1]));
            a1_vec.push(Variants::from_str(ll[4]));
            a2_vec.push(Variants::from_str(ll[5]));
        }
    }

    let bim_hash = HashMap::from([
        (Variants::SNP, snp_vec),
        (Variants::A1, a1_vec),
        (Variants::A2, a2_vec),
    ]);

    Ok(bim_hash)
}

I am actually running in release mode.

I will try using the next and nth methods, thanks!

I think this is about 2x faster. It mainly avoids creating String objects when they aren't needed. I bet you can speed it up a bit more if you can live with Vec<u8> instead of String.

use std::error::Error;
use std::fs::File;
use std::io::{BufRead, BufReader, ErrorKind};

// Byte offsets/lengths of each field within the current line buffer.
// (This struct wasn't shown above, so including it for completeness.)
#[derive(Default)]
struct BimRecord {
    chroffset: usize, chrlen: usize,
    snpoffset: usize, snplen: usize,
    posoffset: usize, poslen: usize,
    bpoffset: usize, bplen: usize,
    a1offset: usize, a1len: usize,
    a2offset: usize, a2len: usize,
}

// `Allele` is the enum suggested earlier in the thread.
pub fn bim_hash2(
    chrom: i32,
    bim_path: &str,
) -> Result<(Vec<String>, Vec<Allele>, Vec<Allele>), Box<dyn Error>> {
    let file = File::open(bim_path)?;
    let mut reader = BufReader::with_capacity(1024 * 128, file);
    let mut snp_vec: Vec<String> = Vec::with_capacity(500_000);
    let mut a1_vec: Vec<Allele> = Vec::with_capacity(500_000);
    let mut a2_vec: Vec<Allele> = Vec::with_capacity(500_000);
    let mut linebuf: Vec<u8> = Vec::with_capacity(256);
    loop {
        linebuf.clear();
        let num_bytes = reader.read_until(b'\n', &mut linebuf)?;
        if num_bytes == 0 {
            break;
        }
        // One pass over the raw bytes: record where each tab-separated
        // field starts and how long it is, with no allocation at all.
        let mut r = BimRecord::default();
        let mut fieldno = 0usize;
        let mut offset = 0usize;
        for uc in linebuf.iter() {
            offset += 1;
            if *uc == b'\t' {
                fieldno += 1;
                match fieldno {
                    1 => r.snpoffset = offset,
                    2 => r.posoffset = offset,
                    3 => r.bpoffset = offset,
                    4 => r.a1offset = offset,
                    5 => r.a2offset = offset,
                    _ => {}
                }
            } else if *uc != b'\n' {
                match fieldno {
                    0 => r.chrlen += 1,
                    1 => r.snplen += 1,
                    2 => r.poslen += 1,
                    3 => r.bplen += 1,
                    4 => r.a1len += 1,
                    5 => r.a2len += 1,
                    _ => {}
                }
            }
        }
        let chrslice = &linebuf[r.chroffset..r.chroffset + r.chrlen];
        let chr_num: i32 = String::from_utf8_lossy(chrslice).parse()?;
        if chr_num == chrom {
            let snpslice = &linebuf[r.snpoffset..r.snpoffset + r.snplen];
            let snp = String::from_utf8_lossy(snpslice).to_string();
            // Single bases map straight to enum variants; only indels
            // (length > 1) pay for a String allocation.
            let a1allele = match r.a1len {
                1 => match linebuf[r.a1offset] {
                    b'T' => Ok(Allele::T),
                    b'C' => Ok(Allele::C),
                    b'A' => Ok(Allele::A),
                    b'G' => Ok(Allele::G),
                    _ => Err(std::io::Error::new(ErrorKind::InvalidData, "bad allele")),
                },
                _ => {
                    let a1slice = &linebuf[r.a1offset..r.a1offset + r.a1len];
                    Ok(Allele::Indel(String::from_utf8_lossy(a1slice).to_string()))
                }
            }?;
            let a2allele = match r.a2len {
                1 => match linebuf[r.a2offset] {
                    b'T' => Ok(Allele::T),
                    b'C' => Ok(Allele::C),
                    b'A' => Ok(Allele::A),
                    b'G' => Ok(Allele::G),
                    _ => Err(std::io::Error::new(ErrorKind::InvalidData, "bad allele")),
                },
                _ => {
                    let a2slice = &linebuf[r.a2offset..r.a2offset + r.a2len];
                    Ok(Allele::Indel(String::from_utf8_lossy(a2slice).to_string()))
                }
            }?;
            snp_vec.push(snp);
            a1_vec.push(a1allele);
            a2_vec.push(a2allele);
        }
    }
    Ok((snp_vec, a1_vec, a2_vec))
}
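For reference, calling it would look something like this (the path is a placeholder):

fn main() -> Result<(), Box<dyn Error>> {
    let (snps, _a1s, _a2s) = bim_hash2(1, "/path/to/input.bim")?;
    println!("{} variants on chr 1", snps.len());
    Ok(())
}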