Time for performance optimization

adwhit · August 27, 2017, 2:37am

I played around with this for a little while.

Discovered with a bit of printf-profiling that your bottleneck is the deserialization of the CSVs into Vec<Record> (aha, allocation!). Rewrote to remove the intermediate call to collect. That didn't do the trick. Looked a bit further, spotted that you are allocating two strings each time you deserialize a Record. Rewrote to use serde's zero-copy parsing. That didn't do much either!
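To show what "zero-copy" means here, a rough std-only sketch (not the serde version I actually tried): the string fields borrow from the input line instead of being copied into fresh Strings. The field names match the Record used in the code further down, but `BorrowedRecord`, `parse_borrowed` and the `serial,model,capacity,failure` column order are made up for illustration, and quoted fields are not handled.

```rust
/// Hypothetical borrowed record: `serial` and `model` point into the input
/// line, so parsing a row allocates no Strings.
#[derive(Debug, PartialEq)]
struct BorrowedRecord<'a> {
    serial: &'a str,
    model: &'a str,
    capacity: u64,
    failure: u8,
}

/// Parses one line laid out as `serial,model,capacity,failure`.
/// That column order is an assumption for this sketch, and quoted fields
/// containing commas are not supported.
fn parse_borrowed(line: &str) -> Option<BorrowedRecord<'_>> {
    let mut fields = line.split(',');
    let serial = fields.next()?;
    let model = fields.next()?;
    // Numeric fields are parsed in place; a missing or malformed field
    // yields `None`.
    let capacity = fields.next()?.trim().parse().ok()?;
    let failure = fields.next()?.trim().parse().ok()?;
    Some(BorrowedRecord { serial, model, capacity, failure })
}
```

The catch is the lifetime: a record like this cannot outlive the line it was parsed from, which is why the code further down still converts with `.into()` before storing anything in the map.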

Eventually I discovered that you are triggering some kind of pathological case in the csv crate. It seems to have occurred because the CSVs are very wide (~80 columns): csv appears to scan through the whole header when deserializing each row. Perhaps there is some kind of O(n^2) logic going on inside the csv/serde handshake. (@BurntSushi?)

I managed to get a factor of 4 speedup by passing in the header explicitly to each deserialize call (through the StringRecord interface) but first pruning the header so it was only the necessary length (which turned out to be 5).

So, my advice is to pre-process all your CSVs with AWK to remove all but the first 5 columns.
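If you'd rather stay in Rust than reach for AWK, the same pruning could look something like this. It's a naive sketch: `keep_first_columns` and `prune_csv` are names I made up, and it assumes no quoted fields containing commas, so check that against your data first.

```rust
/// Keeps only the first `n` comma-separated fields of one CSV line.
/// Naive: assumes no quoted fields containing commas.
fn keep_first_columns(line: &str, n: usize) -> &str {
    if n == 0 {
        return "";
    }
    // Find the byte offset of the n-th comma and cut the line there.
    match line.match_indices(',').nth(n - 1) {
        Some((idx, _)) => &line[..idx],
        // The line has n fields or fewer: nothing to trim.
        None => line,
    }
}

/// Applies `keep_first_columns` to every line (header included) and joins
/// the results with newlines.
fn prune_csv(input: &str, n: usize) -> String {
    input
        .lines()
        .map(|line| keep_first_columns(line, n))
        .collect::<Vec<_>>()
        .join("\n")
}
```

Calling `prune_csv(&contents, 5)` on the text of one file would keep the header and every row down to the first five fields.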

My hacky code changes are below:

    files
        .into_par_iter()
        .filter_map(|pb| {
            println!("Processing file {:?}", pb);
            Reader::from_path(pb).ok()
        })
        .for_each(|mut reader| {
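            // Keep only the first 5 header fields; collect() builds a new
            // StringRecord, so no clone of the full header is needed.
            let header = reader.headers().unwrap().iter().take(5).collect();  // Ta-dah!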
            for record in reader.records() {
                if let Ok(record) = record {
                    let r = record.deserialize::<Record>(Some(&header));
                    if let Ok(r) = r {
                        if r.failure == 0 {
                            drives.upsert(
                                r.serial.into(),
                                move || {
                                    Drive {
                                        model: Some(r.model.into()),
                                        capacity: r.capacity,
                                        life: 1,
                                    }
                                },
                                |d| d.life += 1,
                            );
                        }
                    }
                }
            }
        });
