Time for performance optimization

I played around with this for a little while.

Discovered with a bit of printf-profiling that your bottleneck is the deserialization of the CSVs into Vec<Record> (aha, allocation!). Rewrote to remove the intermediate call to collect. That didn't do the trick. Looked a bit further and spotted that you are allocating two strings each time you deserialize a Record. Rewrote to use serde's zero-copy parsing. That didn't do much either!

Eventually I discovered that you are triggering some kind of pathological case in the csv crate. It seems to occur because the CSVs are very wide (~80 columns): csv appears to scan through the header every time it deserializes a row. Perhaps there is some kind of O(n^2) logic going on inside the csv-serde handshake. (@BurntSushi?)

I managed to get a factor of 4 speedup by passing the header explicitly to each deserialize call (through the StringRecord interface), after first pruning the header down to only the necessary length (which turned out to be 5).

So, my advice is to pre-process all your CSVs with AWK to remove all but the first 5 columns :slight_smile:
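If you'd rather not reach for AWK, here is a rough sketch of the same pre-processing idea in plain Rust. It is not what I actually ran: `keep_first_columns` is a made-up helper, the column names in the sample are placeholders, and it splits naively on commas, so it assumes no field contains a quoted comma.

```rust
/// Keep only the first `n` comma-separated fields of one CSV line.
/// Hypothetical helper: naive split, assumes no quoted fields containing commas.
fn keep_first_columns(line: &str, n: usize) -> String {
    line.split(',').take(n).collect::<Vec<_>>().join(",")
}

fn main() {
    // Placeholder data: a header and one row with a few extra columns.
    let input = "date,serial,model,capacity,failure,extra_1,extra_2\n\
                 2020-01-01,ABC123,ST4000,4000787030016,0,117,0";

    // Print every line truncated to its first 5 columns.
    for line in input.lines() {
        println!("{}", keep_first_columns(line, 5));
    }
}
```

Lines with fewer than `n` fields pass through unchanged, so it should be safe to run over files that are already narrow.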

My hacky code changes are below:

    files
        .into_par_iter()
        .filter_map(|pb| {
            println!("Processing file {:?}", pb);
            Reader::from_path(pb).ok()
        })
        .for_each(|mut reader| {
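            // Keep only the first 5 header fields; the type annotation makes the collect target explicit.
            let header: csv::StringRecord = reader.headers().unwrap().iter().take(5).collect();  // Ta-dah!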
            for record in reader.records() {
                if let Ok(record) = record {
                    let r = record.deserialize::<Record>(Some(&header));
                    if let Ok(r) = r {
                        if r.failure == 0 {
                            drives.upsert(
                                r.serial.into(),
                                move || {
                                    Drive {
                                        model: Some(r.model.into()),
                                        capacity: r.capacity,
                                        life: 1,
                                    }
                                },
                                |d| d.life += 1,
                            );
                        }
                    }
                }
            }
        });
