I played around with this for a little while.
With a bit of printf-profiling I found that your bottleneck is the deserialization of the CSVs into `Vec<Record>` (aha, allocation!). I rewrote it to remove the intermediate call to `collect`. That didn't do the trick. Looking a bit further, I spotted that you are allocating two strings each time you deserialize a `Record`. I rewrote it to use `serde`'s zero-copy parsing. That didn't do much either!
Eventually I discovered that you are triggering some kind of pathological case in the `csv` crate. It seems to occur because the CSVs are very wide (~80 columns): `csv` appears to scan through the header every time it deserializes a row. Perhaps there is some kind of O(n^2) logic going on inside the `csv`-`serde` handshake. (@BurntSushi?)
I managed to get a factor of 4 speedup by passing the header explicitly to each `deserialize` call (through the `StringRecord` interface), after first pruning the header so it was only the necessary length (which turned out to be 5).
So, my advice is to pre-process all your CSVs with AWK to remove all but the first 5 columns.
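If AWK isn't to hand, the same column pruning can be sketched in plain Rust. This is only a rough illustration, not what I actually ran: `prune_line` and `prune_csv` are made-up helper names, the sample column names are hypothetical, and it assumes there are no quoted fields containing commas or embedded newlines (a real CSV parser is needed for those).

```rust
/// Keep only the first `keep` comma-separated fields of one CSV line.
/// Assumption: no quoted fields that contain commas or newlines.
fn prune_line(line: &str, keep: usize) -> String {
    line.split(',').take(keep).collect::<Vec<_>>().join(",")
}

/// Apply `prune_line` to every line of `input`, joining the results with '\n'.
fn prune_csv(input: &str, keep: usize) -> String {
    input
        .lines()
        .map(|line| prune_line(line, keep))
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    // Hypothetical sample: a header row and one data row, 7 columns wide.
    let input = "date,serial,model,capacity,failure,extra_1,extra_2\n\
                 2019-01-01,Z1,ST4000,4000787030016,0,117,0";
    // Only the first 5 columns of each row are printed.
    println!("{}", prune_csv(input, 5));
}
```

Lines that already have fewer than `keep` fields pass through unchanged, so it should be harmless to run over files that were pruned earlier.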
My hacky code changes are below:
    files
        .into_par_iter()
        .filter_map(|pb| {
            println!("Processing file {:?}", pb);
            // Skip any file that cannot be opened.
            Reader::from_path(pb).ok()
        })
        .for_each(|mut reader| {
            // Keep only the first 5 header fields. Ta-dah!
            let header: csv::StringRecord =
                reader.headers().unwrap().clone().iter().take(5).collect();
            for record in reader.records() {
                if let Ok(record) = record {
                    // Deserialize against the pruned header rather than the full one.
                    let r = record.deserialize::<Record>(Some(&header));
                    if let Ok(r) = r {
                        if r.failure == 0 {
                            drives.upsert(
                                r.serial.into(),
                                // First sighting of this serial: insert a new Drive.
                                move || {
                                    Drive {
                                        model: Some(r.model.into()),
                                        capacity: r.capacity,
                                        life: 1,
                                    }
                                },
                                // Already seen: bump its life counter.
                                |d| d.life += 1,
                            );
                        }
                    }
                }
            }
        });