Suggested type for importing from CSV

I'm working on a template for importing and working with CSV files. I've got serde and csv working, but I've landed on the type below, and I'm not sure that putting a bunch of hashmaps inside a vec is the optimal approach. If it is, great! But something tells me this may end up being slow when processing, and there might be a built-in type that's better suited to the task.

Is this a reasonable type, or is there something better I should be using?

Vec<HashMap<String, Option<String>>>

Constraints:

  • I want to ignore but preserve the columns I don't use (otherwise I'd use an explicit struct for serde)
  • I want to be able to work with a few columns by their column name, irrespective of column order, i.e. if someone changes the order but keeps the headers correct, it should still work (thus the hashmap)
  • I want to be able to iterate through the rows (thus the vec)
//snip /*uses csv, serde*/
use std::collections::HashMap;
use std::error::Error;

/// Read a CSV file into one HashMap per row, keyed by header name.
/// Empty fields come through as None.
pub fn csv_from_path(filename: &str) -> Result<Vec<HashMap<String, Option<String>>>, Box<dyn Error>> {
    let mut output: Vec<HashMap<String, Option<String>>> = Vec::new();
    let mut digester = csv::Reader::from_path(filename)?;
    for row in digester.deserialize() {
        let record: HashMap<String, Option<String>> = row?;
        //println!("{:?}", record);
        output.push(record);
    }
    Ok(output)
}
//snip
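
For context, here's roughly how I intend to use it (reusing the imports above; the file name and the "name" column are just placeholders):

fn print_names() -> Result<(), Box<dyn Error>> {
    let rows = csv_from_path("people.csv")?;
    for row in &rows {
        // get() returns Option<&Option<String>>: outer None if the column
        // is missing entirely, inner None if the field was empty.
        if let Some(Some(name)) = row.get("name") {
            println!("{}", name);
        }
    }
    Ok(())
}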

I can post full source code, but I'm designing it as a template to be used for other projects so it's not really a minimal code example.

I think you're barking up the right tree, but there really isn't enough information here to answer your question. What matters isn't whether using a HashMap for every record is slow, but whether it's too slow, and how much tolerance you have for increased code complexity. So my question to you is: have you tried the approach you have now, and is it actually too slow? Or are you just guessing? 🙂

If you just want everything to work with somewhat reasonable performance, then I think your current approach is pretty good. It's roughly analogous to using Python's DictReader, for example, and that's fast enough for a lot of cases.

The next step would be to use something like HashMap<&str, Option<&str>> instead. This saves a bit of allocation by reusing the contents of the record for each key/value instead of creating separate allocations for them. The critical downside of this approach is that you can't just collect the maps into a Vec like you're doing now, because the keys/values of each HashMap are tied to the underlying record, and in your code that record is (implicitly) reused on each iteration.
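
If you go that route, it looks roughly like the sketch below (untested): you read into a single reused StringRecord and deserialize each row in place, so the borrowed map can only be used inside the loop body:

use std::collections::HashMap;
use std::error::Error;

fn for_each_row(filename: &str) -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::Reader::from_path(filename)?;
    let headers = rdr.headers()?.clone();
    let mut raw = csv::StringRecord::new();
    while rdr.read_record(&mut raw)? {
        // Keys borrow from `headers` and values from `raw`, so `row` cannot
        // outlive this iteration; that's why it can't be collected into a Vec.
        let row: HashMap<&str, Option<&str>> = raw.deserialize(Some(&headers))?;
        // ... work with `row` here ...
    }
    Ok(())
}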

I suspect the best thing you can do is to create a HashMap<String, usize> once for the headers, mapping each header name to its index. Then you can avoid Serde entirely and just pass around a Vec<StringRecord> (or Vec<ByteRecord>). The downside is a bit of extra indirection: you have to consult the hashmap before looking up the corresponding field in the raw record. That doesn't cost much more than the access you're already paying for with HashMap<String, Option<String>>, but it does add code complexity. I suspect you could wrap it up, though, by defining your own record type:

use std::collections::HashMap;
use std::sync::Arc;
use csv::StringRecord;

pub struct NameRecord {
    /// Shared header-name -> field-index map, built once per file.
    map: Arc<HashMap<String, usize>>,
    /// The raw fields of one row.
    record: StringRecord,
}

impl NameRecord {
    /// Look up a field by header name. Panics if the header doesn't exist.
    pub fn get(&self, field: &str) -> &str {
        &self.record[self.map[field]]
    }
}

You might need additional methods depending on your use case, but I think it should be straightforward to add them from there. The key trick here is putting the hashmap in an Arc so that you only ever need to create one of them for each CSV file you open. You do still need a separate allocation for each record, but a StringRecord has an amortized constant number of allocations (i.e., not proportional to the number of fields), so it should work pretty decently.
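
Constructing those is then just a matter of building the header map once and wrapping each raw record. A rough, untested sketch (the function name is arbitrary, and it assumes it lives in the same module as NameRecord so the private fields are visible):

use std::collections::HashMap;
use std::error::Error;
use std::sync::Arc;

pub fn name_records_from_path(filename: &str) -> Result<Vec<NameRecord>, Box<dyn Error>> {
    let mut rdr = csv::Reader::from_path(filename)?;
    // Build the header-name -> index map once and share it via the Arc.
    let map: Arc<HashMap<String, usize>> = Arc::new(
        rdr.headers()?
            .iter()
            .enumerate()
            .map(|(i, name)| (name.to_string(), i))
            .collect(),
    );
    let mut output = Vec::new();
    for record in rdr.records() {
        output.push(NameRecord {
            map: Arc::clone(&map),
            record: record?,
        });
    }
    Ok(output)
}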

An argument could be made that such a record type should probably be a first-class feature of the csv crate itself. I've thought about it briefly, but haven't had much time to really dig into the details.


This is what I was looking for. It would also let me cache the matching column indexes, moving a whole set of lookups to a single pass at the outset rather than repeating them for each record. I could see the shape of the structure in my mind; I just couldn't figure out how to build it. Thanks!

I'm going to flagrantly ignore that you already have the authoritative answer you were looking for, and point out another option: serde's flatten attribute.

You could thus have an explicit struct for the fields you want, and serde can collect the extras into a hashmap field. Super convenient.

https://serde.rs/attr-flatten.html#capture-additional-fields
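
Roughly like this, assuming the flatten machinery composes with the csv deserializer for string-typed fields (the column names here are just made up):

use std::collections::HashMap;
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct Row {
    // The columns you actually work with, matched by header name.
    name: String,
    email: Option<String>,
    // Every other column is preserved here, keyed by header name.
    #[serde(flatten)]
    extra: HashMap<String, String>,
}

Each row from the reader's deserialize iterator would then come out as a Row instead of a bare HashMap.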


That also seems like a good approach, and has the potential to be more adaptable for other serde digesters, if it's easy to use. I'll look into it (^-^)