Structuring CSV Data in Rust

Question:
I'm working on a Rust project where I need to parse CSV data, and I'm unsure about the best approach to structure the data. I'm torn between two options, and I'd appreciate your insights and feedback on which one would be more suitable for my use case.

Option 1: Struct for Each Record with a Vector of Structs
I can create a struct for each record in the CSV, where each struct represents a record with its fields. To group these records, I can store them in a vector of structs. This provides strong typing and allows me to access fields using struct field syntax. Is this a good approach when the CSV records have a consistent structure?

struct Record {
    field1: String,
    field2: i32,
    field3: f64,
}

let records: Vec<Record> = todo!("parse the CSV into a vector of records");

Additionally, I plan to implement methods that take into account not only specific records but also the aggregation of the data. For example, in the first choice, I'd have to iterate over the entire vector to extract the data I require. Is this approach efficient when dealing with aggregation, or is there a more efficient way to handle it?

Option 2: Struct with Attributes as Vectors
Alternatively, I can create a single struct with attributes as vectors, where each vector contains the data for a specific field. This approach is more flexible and accommodates varying CSV structures. The data is already in a vector within a single struct, making aggregation operations more straightforward. Should I consider this when dealing with CSV data that has different structures across records?

struct CSVData {
    field1: Vec<String>,
    field2: Vec<i32>,
    field3: Vec<f64>,
}

let data: CSVData = todo!("parse the CSV into a struct with vector attributes");

I'd love to hear your thoughts on which option you think is more appropriate or if you have any alternative suggestions. Thank you for your help!

In my opinion, a Vector of Structs (usually called Array of Structs, AoS) is much simpler to write and use. With a Struct of Vectors (usually called Struct of Arrays, SoA), you might end up building your own entity component system. So unless you really need the performance (benchmark first) or know that the handling is definitely better in the SoA case, go with the first option.
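To make the AoS option concrete, here is a minimal sketch of aggregating over a vector of records with plain iterators. The `Record` struct mirrors the one in the question; the helper names `sum_field2` and `mean_field3` are made up for illustration, and the records are built by hand rather than parsed from a real file.

```rust
// Record type mirroring the struct from the question.
#[derive(Debug, Clone, PartialEq)]
struct Record {
    field1: String,
    field2: i32,
    field3: f64,
}

/// Hypothetical helper: sum of `field2` in a single pass over the records.
fn sum_field2(records: &[Record]) -> i32 {
    records.iter().map(|r| r.field2).sum()
}

/// Hypothetical helper: mean of `field3`, or `None` for an empty slice.
fn mean_field3(records: &[Record]) -> Option<f64> {
    if records.is_empty() {
        return None;
    }
    let total: f64 = records.iter().map(|r| r.field3).sum();
    Some(total / records.len() as f64)
}

fn main() {
    // Hand-built sample data standing in for parsed CSV rows.
    let records = vec![
        Record { field1: "a".to_string(), field2: 1, field3: 0.5 },
        Record { field1: "b".to_string(), field2: 2, field3: 1.5 },
    ];

    println!("first label = {}", records[0].field1);
    println!("sum of field2 = {}", sum_field2(&records));
    println!("mean of field3 = {:?}", mean_field3(&records));
}
```

Each aggregation is a single linear pass, which is usually fine for a CSV that fits in memory. Whether it is fast enough for your data is something only a benchmark can tell you.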

Generally, a columnar layout (Option 2) gives better memory access patterns for aggregations, because all values of one field sit contiguously in memory. If you have to design a parser for CSV structures, consider reading about some of the work done on Arrow's columnar format.
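As a rough illustration of the columnar layout, the sketch below keeps one vector per field and aggregates over a single column at a time. `CSVData` follows the struct in the question; the methods `push_row`, `len`, `sum_field2` and `max_field3` are hypothetical names chosen for this example, and this is not how Arrow or polars implement their columns.

```rust
// Columnar (SoA) container mirroring the struct from the question.
#[derive(Debug, Default)]
struct CSVData {
    field1: Vec<String>,
    field2: Vec<i32>,
    field3: Vec<f64>,
}

impl CSVData {
    /// Hypothetical helper: append one row, keeping all columns the same length.
    fn push_row(&mut self, field1: String, field2: i32, field3: f64) {
        self.field1.push(field1);
        self.field2.push(field2);
        self.field3.push(field3);
    }

    /// Number of rows (every column has the same length by construction).
    fn len(&self) -> usize {
        self.field1.len()
    }

    /// Sum over one contiguous column; the other columns are never touched.
    fn sum_field2(&self) -> i32 {
        self.field2.iter().sum()
    }

    /// Largest value in `field3`, or `None` if there are no rows.
    fn max_field3(&self) -> Option<f64> {
        self.field3.iter().copied().reduce(f64::max)
    }
}

fn main() {
    // Hand-built sample data standing in for parsed CSV rows.
    let mut data = CSVData::default();
    data.push_row("a".to_string(), 1, 0.5);
    data.push_row("b".to_string(), 2, 1.5);

    println!("rows = {}", data.len());
    println!("sum of field2 = {}", data.sum_field2());
    println!("max of field3 = {:?}", data.max_field3());
}
```

The cost of this layout is that the struct itself must keep the columns in sync, which is why rows are only added through `push_row` here. Reassembling a single record means indexing every column.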

If you just need to parse a CSV file and don't want to implement it yourself, a few options include:

  • polars: Data frames and powerful aggregations for data science work; it uses Arrow for its columns and is a reliable solution for most data exploration purposes
  • csv: Allows you to deserialize your data into structures representing rows using serde