CSV: Reading a fixed-size field into an array of char

I am trying to deserialize a large CSV file using the csv crate into a struct like:

#[derive(Debug, Deserialize)]
struct Record {
    record_id: u32,
    period: [char; 6],
}

(Note: There are many more fields and this is a contrived example.)

The purpose of using [char; 6] instead of String is to reduce the memory used by the parsed structure. But I am getting an error like:

DeserializeError { 
    field: Some(43), 
    kind: Message("expected single character but got 6 characters in \'A1B2C3\'")
}
  1. How can I fix this?
  2. Is there really a memory benefit in using [char; 6] vs. String? (Since the code does not compile, I cannot benchmark this myself.)

Consider using a crate like smallstr with the serde feature enabled.
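
If you want a fixed-size field without an extra dependency, another option is a custom deserialize_with helper. Here is a minimal sketch (the six_bytes helper is mine, not part of csv or serde) that stores the cell as [u8; 6] rather than [char; 6], since the cell arrives as a string and each char would cost 4 bytes anyway:

use serde::{de, Deserialize, Deserializer};

#[derive(Debug, Deserialize)]
struct Record {
    record_id: u32,
    #[serde(deserialize_with = "six_bytes")]
    period: [u8; 6],
}

// Hypothetical helper: read the cell as a String, then copy it into a fixed
// [u8; 6], failing if it is not exactly 6 bytes long.
fn six_bytes<'de, D: Deserializer<'de>>(deserializer: D) -> Result<[u8; 6], D::Error> {
    let s = String::deserialize(deserializer)?;
    let bytes = s.as_bytes();
    if bytes.len() != 6 {
        return Err(de::Error::custom(format!(
            "expected 6 bytes but got {} in '{}'",
            bytes.len(),
            s
        )));
    }
    let mut out = [0u8; 6];
    out.copy_from_slice(bytes);
    Ok(out)
}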

3 Likes

I was hoping to save big on memory but the crate (or some other hacks I tried) did not make a meaningful difference at the end of the day.
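
For what it's worth, a quick size check (on a 64-bit target) shows why the saving is modest at best: String's inline part is the same 24 bytes as [char; 6], plus only a small heap allocation for the six characters.

use std::mem::size_of;

fn main() {
    println!("[char; 6]: {} bytes", size_of::<[char; 6]>()); // 24 (6 x 4-byte char)
    println!("String:    {} bytes", size_of::<String>());    // 24 (ptr + len + capacity)
    println!("[u8; 6]:   {} bytes", size_of::<[u8; 6]>());   // 6
}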

Thanks for the pointer anyways. :slight_smile:

That's... Great news! It might not feel like it to you though.

Thanks for reporting back! I am glad to not need to be paranoid about the size of my Strings :slight_smile:

1 Like

Just out of curiosity, what is the “final” destination for the data you’re reading in from the CSV?

I ask because you might not ever have to “take ownership” of the data in your struct before copying it to the final storage facility. Whatever processing happens in between might only require a lifetime that matches your read access to the data.

1 Like

Here is a bird's-eye view of the flow:

  1. Read from a CSV file A.csv
  2. Some basic processing for each row in A (either some arithmetic over numeric columns or concatenation of string columns)
  3. Write the processed data to B.csv

I think the point here is whether you can do your transformation in a row-by-row fashion. That is, do you really need to read everything from A before writing anything to B? If not, then you can just read from A and write to B row by row. In that case, you can deserialize into &str fields.
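
For example, here is a minimal sketch of that streaming approach with the csv crate, borrowing the field names from the Record example above; the per-row work is just a placeholder:

use std::error::Error;

use csv::{ReaderBuilder, StringRecord, WriterBuilder};
use serde::Deserialize;

// Borrowed view of one input row; `period` points into the current record,
// so no per-row String allocation is needed.
#[derive(Deserialize)]
struct RowIn<'a> {
    record_id: u32,
    period: &'a str,
}

fn transform(in_path: &str, out_path: &str) -> Result<(), Box<dyn Error>> {
    let mut rdr = ReaderBuilder::new().from_path(in_path)?;
    let mut wtr = WriterBuilder::new().from_path(out_path)?;

    let headers = rdr.headers()?.clone();
    let mut record = StringRecord::new();

    // One row in, one row out: memory use stays flat regardless of file size.
    while rdr.read_record(&mut record)? {
        let row: RowIn = record.deserialize(Some(&headers))?;
        // ... per-row arithmetic / concatenation goes here ...
        wtr.write_record(&[row.record_id.to_string(), row.period.to_uppercase()])?;
    }
    wtr.flush()?;
    Ok(())
}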

2 Likes

Exactly what I said :))

Thanks! With your tip and a combination of mmap + rayon, I was able to bring down not just memory usage but also the processing time for parsing the CSV, from ~135s to ~33s.

Question: Using perf stat -d <rust_release_binary>, I am getting the following stats:

     113003.484399      task-clock (msec)         #    3.393 CPUs utilized          
             4,403      context-switches          #    0.039 K/sec                  
                 5      cpu-migrations            #    0.000 K/sec                  
         12,78,699      page-faults               #    0.011 M/sec                  
 4,65,40,69,02,392      cycles                    #    4.119 GHz                    
13,74,36,26,61,550      instructions              #    2.95  insn per cycle         
 2,84,05,92,84,983      branches                  # 2513.721 M/sec                  
    2,08,53,19,460      branch-misses             #    0.73% of all branches        
 3,02,38,08,88,053      L1-dcache-loads           # 2675.855 M/sec                  
    1,35,35,51,164      L1-dcache-load-misses     #    0.45% of all L1-dcache hits  
       6,69,99,838      LLC-loads                 #    0.593 M/sec                  
       4,45,23,862      LLC-load-misses           #   66.45% of all LL-cache hits 

The last line:

4,45,23,862 LLC-load-misses # 66.45% of all LL-cache hits

How can I reduce the cache-miss rate (I mean, what kind of code structures could lead to high cache misses), and would attempting to reduce it have a sizable impact?

Why not share your code and make your benchmark reproducible by others? Then folks might be able to help more.

The only way to answer your question is to try it and measure it.

Not sure if the code by itself will be of much help for reproducing this without the data file (which I have no authority to share), but here is the meat of the code:

Packages and imports:

use env_logger;
use log::error;
use memmap;
use rayon::prelude::*;
use serde::Serialize;
use std::error::Error;
use std::path::Path;

Only a subset of the columns from the input CSV is parsed into the following
struct and serialized to a new CSV file.

#[derive(Debug, Serialize)]
#[repr(C)]
struct OutRecord {
    record_id: u32,
    value: f32,
    months: u8,
    loss_amount: f32,
    origin: u32,
    observation: u32,
    origin_half: u8,
    origin_month: u32,
    due: u32,
    acc_b: f32,
    origin_year: u32,
}

Below is the function that does the heavy lifting. Outside of this there is just the bootstrap code,
which calls this function and does a bit of logging and error handling.

I am not yet writing the generated Vec<OutRecord> to a new file while benchmarking, as I am only interested in parse speed for now.

fn gen_out_records(abs_path: &str) -> Result<(), Box<dyn Error>> {
    let fp = std::fs::OpenOptions::new()
        .read(true)
        .append(false)
        .create(false)
        .write(false)
        .open(Path::new(&abs_path))?;

    let mmap = unsafe { memmap::Mmap::map(&fp)? };
    let mut cursor_start: usize = 0;
    let mut cursor_end: usize = 0;

    let mut items = Vec::with_capacity(50_000_000);

    for chunk in mmap.chunks(256) {
        for c in chunk {
            if *c == 10u8 { // 10u8 is newline character
                items.push(&mmap[cursor_start..cursor_end]);
                cursor_start = cursor_end + 1;
            }
            cursor_end += 1;
        }
    }

    println!("Starting parallel iterator.");

    let _v: Vec<OutRecord> = items[1..] // Skip first row, it is just headers
        .par_iter()
        .map(|x| std::str::from_utf8(x).unwrap())
        .map(|record| {
            let cells: Vec<&str> = record.split(",").collect();
            let origin = cells[20].trim_matches('"').parse::<u32>().unwrap();
            let months = cells[4].trim_matches('"').parse::<u8>().unwrap();

            let origin_year = origin / 100;
            let origin_month = (origin - 200000) % 100;

            let loss_mid = cells[25].trim_matches('"').parse::<u32>().unwrap_or(0);
            let observation_mid = cells[43].trim_matches('"').parse::<u32>().unwrap();

            let loss_amount = if observation_mid >= loss_mid {
                cells[71].trim_matches('"').parse().unwrap()
            } else {
                0f32
            };

            OutRecord {
                record_id: cells[0].trim_matches('"').parse().unwrap(),
                value: cells[2].trim_matches('"').parse().unwrap(),
                months,
                loss_amount,
                origin: origin - 200000,
                observation: observation_mid - 200000,
                origin_half: (origin_month as f32/ 6 as f32).ceil() as u8,
                origin_month: origin,
                due: cells[49].trim_matches('"').parse().unwrap(),
                acc_b: cells[78].trim_matches('"').parse().unwrap(),
                origin_year,
            }
        })
        .collect();
    Ok(())
}

PS: Note that I have managed to eliminate the string concatenation operations that were referenced in some of my earlier posts.

What I see as limiting is the allocation for the Vec that hosts the slices: items.push(&mmap[cursor_start..cursor_end]).

A couple of ideas: I might create a constructor for the OutRecord type that takes the slice as input (i.e., move into it the logic you now have in the closures). This way you avoid the allocation "in the middle".
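
For instance, here is a minimal sketch along those lines, assuming a hypothetical OutRecord::from_line(&[u8]) -> Option<OutRecord> constructor that holds the parsing logic currently in the closure; rayon's par_split then works straight off the mmap and the intermediate Vec of line slices goes away entirely:

use memmap::Mmap;
use rayon::prelude::*;

fn gen_out_records_direct(mmap: &Mmap) -> Vec<OutRecord> {
    // Skip the header row up front so only data rows reach the parallel split.
    let body_start = mmap.iter().position(|&b| b == b'\n').map_or(0, |i| i + 1);

    mmap[body_start..]
        .par_split(|&b| b == b'\n')       // parallel line splitting, no indexing pass
        .filter(|line| !line.is_empty())  // the trailing newline yields an empty tail
        .filter_map(OutRecord::from_line) // hypothetical constructor, see above
        .collect()
}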

When it comes time to allocate the Vec<OutRecord>, I'm curious to know what @BurntSushi might have to say, but my guess would be that you should just let Rust optimize the growth of the Vec (i.e., I question: Vec::with_capacity(50_000_000) unless you know something ahead of time; see the next point).

On a different but related note, if you do know something about the number of lines/records in the CSV (for instance, is there anything strategic about the 256 chunks? Can you infer anything from the length of the mmap?), I would use that information in two ways:

  1. replace a for loop with a while loop (I only suggest this for large-scale parsing)
  2. use the information to initialize the hosting Vec (related to the previous point; see the sketch below)
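
A rough sketch of that second point, with AVG_ROW_BYTES as a made-up figure you would measure once for the real data (e.g. from wc -c divided by wc -l):

use memmap::Mmap;

// Made-up average row width in bytes; measure it once for the real file.
const AVG_ROW_BYTES: usize = 120;

// Size the line-slice Vec from the mmap length instead of hard-coding 50_000_000.
fn line_buffer(mmap: &Mmap) -> Vec<&[u8]> {
    Vec::with_capacity(mmap.len() / AVG_ROW_BYTES + 1)
}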

All this said, I defer to @BurntSushi, who values fast I/O and was instrumental in much of Rust's I/O capability.

What I would say is that we need a reproducible benchmark so that others can measure. Give us commands to clone a repo and run your benchmark.

@EdmundsEcho
Thanks, I will play around with some of your pointers to see if that helps.

I question: Vec::with_capacity(50_000_000) unless you know something ahead of time

It is a close approximation of the number of lines in the file, based on the output of wc -l; the actual line count is slightly more than 50 million.

if you do know something about the number of lines/records in the CSV (for instance, is there anything strategic about the 256 chunks?

I have some idea about the average size of each row and their total number, but 256 itself was arrived at purely by trial and error. :sweat_smile:

@BurntSushi

What I would say is that we need a reproducible benchmark so that others can measure

Thanks! I will get back to you on this in a few days. The hard part is coming up with a script that can replicate the structure of the CSV I am using but with fake contents.

1 Like