CSV: Reading a fixed-size field into an array of char

I am trying to deserialize a large CSV file using the csv crate into a struct like:

#[derive(Debug, Deserialize)]
struct Record {
    record_id: u32,
    period: [char; 6],
}

(Note: There are many more fields and this is a contrived example.)

The purpose of using [char; 6] instead of String is to reduce the memory used by the parsed structure. But I am getting an error like:

DeserializeError { 
    field: Some(43), 
    kind: Message("expected single character but got 6 characters in \'A1B2C3\'")
}
  1. How can I fix this?
  2. Is there really a memory benefit in using [char; 6] vs. String? (Since the code does not compile, I cannot benchmark this myself.)

Consider using a crate like smallstr with the serde feature enabled.
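
If you want a fixed-size field without an extra dependency, another option is a custom deserialize_with helper. Here is a minimal sketch (the six_bytes helper is mine, not part of csv or serde) that stores the cell as [u8; 6] rather than [char; 6], since the cell arrives as a string and each char would cost 4 bytes anyway:

use serde::{de, Deserialize, Deserializer};

#[derive(Debug, Deserialize)]
struct Record {
    record_id: u32,
    #[serde(deserialize_with = "six_bytes")]
    period: [u8; 6],
}

// Hypothetical helper: read the cell as a String, then copy it into a fixed
// [u8; 6], failing if it is not exactly 6 bytes long.
fn six_bytes<'de, D: Deserializer<'de>>(deserializer: D) -> Result<[u8; 6], D::Error> {
    let s = String::deserialize(deserializer)?;
    let bytes = s.as_bytes();
    if bytes.len() != 6 {
        return Err(de::Error::custom(format!(
            "expected 6 bytes but got {} in '{}'",
            bytes.len(),
            s
        )));
    }
    let mut out = [0u8; 6];
    out.copy_from_slice(bytes);
    Ok(out)
}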

3 Likes

I was hoping to save big on memory but the crate (or some other hacks I tried) did not make a meaningful difference at the end of the day.
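
For what it's worth, a quick size check (on a 64-bit target) shows why the saving is modest at best: String's inline part is the same 24 bytes as [char; 6], plus only a small heap allocation for the six characters.

use std::mem::size_of;

fn main() {
    println!("[char; 6]: {} bytes", size_of::<[char; 6]>()); // 24 (6 x 4-byte char)
    println!("String:    {} bytes", size_of::<String>());    // 24 (ptr + len + capacity)
    println!("[u8; 6]:   {} bytes", size_of::<[u8; 6]>());   // 6
}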

Thanks for the pointer anyways. :slight_smile:

That's... Great news! It might not feel like it to you though.

Thanks for reporting back! I am glad to not need to be paranoid about the size of my Strings :slight_smile:

1 Like

Just out of curiosity, what is the “final” destination for the data you’re reading in from the CSV?

I ask because you might not ever have to “take ownership” of the data in your struct before copying it to the final storage facility. Whatever processing happens in between might only require a lifetime that matches your read access to the data.

1 Like

Here is a bird's-eye view of the flow:

  1. Read from a CSV file A.csv
  2. Some basic processing for each row in A (either some arithmetic over numeric columns or concatenation of string columns)
  3. Write the processed data to B.csv

I think the point here is whether you can do your transformation in a row-by-row fashion. That is, do you really need to read everything from A before writing anything to B? If not, then you can just read from A and write to B row by row. In that case, you can deserialize into &str fields.
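
For example, here is a minimal sketch of that streaming approach with the csv crate, borrowing the field names from the Record example above; the per-row work is just a placeholder:

use std::error::Error;

use csv::{ReaderBuilder, StringRecord, WriterBuilder};
use serde::Deserialize;

// Borrowed view of one input row; `period` points into the current record,
// so no per-row String allocation is needed.
#[derive(Deserialize)]
struct RowIn<'a> {
    record_id: u32,
    period: &'a str,
}

fn transform(in_path: &str, out_path: &str) -> Result<(), Box<dyn Error>> {
    let mut rdr = ReaderBuilder::new().from_path(in_path)?;
    let mut wtr = WriterBuilder::new().from_path(out_path)?;

    let headers = rdr.headers()?.clone();
    let mut record = StringRecord::new();

    // One row in, one row out: memory use stays flat regardless of file size.
    while rdr.read_record(&mut record)? {
        let row: RowIn = record.deserialize(Some(&headers))?;
        // ... per-row arithmetic / concatenation goes here ...
        wtr.write_record(&[row.record_id.to_string(), row.period.to_uppercase()])?;
    }
    wtr.flush()?;
    Ok(())
}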

2 Likes

Exactly what I said :))

Thanks! With your tip and a combination of mmap + rayon, I was able to bring down not just memory usage but also the processing time for parsing the CSV, from ~135s to ~33s.

Question: Using perf stat -d <rust_release_binary>, I am getting the following stats:

     113003.484399      task-clock (msec)         #    3.393 CPUs utilized          
             4,403      context-switches          #    0.039 K/sec                  
                 5      cpu-migrations            #    0.000 K/sec                  
         12,78,699      page-faults               #    0.011 M/sec                  
 4,65,40,69,02,392      cycles                    #    4.119 GHz                    
13,74,36,26,61,550      instructions              #    2.95  insn per cycle         
 2,84,05,92,84,983      branches                  # 2513.721 M/sec                  
    2,08,53,19,460      branch-misses             #    0.73% of all branches        
 3,02,38,08,88,053      L1-dcache-loads           # 2675.855 M/sec                  
    1,35,35,51,164      L1-dcache-load-misses     #    0.45% of all L1-dcache hits  
       6,69,99,838      LLC-loads                 #    0.593 M/sec                  
       4,45,23,862      LLC-load-misses           #   66.45% of all LL-cache hits 

The last line:

4,45,23,862 LLC-load-misses # 66.45% of all LL-cache hits

How can I reduce the cache-miss rate (I mean, what kind of code structures could lead to high cache misses), and would attempting to reduce it have a sizable impact?

Why not share your code and make your benchmark reproducible by others? Then folks might be able to help more.

The only way to answer your question is to try it and measure it.

Not sure if the code by itself will be of much help for reproducing this without the data file (which I have no authority to share), but here is the meat of the code:

Packages and imports:

use env_logger;
use log::error;
use memmap;
use rayon::prelude::*;
use serde::Serialize;
use std::error::Error;
use std::path::Path;

Only a subset of the columns from the input CSV is parsed into the following
struct and serialized to a new CSV file.

#[derive(Debug, Serialize)]
#[repr(C)]
struct OutRecord {
    record_id: u32,
    value: f32,
    months: u8,
    loss_amount: f32,
    origin: u32,
    observation: u32,
    origin_half: u8,
    origin_month: u32,
    due: u32,
    acc_b: f32,
    origin_year: u32,
}

Below is the function that does the heavy lifting. Outside of this there is just the bootstrap code,
which calls this function and does a bit of logging and error handling.

I am not yet writing the generated Vec<OutRecord> to a new file while benchmarking, as I am only interested in parse speed for now.

fn gen_out_records(abs_path: &str) -> Result<(), Box<dyn Error>> {
    let fp = std::fs::OpenOptions::new()
        .read(true)
        .append(false)
        .create(false)
        .write(false)
        .open(Path::new(&abs_path))?;

    let mmap = unsafe { memmap::Mmap::map(&fp)? };
    let mut cursor_start: usize = 0;
    let mut cursor_end: usize = 0;

    let mut items = Vec::with_capacity(50_000_000);

    for chunk in mmap.chunks(256) {
        for c in chunk {
            if *c == 10u8 { // 10u8 is newline character
                items.push(&mmap[cursor_start..cursor_end]);
                cursor_start = cursor_end + 1;
            }
            cursor_end += 1;
        }
    }

    println!("Starting parallel iterator.");

    let _v: Vec<OutRecord> = items[1..] // Skip first row, it is just headers
        .par_iter()
        .map(|x| std::str::from_utf8(x).unwrap())
        .map(|record| {
            let cells: Vec<&str> = record.split(",").collect();
            let origin = cells[20].trim_matches('"').parse::<u32>().unwrap();
            let months = cells[4].trim_matches('"').parse::<u8>().unwrap();

            let origin_year = origin / 100;
            let origin_month = (origin - 200000) % 100;

            let loss_mid = cells[25].trim_matches('"').parse::<u32>().unwrap_or(0);
            let observation_mid = cells[43].trim_matches('"').parse::<u32>().unwrap();

            let loss_amount = if observation_mid >= loss_mid {
                cells[71].trim_matches('"').parse().unwrap()
            } else {
                0f32
            };

            OutRecord {
                record_id: cells[0].trim_matches('"').parse().unwrap(),
                value: cells[2].trim_matches('"').parse().unwrap(),
                months,
                loss_amount,
                origin: origin - 200000,
                observation: observation_mid - 200000,
                origin_half: (origin_month as f32/ 6 as f32).ceil() as u8,
                origin_month: origin,
                due: cells[49].trim_matches('"').parse().unwrap(),
                acc_b: cells[78].trim_matches('"').parse().unwrap(),
                origin_year,
            }
        })
        .collect();
    Ok(())
}

PS: Note that I have managed to eliminate the string concatenation operations that were referenced in some of my earlier posts.

What I see as limiting is the allocation for the Vec that hosts the slices: items.push(&mmap[cursor_start..cursor_end]).

A couple of ideas: I might create a constructor for the OutRecord type that takes the slice as input (i.e., move into it the logic you now have in the closures). This way you avoid the allocation "in the middle".
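
For instance, here is a minimal sketch along those lines, assuming a hypothetical OutRecord::from_line(&[u8]) -> Option<OutRecord> constructor that holds the parsing logic currently in the closure; rayon's par_split then works straight off the mmap and the intermediate Vec of line slices goes away entirely:

use memmap::Mmap;
use rayon::prelude::*;

fn gen_out_records_direct(mmap: &Mmap) -> Vec<OutRecord> {
    // Skip the header row up front so only data rows reach the parallel split.
    let body_start = mmap.iter().position(|&b| b == b'\n').map_or(0, |i| i + 1);

    mmap[body_start..]
        .par_split(|&b| b == b'\n')       // parallel line splitting, no indexing pass
        .filter(|line| !line.is_empty())  // the trailing newline yields an empty tail
        .filter_map(OutRecord::from_line) // hypothetical constructor, see above
        .collect()
}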

When it comes time to allocate the Vec<OutRecord>, I'm curious to know what @BurntSushi might have to say, but my guess would be that you should just let Rust optimize the growth of the Vec (i.e., I question: Vec::with_capacity(50_000_000) unless you know something ahead of time; see the next point).

On a different but related note, if you do know something about the number of lines/records in the CSV (for instance, is there anything strategic about the 256 chunks? Can you infer anything from the length of the mmap?), I would use that information in two ways:

  1. replace a for loop with a while loop (I only suggest this for large-scale parsing)
  2. use the information to initialize the hosting Vec (related to the previous point; see the sketch below)
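
A rough sketch of that second point, with AVG_ROW_BYTES as a made-up figure you would measure once for the real data (e.g. from wc -c divided by wc -l):

use memmap::Mmap;

// Made-up average row width in bytes; measure it once for the real file.
const AVG_ROW_BYTES: usize = 120;

// Size the line-slice Vec from the mmap length instead of hard-coding 50_000_000.
fn line_buffer(mmap: &Mmap) -> Vec<&[u8]> {
    Vec::with_capacity(mmap.len() / AVG_ROW_BYTES + 1)
}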

All this said, I defer to @BurntSushi, who values fast I/O and was instrumental in much of Rust's I/O capability.

What I would say is that we need a reproducible benchmark so that others can measure. Give us commands to clone a repo and run your benchmark.

@EdmundsEcho
Thanks, I will play around with some of your pointers to see if that helps.

I question: Vec::with_capacity(50_000_000) unless you know something ahead of time

It is a close approximation of the number of lines in the file, based on the output of wc -l; the actual line count is slightly more than 50 million.

if you do know something about the number of lines/records in the CSV (for instance, is there anything strategic about the 256 chunks?

I have some idea about the average size of each row and their total number, but 256 itself was arrived at purely by trial and error. :sweat_smile:

@BurntSushi

What I would say is that we need a reproducible benchmark so that others can measure

Thanks! I will get back to you on this in a few days. The hard part is coming up with a script that can replicate the structure of the CSV I am using but with fake contents.

1 Like