Writing / Reading large structs to a file

I have a very large vector containing many records of a given struct that I have to write to a file, and read back later.

I spent some time in rust forums, and tried some of the solutions proposed.
Using bincode, it took 300s to serialize and 137s to deserialize, with a file size of 1.6 GB.
Using rmp_serde, it took 396s to serialize and 180s to deserialize, with a file size of 1.4 GB.
Using csv, it took 6s to serialize and 7s to deserialize, with a file size of 1.5 GB.

My background being mainly in C, I was quite surprised, as I expected binary formats to be more efficient than a rudimentary text-based format such as csv. I understand, of course, that using bincode and rmp_serde carries an overhead compared to the crude fwrite() and fread() of C. But is there a way to read/write big structures more efficiently in Rust? Maybe in an unsafe way?

Thanks

Depending on how your struct looks, you might be able to use the zerocopy crate.

1 Like

Something looks really suspicious to me. You don't get a difference of two orders of magnitude by accident. The CSV crate is very well-optimized, but I'd expect bincode to be so, too.

How exactly are you performing the serialization and deserialization (i.e., what does the Rust code look like), and how are you measuring the speed?

Reading csv:

let file = File::open(my_adresses).unwrap();
let mut rdr = ReaderBuilder::new()
    .delimiter(b';')
    .from_reader(file);
let now = Instant::now();
for result in rdr.deserialize() {
    let record = result.unwrap();
    addrs.push(record);
}
println!("Database read in {:?}", now.elapsed());

Reading bincode:

let now = Instant::now();
let file_path = "adresses.bin";
let mut file = File::open(file_path).unwrap();
let res = bincode::deserialize_from(&mut file);
println!("{:?}", now.elapsed());

You need to use a BufReader.
bincode makes small reads and that is very costly.
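For reference, a std-only sketch of the pattern (the file name and the fixed-size record format are made up here, standing in for the real struct):

```rust
use std::fs::File;
use std::io::{BufReader, BufWriter, Read, Write};

// Write `n` little-endian u32 records through a BufWriter so the many
// small writes are batched into large syscalls.
fn write_records(path: &str, n: u32) -> std::io::Result<()> {
    let mut writer = BufWriter::new(File::create(path)?);
    for i in 0..n {
        writer.write_all(&i.to_le_bytes())?;
    }
    writer.flush() // push the last buffered block out to the file
}

// Read the records back through a BufReader; the 4-byte reads are
// served from the in-memory buffer instead of one syscall each.
fn read_sum(path: &str) -> std::io::Result<u64> {
    let mut reader = BufReader::new(File::open(path)?);
    let mut buf = [0u8; 4];
    let mut sum = 0u64;
    while reader.read_exact(&mut buf).is_ok() {
        sum += u32::from_le_bytes(buf) as u64;
    }
    Ok(sum)
}

fn main() -> std::io::Result<()> {
    write_records("records.bin", 1_000)?;
    println!("{}", read_sum("records.bin")?);
    Ok(())
}
```

The same wrapping applies to bincode: pass `BufReader::new(file)` to `bincode::deserialize_from` and `BufWriter::new(file)` to `bincode::serialize_into`.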

5 Likes

Yeah, in general, buffering is basically obligatory for squeezing reasonable performance out of big files. The documentation of csv::ReaderBuilder very clearly indicates that it does this automatically for you, because it's the right default. Unfortunately, bincode doesn't appear to do this by itself.

1 Like

Thanks, that solved the problem. With a BufWriter, the structure was written in 2s, 3 times faster than the csv serializer. Same improvement for reading with a BufReader. Unfortunate indeed that bincode does not use BufWriter/BufReader by default.
Thank you very much.

That's a deliberate design decision.

Forcing the caller to provide a BufWriter if they want buffered writes means you don't accidentally add multiple levels of buffering (i.e. a BufWriter<BufWriter<File>>), because unnecessary buffering effectively means you copy the data multiple times.
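A std-only sketch of the difference (file names are arbitrary): the second function shows what callers would effectively get if the library buffered internally and they also wrapped the writer themselves.

```rust
use std::fs::File;
use std::io::{BufWriter, Write};

// Correct: wrap the File exactly once; bytes are staged in one buffer
// and handed to the kernel a block at a time.
fn write_once(path: &str, payload: &[u8]) -> std::io::Result<()> {
    let mut w = BufWriter::new(File::create(path)?);
    w.write_all(payload)?;
    w.flush()
}

// The accidental double-buffering case: two stacked buffers, so every
// byte is copied through both before the file sees it. It still works,
// but wastes a copy of the entire data set.
fn write_twice(path: &str, payload: &[u8]) -> std::io::Result<()> {
    let mut w = BufWriter::new(BufWriter::new(File::create(path)?));
    w.write_all(payload)?;
    w.flush()
}

fn main() -> std::io::Result<()> {
    write_once("once.bin", b"record")?;
    write_twice("twice.bin", b"record")?;
    Ok(())
}
```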

1 Like

I understand that, but it could be mentioned in the documentation of the crate. It would save time, especially when the csv crate does exactly the opposite...
Thx

1 Like

I'm the author of the csv crate.

The csv crate is kind of the odd duck here, and that's why it's specifically mentioned in the docs. In the case of csv, it just really doesn't make much sense to not use a buffer of some kind. At some point, the internals have to decide how big a block to read from the underlying source, and this number is effectively arbitrary. It could be a single byte; the csv parser can handle that. But there's zero advantage to doing so. So it really just makes sense to read as much as one can, up to a certain limit, and then proceed from there.

There are other API design considerations for the csv crate too. But in general, the default or "normal" convention is just to accept a Read and require callers to pass a buffered reader if they want to amortize read syscalls.

6 Likes

I wonder how feasible a Clippy lint about using BufReader would be. Clippy seems insightful enough to pull something like this off.