Fastest way to load a dataset

EDIT:

timing code:


#[test]
fn test_load_ocr_dataset() {
    use std::fs::File;
    use std::io::BufReader;

    let start = std::time::Instant::now();
    let mut f = BufReader::new(File::open("/home/data/ttf-flat/ocr_dataset.bincode").unwrap());
    let ocr_data_set: OcrDataSet = bincode::deserialize_from(&mut f).unwrap();
    let end = std::time::Instant::now();
    println!("done: seconds: {:?}", end.duration_since(start).as_secs());
    assert!(false); // deliberate failure so `cargo test` shows the printed timing
}

cargo test test_load_ocr_dataset --release
---- test_load_ocr_dataset stdout ----
done: seconds: 6

This ends up taking 6 seconds to decode 1.7 GB. I want to reduce this time.

=====

This is a follow-up to Bincode::serialize slow?, where the goal is now to do something faster than bincode (perhaps via mmap?).

So the problem I'm having is that encoding my full dataset takes 6 seconds:

encoding
done: 1.6913079 gigabytes, seconds: 6

(To the astute reader wondering why 0.5 GB -> 1 sec becomes 1.7 GB -> 6 secs: .as_secs() returns a u64 of whole seconds only, so that earlier "1 sec" was probably closer to 1.9.)

Anyway, I need to be able to load this data very fast (it's used in unit tests). Is there a way in Rust to beat bincode by "laying out data structures in a way that is easy to mmap"?

I don't care too much about portability -- as long as I can read back the data on the machine that wrote the data, all is fine. I want to try a solution that just involves a "memory dump" and a "memory load".

Thanks!

I think you might be in the target audience for the abomonation crate. Do bear in mind that memory dump/load serialization is super unsafe in Rust though:

  • Unless your data is repr(C) or you're using the same binary to save and load the data, you're relying on Rust's unspecified struct layout.
  • Unless your data is repr(C) and you have carefully inserted padding bytes yourself so that rustc generates no padding, you're relying on undefined behavior from reading padding bytes.
  • Unless you're using my super collection of experimental PRs, which have sadly been stalled in the github review pipeline for months, you're relying on ref-to-invalid-data and ref-to-misaligned-data undefined behavior.
  • If you're even as much as thinking of implementing the Abomonation trait yourself instead of using mystor's abomonation_derive crate, you need to be extra careful or you'll add even more UB on top of the aforementioned.

As I said, it's pretty unsafe :wink:


I like the abomonation crate name. I'm not sure if this is the solution I want. As can be guessed by "OcrDataSet", I'm doing ML work on this dataset -- which can be notoriously finicky. The very thought that my load-data process may result in undefined behaviour scares me. :slight_smile:

Here's the thing: I can go a step further and get rid of the String, so my entire struct consists of just two levels of Vecs, holding u32s and Vec<f32>s. At this point, I might just write manual to/from Vec<u8> conversion routines.

How does it compare to std::fs::read() and deserializing from memory?

BTW: there's as_millis(). Also be careful about the disk cache skewing speed testing here.


Also, rather than writing your own timing code, consider using a benchmarking framework which will run the code multiple times - I've used criterion for this in the past and found it quite nice to use.