use std::fs::File;
use std::io::BufReader;

#[test]
fn test_load_ocr_dataset() {
    let start = std::time::Instant::now();
    let mut f = BufReader::new(File::open("/home/data/ttf-flat/ocr_dataset.bincode").unwrap());
    let ocr_data_set: OcrDataSet = bincode::deserialize_from(&mut f).unwrap();
    let end = std::time::Instant::now();
    println!("done: seconds: {:?}", end.duration_since(start).as_secs());
    // Deliberately fail so `cargo test` prints the captured output.
    assert!(false);
}
So the problem I'm having is that encoding my full dataset takes 6 seconds:
encoding
done: 1.6913079 gigabytes, seconds: 6
(To the astute reader wondering how 0.5 GB -> 1 sec becomes 1.7 GB -> 6 secs: .as_secs() returns only the whole seconds as a u64, so the "1 second" figure is probably closer to 1.9.)
Anyway, I need to be able to load this data very fast (as it's used in unit tests). Is there a way in Rust to beat bincode by "laying out data structures in a way that is easy to mmap?"
I don't care too much about portability -- as long as I can read back the data on the machine that wrote the data, all is fine. I want to try a solution that just involves a "memory dump" and a "memory load".
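To make the "memory dump" idea concrete, here is a minimal sketch for a flat Vec&lt;f32&gt; using only std (the helper names are mine, not from any crate). It assumes the file is read back on the same machine, with the same endianness, by the same kind of binary that wrote it:

```rust
use std::fs::File;
use std::io::{Read, Write};

// Dump a Vec<f32> as raw bytes. Viewing an f32 slice as bytes is sound
// because f32 has no padding and every byte pattern is a valid u8.
fn dump_f32s(path: &str, data: &[f32]) -> std::io::Result<()> {
    let bytes = unsafe {
        std::slice::from_raw_parts(data.as_ptr() as *const u8, data.len() * 4)
    };
    File::create(path)?.write_all(bytes)
}

// Read the bytes back and copy them into a fresh Vec<f32>. Copying (rather
// than reinterpreting the byte buffer in place) sidesteps alignment issues.
fn load_f32s(path: &str) -> std::io::Result<Vec<f32>> {
    let mut bytes = Vec::new();
    File::open(path)?.read_to_end(&mut bytes)?;
    assert!(bytes.len() % 4 == 0, "file length is not a multiple of 4");
    let mut out = vec![0f32; bytes.len() / 4];
    unsafe {
        std::ptr::copy_nonoverlapping(bytes.as_ptr(), out.as_mut_ptr() as *mut u8, bytes.len());
    }
    Ok(out)
}
```

A true mmap solution would additionally avoid the read-and-copy step, but this already skips all per-element decoding work.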
I think you might be in the target audience for the abomonation crate. Do bear in mind that memory dump/load serialization is super unsafe in Rust, though:

- Unless your data is repr(C), or you're using the same binary to save and load the data, you're relying on struct-layout undefined behavior.
- Unless your data is repr(C) and you have carefully inserted padding bytes yourself so that no padding is generated by rustc, you're relying on padding-bytes-readout undefined behavior.
- Unless you're using my collection of experimental PRs, which have sadly been stalled in the GitHub review pipeline for months, you're relying on ref-to-invalid-data and ref-to-misaligned-data undefined behavior.
- If you're even as much as thinking of implementing the Abomonation trait yourself instead of using mystor's abomonation_derive crate, you need to be extra careful or you'll add even more UB on top of the aforementioned ones.
I like the abomonation crate name. I'm not sure it's the solution I want, though. As you might guess from "OcrDataSet", I'm doing ML work on this dataset, which can be notoriously finicky. The very thought that my data-loading process may result in undefined behaviour scares me.
Here's the thing: I can go a step further and get rid of the String, so my entire struct just consists of two levels of Vecs, of u32s and Vec&lt;f32&gt;s. At this point, I might just write manual &lt;-&gt; Vec&lt;u8&gt; routines.
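If the struct really does flatten to u32s and f32s, those manual routines can even stay in safe code. A sketch for the u32 case (function names are hypothetical; the same pattern works for f32 via f32::to_le_bytes / from_le_bytes):

```rust
// Encode a flat u32 slice as little-endian bytes.
fn u32s_to_bytes(data: &[u32]) -> Vec<u8> {
    let mut out = Vec::with_capacity(data.len() * 4);
    for x in data {
        out.extend_from_slice(&x.to_le_bytes());
    }
    out
}

// Decode the byte buffer back into a Vec<u32>.
fn bytes_to_u32s(bytes: &[u8]) -> Vec<u32> {
    assert!(bytes.len() % 4 == 0, "byte length is not a multiple of 4");
    bytes
        .chunks_exact(4)
        .map(|c| u32::from_le_bytes(c.try_into().unwrap()))
        .collect()
}
```

This won't be as fast as a raw memcpy of the whole buffer, but the compiler tends to optimize these fixed-width loops well, and there is no unsafe to audit.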
Also, rather than writing your own timing code, consider using a benchmarking framework that runs the code multiple times; I've used criterion for this in the past and found it quite nice to use.