What is the absolute fastest way to load 50M f32 from file to memory?


Short Version:

  1. You have complete control over the on-disk file format. 100% your choice.

  2. The in-memory representation needs to be Vec<f32>

  3. What is the fastest way to read 50M f32 into memory?

Long Version:

I have a number of unit tests of the form: 1. read some data, 2. transform it into a 50M elem Vec, 3. do some work

Right now, steps 1 & 2 dominate the unit test time. Instead, I want to do a two-stage process:

stage 1: 1. read some data, 2. transform it into a 50M elem Vec, 3. save this to disk
stage 2 (unit tests): read the pre-formatted 50M elem Vec, do the unit test work

Question now is: what is the optimal on-disk format, and what is the optimal way to read it?

I’m on Linux x64. This does NOT need to work on any other platform.



Also, here is the output of

free -h
              total        used        free      shared  buff/cache   available
Mem:            94G        2.4G         88G        255M        3.3G         90G
Swap:          4.0G          0B        4.0G

I’m okay with creating a custom ramdisk (e.g. tmpfs), so everything is in memory (even when “on-disk”).
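
For concreteness, if the answer turns out to be “just dump the raw bytes in native endianness”, stage 1 would be a sketch roughly like this (save_f32s is a placeholder name):

use std::fs::File;
use std::io::Write;

fn save_f32s(path: &str, data: &[f32]) -> std::io::Result<()> {
    // View the f32 buffer as raw bytes and write it out in one call.
    // Native endianness is fine here: the file never leaves this machine.
    let bytes = unsafe {
        std::slice::from_raw_parts(data.as_ptr().cast::<u8>(), data.len() * 4)
    };
    File::create(path)?.write_all(bytes)
}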



Absolute fastest could be to memmap it.

Other than that, just loading it without extra copies or reallocations should be reasonably fast:

let mut vec = vec![0f32; 50_000_000];
// SAFETY: the Vec's buffer is valid for exactly 50_000_000 * 4 bytes.
let data = unsafe { std::slice::from_raw_parts_mut(vec.as_mut_ptr() as *mut u8, 50_000_000 * 4) };
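
Fleshed out into something runnable (load_f32s is a placeholder; this assumes the file holds the raw f32 bytes, as in the writer sketch above):

use std::fs::File;
use std::io::Read;

fn load_f32s(path: &str, n: usize) -> std::io::Result<Vec<f32>> {
    let mut vec = vec![0f32; n];
    // Reinterpret the Vec's buffer as bytes so the file is read straight
    // into its final location, with no intermediate buffer or reallocation.
    // SAFETY: the buffer is valid for n * 4 bytes, and every bit pattern
    // is a valid f32.
    let bytes = unsafe {
        std::slice::from_raw_parts_mut(vec.as_mut_ptr().cast::<u8>(), n * 4)
    };
    File::open(path)?.read_exact(bytes)?;
    Ok(vec)
}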


People say that for sequential access mmap is not faster: https://lemire.me/blog/2012/06/26/which-is-fastest-read-fread-ifstream-or-mmap/



If you want to test it on your system, compare rg something-that-does-not-match really-large-file --mmap with rg something-that-does-not-match really-large-file --no-mmap. In my experience, results vary depending on the environment! Make sure to control for I/O (e.g. whether the file is already in the page cache), depending on what you want to measure.



An advantage of mmap here would be a reduced memory footprint across multiple parallel instances of your tests: they can all share the underlying physical memory and just have their own mapping. If each test instead read into its own buffer/Vec, the data would be duplicated. This all assumes the memory is read-only.

This may not matter, but it’s something to consider. I agree with others that you should benchmark the different approaches.
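
For reference, a minimal read-only mapping sketch, assuming the memmap2 crate (any mmap wrapper or a raw libc::mmap call would work the same way):

use std::fs::File;
use memmap2::Mmap;

fn map_f32s(path: &str) -> std::io::Result<Mmap> {
    let file = File::open(path)?;
    // SAFETY: the file must not be truncated or modified while mapped.
    unsafe { Mmap::map(&file) }
}

fn as_f32s(bytes: &[u8]) -> &[f32] {
    // SAFETY: every bit pattern is a valid f32. mmap returns page-aligned
    // memory, so the prefix is empty; the suffix is empty as long as the
    // file length is a multiple of 4.
    let (prefix, floats, suffix) = unsafe { bytes.align_to::<f32>() };
    assert!(prefix.is_empty() && suffix.is_empty());
    floats
}

Note that this yields a &[f32] rather than a Vec<f32>, so tests that truly need an owned Vec would still have to copy.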



The absolute fastest way to load lots of data is to include_bytes!("data.raw") it while you compile your application, and then typecast the &[u8; 4*50_000_000] to a &[f32; 50_000_000]. This way, it gets loaded automatically and with negligible overhead while your application loads.

You can’t change the data after you’ve compiled (not easily, anyway), but if it’s something like a neural net model, it might be fine.
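
A sketch of that cast: include_bytes! only guarantees byte alignment, so force the alignment with a wrapper type first (data.raw as above):

// include_bytes! produces a &[u8; N] with alignment 1, so wrap the bytes
// in an align(4) type before viewing them as f32.
#[repr(C, align(4))]
struct Aligned([u8; 4 * 50_000_000]);

static DATA: Aligned = Aligned(*include_bytes!("data.raw"));

fn floats() -> &'static [f32] {
    // SAFETY: DATA is 4-byte aligned and every bit pattern is a valid f32.
    unsafe { std::slice::from_raw_parts(DATA.0.as_ptr().cast::<f32>(), DATA.0.len() / 4) }
}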



Be careful about alignment too!