What is the absolute fastest way to load 50M f32 from file to memory?


#1

Short Version:

  1. You have complete control over the on-disk file format. 100% your choice.

  2. The in-memory representation needs to be Vec&lt;f32&gt;

  3. What is the fastest way to read 50M f32 into memory?

Long Version:

I have a number of unit tests of the form: 1. read some data, 2. transform it into a 50M elem Vec, 3. do some work

Right now, steps 1&2 are dominating the unit test time. Instead, I want to do a two stage process:

stage 1 (preprocess): 1. read some data, 2. transform it into a 50M elem Vec, 3. save this to disk
stage 2 (unit tests): read the pre-formatted 50M elem Vec, do unit test work

Question now is: what is the optimal on-disk format, and what is the optimal way to read it?

I’m on Linux x64. This does NOT need to work on any other platform.
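One obvious candidate for the on-disk format is just the raw native-endian bytes of the Vec's buffer, so reading back is a single read_exact into a pre-sized Vec. A minimal sketch of both stages (function names save_f32s/load_f32s are made up; std only, assumes the file is written and read on the same little-endian machine):

```rust
use std::fs::File;
use std::io::{Read, Result, Write};
use std::slice;

// Stage 1: dump the Vec's buffer to disk as raw native-endian bytes.
fn save_f32s(path: &str, data: &[f32]) -> Result<()> {
    let bytes = unsafe {
        slice::from_raw_parts(data.as_ptr() as *const u8, data.len() * 4)
    };
    File::create(path)?.write_all(bytes)
}

// Stage 2: read the file straight into a pre-sized Vec<f32>, no
// intermediate buffer. Every byte pattern is a valid f32, so viewing
// the Vec's buffer as &mut [u8] for the read is sound.
fn load_f32s(path: &str, n: usize) -> Result<Vec<f32>> {
    let mut vec = vec![0f32; n];
    let bytes = unsafe {
        slice::from_raw_parts_mut(vec.as_mut_ptr() as *mut u8, n * 4)
    };
    File::open(path)?.read_exact(bytes)?;
    Ok(vec)
}
```

This avoids any parsing or per-element conversion; the cost is essentially one sequential read of 200 MB.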


#2

Also, here is the output of

free -h
              total        used        free      shared  buff/cache   available
Mem:            94G        2.4G         88G        255M        3.3G         90G
Swap:          4.0G          0B        4.0G

I’m okay with creating a custom ramdisk, so everything is in memory (even when “on-disk”).
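For reference, a tmpfs mount is the usual way to do this on Linux (the mount point /mnt/ramdisk is just an example; pick any path):

```shell
# Assumes /mnt/ramdisk as the mount point; size is an upper bound,
# memory is only consumed by what you actually store there.
sudo mkdir -p /mnt/ramdisk
sudo mount -t tmpfs -o size=1G tmpfs /mnt/ramdisk
# 50M f32 = 200 MB, so 1G leaves plenty of headroom.
```

That said, with 94G of RAM the kernel page cache will typically keep a 200 MB file entirely in memory after the first read anyway, so a tmpfs may not buy much for repeated test runs.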


#3

Absolute fastest could be to memmap it.

Other than that, just loading it without extra copies or reallocations should be reasonably fast:

let mut vec = vec![0f32; 50_000_000];
let data = unsafe {
    slice::from_raw_parts_mut(vec.as_mut_ptr() as *mut u8, 50_000_000 * 4)
};
file.read_exact(data).unwrap();

#4

People report that for sequential access memmap is not faster: https://lemire.me/blog/2012/06/26/which-is-fastest-read-fread-ifstream-or-mmap/


#5

If you want to test it on your system, compare rg something-that-does-not-match really-large-file --mmap with rg something-that-does-not-match really-large-file --no-mmap. In my experience, results may vary depending on your environment! Make sure to control for I/O, depending on what you want to measure.


#6

An advantage of mmap here would be reduced memory footprint across multiple parallel instances of your tests - they can all reuse the underlying physical memory, and just have their own mapping. If each test were to read into its own buffer/Vec, it would “duplicate” the data. This all assumes that the memory would be read-only.

This may not matter but something to consider. I agree with others that you should test the different approaches.


#7

The absolute fastest way to load lots of data is to include_bytes!("data.raw") it while you compile your application, and then typecast the &[u8; 4*50_000_000] to a &[f32; 50_000_000]. This way, it gets loaded automatically and with negligible overhead while your application loads.

You can’t change the data after you’ve compiled (not easily, anyway), but if it’s something like a neural net model, it might be fine.


#8

Be careful about alignment too!
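Indeed - include_bytes! only guarantees byte alignment, so the cast from #7 needs a check. One way is slice::align_to, which makes the alignment assumption explicit at runtime (the helper name bytes_to_f32s is made up; in the real use the bytes would come from include_bytes!("data.raw"), but any &[u8] works the same way):

```rust
fn bytes_to_f32s(bytes: &[u8]) -> &[f32] {
    // align_to splits the slice into an unaligned prefix, an aligned
    // middle, and a leftover suffix; for a 4-byte-aligned input whose
    // length is a multiple of 4, both ends are empty.
    let (prefix, floats, suffix) = unsafe { bytes.align_to::<f32>() };
    assert!(
        prefix.is_empty() && suffix.is_empty(),
        "data is not 4-byte aligned (or its length is not a multiple of 4)"
    );
    floats
}
```

If the assert fires in practice, one fix is to wrap the embedded bytes in a #[repr(align(4))] struct to force the alignment at compile time.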