What is the absolute fastest way to load 50M f32 from file to memory?

Short Version:

  1. You have complete control over the on-disk file format. 100% your choice.

  2. The in-memory representation needs to be a Vec<f32>

  3. What is the fastest way to read 50M f32 into memory ?

Long Version:

I have a number of unit tests of the form: 1. read some data, 2. transform it into a 50M elem Vec, 3. do some work

Right now, steps 1&2 are dominating the unit test time. Instead, I want to do a two stage process:

stage 1: 1. read some data, 2. transform it into a 50M elem Vec<f32>, 3. save this to disk
stage 2 (unit tests): read the pre-formatted 50M elem Vec<f32>, do the unit test work

Question now is: what is the optimal on-disk format, and what is the optimal way to read it?

I'm on Linux x64. This does NOT need to work on any other platform.

Also, here is the output of

free -h
              total        used        free      shared  buff/cache   available
Mem:            94G        2.4G         88G        255M        3.3G         90G
Swap:          4.0G          0B        4.0G

I'm okay with creating a custom ramdisk, so everything is in memory (even when "on disk").

Usually people recommend memmap for such things, but memory mapping comes with uneven performance (caching beyond your control) and tough gotchas around concurrency and error handling.

Other than that, just loading it with one syscall, without copies or reallocations should be reasonably fast:

let mut buf = Vec::with_capacity(50_000_000 * 4);
reader.by_ref().take(50_000_000 * 4).read_to_end(&mut buf)?;
// https://rust-lang.github.io/rfcs/2835-project-safe-transmute.html
// https://lib.rs/bytemuck
// Safety: buf must be 4-byte aligned (a Vec<u8> allocation is not guaranteed to be).
let slice = unsafe { std::slice::from_raw_parts(buf.as_ptr().cast::<f32>(), buf.len() / 4) };

People say that for sequential access mmap is not faster: "Which is fastest: read, fread, ifstream or mmap?" – Daniel Lemire's blog

If you want to test it on your system, compare rg something-that-does-not-match really-large-file --mmap with rg something-that-does-not-match really-large-file --no-mmap. In my experience, results may vary depending on your environment! Make sure to control for I/O, depending on what you want to measure.


An advantage of mmap here would be reduced memory footprint across multiple parallel instances of your tests - they can all reuse the underlying physical memory, and just have their own mapping. If each test were to read into their own buffer/Vec, it would “duplicate” the data. This all assumes that memory would be read-only.

This may not matter but something to consider. I agree with others that you should test the different approaches.


The absolute fastest way to load lots of data is to include_bytes!("data.raw") it while you compile your application, and then typecast the &[u8; 4*50_000_000] to a &[f32; 50_000_000]. This way, it gets loaded automatically and with negligible overhead while your application loads.

You can't change the data after you've compiled (not easily, anyway), but if it's something like a neural net model, it might be fine.


Be careful about alignment too!