Mmap vs Vector for large file load and search (string data)

I’m loading giant text files into a program’s memory and then using multithreading to quickly fly through them and perform some regex and other custom checks (ones that cannot be done with ripgrep otherwise I’d just use that ofc).

Say I have a 2GB file of text… What are the pros and cons of just loading the whole thing into a Vec of Strings vs using mmap? I’ve heard that mmap may use more memory but allow faster access? I’m actually just scanning sequentially through the data, although with multiple threads, so I’m not sure which would be better.

Please forgive my ignorance. Thanks.

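This does depend on the access pattern. mmap tends to be mostly a performance win for random access, though you can give the kernel hints such as madvise(MADV_SEQUENTIAL), which tells it to expect sequential access so it can read ahead more aggressively and free pages soon after they have been used.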

I don't think it's true that mmap uses more memory than loading the whole file. If anything, it avoids allocating a buffer in the application and copying the data into it, and instead lets the OS manage the data in the page cache, where it can evict pages that are no longer needed.
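For what it's worth, the scanning side can be written against a plain &[u8], so it doesn't care whether the bytes live in a Vec<u8> (e.g. from std::fs::read) or in a mapping. Here is a minimal sketch, assuming std only. The names split_at_newlines and count_matching_lines are made up for illustration, and the substring test is just a stand-in for your regex and custom checks.

```rust
use std::thread;

/// Split `data` into roughly `parts` chunks. Every chunk except possibly the
/// last ends right after a b'\n', so no line straddles two chunks.
fn split_at_newlines(data: &[u8], parts: usize) -> Vec<&[u8]> {
    let mut chunks = Vec::new();
    let target = (data.len() / parts.max(1)).max(1);
    let mut start = 0;
    while start < data.len() {
        let mut end = (start + target).min(data.len());
        // Grow the chunk until it ends on a newline (or at end of data).
        while end < data.len() && data[end - 1] != b'\n' {
            end += 1;
        }
        chunks.push(&data[start..end]);
        start = end;
    }
    chunks
}

/// Count the lines that contain `needle`, scanning each chunk on its own thread.
/// Assumption: `needle` is non-empty (`windows(0)` would panic).
fn count_matching_lines(data: &[u8], needle: &[u8], threads: usize) -> usize {
    assert!(!needle.is_empty(), "needle must not be empty");
    let chunks = split_at_newlines(data, threads);
    // Scoped threads may borrow `data` and `needle` without Arc or copying.
    thread::scope(|s| {
        let handles: Vec<_> = chunks
            .into_iter()
            .map(|chunk| {
                s.spawn(move || {
                    chunk
                        .split(|&b| b == b'\n')
                        .filter(|line| line.windows(needle.len()).any(|w| w == needle))
                        .count()
                })
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().unwrap())
            .sum::<usize>()
    })
}
```

You would call it as count_matching_lines(&bytes, b"error", 8). Compared with a Vec of Strings, a single byte buffer avoids one heap allocation per line, and switching to a mapping later only changes where the &[u8] comes from.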


I’m glad you brought up the OS page cache and buffering. That was a very important point that I should have realized, but it’s been a while since I worked at the OS level, so I forgot about it. I’ve also noticed that mmap always seems to be “unsafe” in Rust. I wonder if that would be opening a can of worms, since I haven’t done any unsafe stuff in Rust yet 😛 Granted, I come from C, but still.

Same here; I’ve never used mmap in Rust and didn’t know about the issues surrounding it (see the thread “How unsafe is mmap?”). Apparently the main reason it’s considered unsafe is not any inherent unsafety in manipulating the page map directly, but that it cannot be ruled out that another process modifies or truncates the underlying file while it is mapped, which would change the bytes behind a supposedly immutable &[u8].
A whole can of worms, yes…


There are a few pros and cons:

  • For small files, reading directly is often faster than mmap. But 2GB is way above the threshold where mmap will usually be faster.
  • mmap won’t work with pipes, and there may be some weird filesystems where it doesn’t work on files. If you care about that, you may want to implement a read-based fallback path.
  • mmapping a 2GB file does not require 2GB of RAM (since the OS can bring pages in and out of RAM as needed), but it does require 2GB of contiguous address space, so don’t expect it to work on a 32-bit system. Of course, if you only plan to run the program on particular systems, that’s not a problem. Then again, if the alternative is reading all the data into memory at once, most 32-bit systems won’t have enough RAM to do the job anyway.
  • With Rust, be careful about the difference between &str (which requires the bytes it points to to be valid UTF-8) and &[u8] (which does not). Note that it is possible to do the UTF-8 check in place, without having to make a copy of the string data. On the other hand, for many purposes it may be faster to just stick with &[u8] and not bother validating.
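
To illustrate the last point, here is a minimal sketch of the in-place UTF-8 check, assuming std only. The helper names as_text and valid_prefix are made up for illustration. std::str::from_utf8 only inspects the bytes and hands back a &str borrowing the same memory, so nothing is copied.

```rust
use std::str;

/// Borrow `bytes` as `&str` if the whole buffer is valid UTF-8.
/// The returned `&str` points at the same memory; no copy is made.
fn as_text(bytes: &[u8]) -> Option<&str> {
    str::from_utf8(bytes).ok()
}

/// Return the longest valid UTF-8 prefix, for input that may contain
/// invalid bytes somewhere in the middle.
fn valid_prefix(bytes: &[u8]) -> &str {
    match str::from_utf8(bytes) {
        Ok(s) => s,
        // `valid_up_to` is the length of the prefix that did validate,
        // so re-checking that prefix cannot fail.
        Err(e) => str::from_utf8(&bytes[..e.valid_up_to()]).unwrap(),
    }
}
```

The check still has to touch every byte once, which is why skipping it and staying with &[u8] can be faster if your checks work on bytes anyway.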