[Solved] How to efficiently scan big binary streams/files?


#1

I would like to implement a “strings”-like command-line tool in Rust. It should work very fast on big binary files and extract valid strings (e.g. Unicode). The stream can be 32-bit or 64-bit, big-endian or little-endian.

I am a beginner in Rust and hope some expert can help:

  1. I need a window of 1024 bytes over the stream/file that I can advance byte by byte until the end of the stream.
  2. Inside the window I search for valid string encodings meeting some additional criteria and print the string.
  3. When done, I advance the window by 1 byte and go to 2 (see the sketch below).
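
Something like the following loop is what I have in mind (a minimal sketch, assuming the data is already available as a byte slice; looks_like_string is a hypothetical stand-in for the real check in step 2):

    const WINDOW: usize = 1024;

    // Hypothetical stand-in for step 2; the real encoding checks go here.
    fn looks_like_string(_window: &[u8]) -> bool {
        false
    }

    fn scan_windows(bytes: &[u8]) {
        let mut p = 0;
        while p + WINDOW <= bytes.len() {
            let window = &bytes[p..p + WINDOW]; // step 1: a no-copy 1024-byte view
            if looks_like_string(window) {
                // step 2: print the string found in this window
            }
            p += 1; // step 3: advance the window by one byte
        }
    }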

Speed is critical, which is why copying must be kept to a minimum. How do I implement 1.?
Can I avoid an additional ring-buffer?
If not, how do I implement one?
Is this possible without unsafe code? Any ideas?

Code extracts are welcome.


#2

You might want to take a look at nom for some ideas.


#3

I think you are going to need to build your own ring-buffer that is twice the size of your disk read size. Once you drain half the buffer, you can read the next block into the buffer. I think that gets you around the copy of bytes from an input buffer to the processing ring-buffer.
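
Roughly like this, I imagine (a sketch of that idea; READ_SIZE is an assumed value, and rather than a true ring with wrap-around it shifts the unread tail to the front, so every window stays a contiguous slice):

    use std::io::Read;

    const READ_SIZE: usize = 64 * 1024; // assumed disk read size
    const WINDOW: usize = 1024;         // window size from post #1

    fn scan<R: Read>(mut src: R) -> std::io::Result<()> {
        let mut buf = vec![0u8; 2 * READ_SIZE];
        let mut filled = 0; // bytes currently valid in buf
        let mut pos = 0;    // start of the current window
        loop {
            if pos + WINDOW > filled {
                // Not enough bytes left for a full window:
                // shift the unread tail down and refill the rest.
                buf.copy_within(pos..filled, 0);
                filled -= pos;
                pos = 0;
                let n = src.read(&mut buf[filled..])?;
                if n == 0 {
                    break; // EOF: remaining tail is shorter than one window
                }
                filled += n;
                continue;
            }
            let _window = &buf[pos..pos + WINDOW]; // scan this view for strings
            pos += 1; // advance the window by one byte
        }
        Ok(())
    }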


#4

While Xi Editor (https://github.com/google/xi-editor) is an editor, it also aims to load data very fast (and is written in Rust), so it might be interesting to look at as well. The data structures in Xi may not be exactly what you want in your use case, but still.


#5

I think that instead of a buffer window, it is better to use a memory map.


#6

Thank you all for your feedback so far. I will investigate your hints tomorrow.


#7

Just mmap the whole file; it does not use any actual RAM. The OS will deal with paging it in as part of the unified buffer cache. Then iterate through it and munmap when finished.


#8

It might be worth noting that GNU grep actually removed support for mmapping files based on the premise that it was slower. However, my own experience doesn’t match that, and things may have changed in the past six years.


#9

My own benchmarks on Linux show a significant speedup for my C++ parser combinator library when using a FileVector, which is an implementation of the C++ vector interface over mmap, including mutability and extension. I also think mapping a file to a vector is a much better model than a stream in the age of fast-seeking SSDs. Is there something like this for Rust? I don’t think Rust has a trait like RandomAccessIterator, but effectively there should be a trait for Vec, and you should always write code in terms of the trait; that way different implementations can easily be substituted, such as an mmapped one or just a different implementation.
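
As a rough sketch of that idea in Rust (my own mapping of it, using only the standard AsRef trait): write the scanner against AsRef<[u8]> and any byte container can be substituted, including a memory-mapped one that exposes &[u8]:

    // The scanner never knows where the bytes live: a Vec<u8>, a boxed
    // slice, or an mmapped region exposing &[u8] all work unchanged.
    fn count_zero_bytes<B: AsRef<[u8]>>(data: &B) -> usize {
        data.as_ref().iter().filter(|&&b| b == 0).count()
    }

    fn main() {
        let v: Vec<u8> = vec![0, 1, 2, 0];
        println!("{}", count_zero_bytes(&v)); // prints 2
    }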


#10

Thank you all for your ideas!

Just mmap the whole file; it does not use any actual RAM. The OS will deal with paging it in as part of the unified buffer cache. Then iterate through it and munmap when finished.

@keean. Are you sure? Does the following code reflect this idea?

    use memmap::{Mmap, Protection}; // assuming the memmap crate, whose API matches these calls
    use std::str;

    const CHUNK_SIZE: usize = 1024; // the 1024-byte window from step 1

    fn foo1() -> std::io::Result<()> {
        let file_mmap = Mmap::open_path("./src/book.txt", Protection::Read).unwrap();
        let bytes: &[u8] = unsafe { file_mmap.as_slice() };
        let mut p = 0;
        // `p < bytes.len() - CHUNK_SIZE` would underflow for files smaller
        // than CHUNK_SIZE; comparing with addition avoids that.
        while p + CHUNK_SIZE <= bytes.len() {
            // panics on invalid UTF-8; fine for this plain-text test
            println!("{}:\t{}", p, str::from_utf8(&bytes[p..p + CHUNK_SIZE]).unwrap());
            p += CHUNK_SIZE;
        }
        Ok(())
    }

Note there is no explicit flush() (?). Do I need to flush explicitly in foo2()?
unix.rs implements impl Drop for MmapInner, where libc::munmap is called. I wonder if this code is OK. I absolutely need to avoid holding the whole file in memory; the files can be very big.


#11

When you memory-map a file, the operating system handles loading the contents into memory for you transparently. If I’m allowed to simplify, think of Mmap::open_path as doing nothing other than “hey OS, look at file X, I might need you to take some of that file and stuff it into memory for me.” When you get a slice of bytes (your file_mmap.as_slice() call) and then try to access the bytes in that slice, you’ll generate a page fault because none of those bytes are in memory (maybe). The OS notices the page fault and reacts by reading a chunk of bytes from that file into the memory your slice points at, and your program is none the wiser. A crucial detail here is that the contents of your file do not live on your heap; they live somewhere else (the page cache).

If your file is so big that it can’t fit in memory, that’s fine. The OS will evict regions of the file that are in memory but aren’t being used to make room for the rest.


#12

I don’t really understand that code; you should treat the mapped memory just as an array (see my C++ vector code as an example). You can mmap a file bigger than physical RAM, so it obviously does not load it all into memory. What mmap does is allocate virtual address space for the file and use the OS buffer cache to cache pages in memory. It automatically handles writing pages back. To ensure data is written you need to use fsync, but I would keep this only for when it is absolutely needed, as it will cause slowdown.
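
To illustrate the flush question (a sketch only, assuming the memmap crate's Mmap/Protection API that the snippets above appear to use, where flush() corresponds to msync): a read-only scan has nothing to flush, while writes through a writable mapping can be synced explicitly:

    use memmap::{Mmap, Protection};

    // Changes made through a writable mapping sit in the page cache; the
    // OS writes them back lazily. flush() (msync) forces them out now.
    fn touch_first_byte() -> std::io::Result<()> {
        let mut map = Mmap::open_path("./book.txt", Protection::ReadWrite)?;
        unsafe { map.as_mut_slice() }[0] = b'X'; // mutate through the mapping
        map.flush() // explicit sync; a Protection::Read mapping needs none
    }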


#13

I checked what you said with lldb:
Mmap::open() requests a window of a given length into a file and returns a pointer to the beginning of that window. Mmap::open() fills as many memory pages as needed with the file data; the memmap crate thereby abstracts away the exact page boundaries.

Furthermore, fn open(file: &File, prot: Protection) -> Result<Mmap> always loads the whole file into memory.

If your file is so big that it can’t fit in memory, that’s fine. The OS will evict regions of the file that are in memory but aren’t being used to make room for the rest.

I don’t have an idea yet how to test this. Open a file bigger than RAM+swap?

Let’s assume transparent paging really works: all I have to do is call Mmap::open() exactly one time with the total length of the file. Then I call file_mmap.as_slice() and use the slice to walk through the file from beginning to end. And this works even if the file does not fit in memory (please confirm that I understood correctly).


#14

Yes your understanding is correct…mostly. There are some gotchas.

mmap first needs a contiguous block of free memory in your virtual address space in order to map in the whole file. So if the file is huge and your program has been running for a while, the address space might be fragmented and mmap might not be able to grab enough of the virtual address space to map the whole file in.

This should be rare on a 64-bit architecture, and more of an issue if you are going to also support 32-bit machines. Another thing to worry about is that under 32-bit Windows the upper half of the address space is reserved for the kernel, so you only have half of the virtual address space to use. I’m not sure how the address space is split on 64-bit machines on Windows or Linux.

Lastly, on 64-bit I don’t believe the whole 64-bit address space is available. I’m not sure if it’s because of CPU address lines or maybe the OS page table entry format or something, but there seems to be an upper limit on the amount of the address space you can access that is less than 64 bits would imply.

All that being said, you might want to use mmap more as a window into the file with a fixed upper size, so that you don’t run into these issues.


#15

Most 64-bit OSes use 48 bits for addressing, which is 256TB of space. So unless your file is coming from a huge data store, virtual address space is not going to be a problem. If we imagine you have 128GB of RAM, fragmentation will be limited to that region of the virtual address space. Allowing a generous 1TB for it, you should be able to reliably map 255TB files.

On 32-bit systems it is going to be more problematic, because the virtual address space is only 4GB; all of it can easily be backed by RAM, and fragmentation will be a problem.


#16

On 32-bit systems it is going to be more problematic, because the virtual address space is only 4GB; all of it can easily be backed by RAM, and fragmentation will be a problem.

All that being said, you might want to use mmap more as a window into the file with a fixed upper size, so that you don’t run into these issues.

Then it is better to do it this way:

        use memmap::{Mmap, Protection}; // assuming the memmap crate used above
        use std::fs::File;
        use std::io::{self, Write};

        fn foo2() -> std::io::Result<()> {
            let f = try!(File::open("./book.txt"));
            let l = try!(f.metadata()).len() as usize;
            let mut p = 0;
            while p < l {
                let len = std::cmp::min(CHUNK_SIZE, l - p); // clamp final window at EOF
                let mmap = Mmap::open_with_offset(&f, Protection::Read, p, len).unwrap();
                print!("\n{}:\t", p);
                io::stdout().write_all(unsafe { mmap.as_slice() }).unwrap();
                p += CHUNK_SIZE;
            }
            Ok(())
        }

BTW: Do I need to flush explicitly?
unix.rs implements impl Drop for MmapInner where libc::munmap is called.


#17

If by “64-bit” you mean current AMD/Intel chips, they use a 4-level page table with 512 8-byte entries in each table level and 4KB final pages: 4 levels × 9 bits per level + 12 bits of page offset = 48 bits, i.e. 2^48 total addressable bytes. The hardware is set up so that 2^47 of those bytes have positive addresses and 2^47 have negative addresses; the middle portion of the address space from 2^47 to 2^64-2^47 does not correspond to any page table path and is unconditionally unaddressable. (They could have provided 5-level page tables, but that increases the cost and/or complexity of a TLB miss.) All operating systems I am familiar with for amd64 use the negative half of the address space for kernel addresses and the positive half for user addresses.

I’m not familiar with the details of virtual memory on other 64-bit CPUs. I think ARMv8 uses the same basic page table structure as amd64 with the default 4KiB page size, but there are a number of other options defined.

32-bit x86 hardware uses 2-level page tables with 1024 4-byte entries per level (2 levels × 10 bits + 12 bits of page offset = 32 bits), allowing all 2^32 addresses to be mapped. Linux defaults to a 3GB user / 1GB kernel split, which can be overridden by setting CONFIG_VMSPLIT_2G (or similar) at kernel compile time; Windows does 2GB / 2GB by default, but I think there’s a /3GB boot-time option to get the Linux behavior.


#18

Further reading:


#19

This analysis ignores “huge pages”, which are useful for this kind of mapping.

Personally, I wouldn’t bother worrying about 32-bit systems unless you have a specific 32-bit target in mind. I would write for a modern 64-bit target and use a linear mapping of the whole file, because it will be faster and it simplifies the code, as you can hold pointers into the file, etc.


#20

Yes it does, and yes they are. AFAIK they don’t have an effect on the size of the user VA space on x86, amd64, or aarch64, though, since they just chop off the bottom level or two of page tables (2MB or 1GB pages). That makes for fewer TLB misses, but the size of files you can mmap (effectively unlimited on 64-bit) is unchanged.