Searching for either an ascii string or a byte pattern in very large files

Hello.

Im a bit of a beginner in rust, i know some of the basics of reading files etc. But im trying to figure out the following.

I need to:

  • search for the first occurrence of a particular byte pattern or a string converted into its ascii representation in an open file.
  • As soon as i find the first occurrence i want to break and return its index.
  • The files are very, very large up to 500/600gb.

I have never written anything like this before so im not really sure where to get started. Usually when i have been able to do searches you can read the entire file into memory and then search, but this time i can't and i guess need to read it into chunks?

For instance how should i deal with the intersection between two chunks? If there is a crate that can do this, its welcome, but im open to writing it myself if needed to learn.

The problem is i have no idea how to think, or where to start? Anyone that could help a noob with some guidance?

Haven't done something like this myself yet, but the aho-corasick crate seems to support files through the stream_find_iter method., Give it a BufReader wrapped file and you should be good to go. Edit: Apparently, the docs recommend not to do extra buffering since an internal buffer is already used by aho-corasick.

If the file you're searching is line oriented and the pattern you're looking for is known to occur within a single line, then the solution is very simple. Open a File, wrap it in a std::io::BufReader and call its lines method. Then search each line. You could easily report the correct offset by keeping track of the starting offset of each line.

If your problem doesn't fit those constraints, then the nexy simplest solution would be to use the memmap crate to get a &[u8] from the File that isn't on the heap and then search that. std doesn't expose a substring routine that works on bytes, but the bstr crate does.

Failing that, the aho-corasick crate provides stream_find_iter which works on arbitrary streams with constant memory and doesn't make any assumptions about lines. While Aho-Corasick is primarily used for multiple pattern search, it can be used for single pattern search too.

4 Likes

Sadly the file is not line oriented so first option is not feasable, the files contain random data and i need to find the first occurrence of "magic bytes" in these very large files.

Your second option, this might be a silly question but when you say from the File that isn't on the heap and then search that what exactly are you meaning? if i read in parts/sections, won't i get problems when searching in the overlapping parts? i wont be able to read in these on the stack, im guessing they are too big for that.

The last option seems the way to go for me, as it should be reading in the file lazy right? and no assumptions about lines seems to be correct too for me.

AFAICT, this is indeed the case

1 Like

Memory map the file. This gives you a byte slice that contains the entire contents of the file. It is not on the heap. The OS will page it into memory for you automatically. (And also page things out of memory as needed.) There is no need to worry about sections or anything else.

3 Likes

I'm having a hard time understanding from its documentation why the map method is unsafe in memmap and what safety conditions one has to meet.

I think that's a big can of worms that could easily derail this thread. I would recommend opening a new thread or searching. It has definitely been discussed before either here or on irlo. To answer briefly, it's unsafe because another process can mutate the &[u8]. What the caller needs to uphold is to guarantee that the mutation doesn't happen. The tricky part is that this is hard to guarantee in practice. And in practice, it's not clear what bad things can happen if the caller doesn't uphold it. In general, file backed memory maps like this just don't seem to be able to be safely abstracted over in Rust.

4 Likes