Zero copy lexing w/ Buffered IO

Trying to translate techniques I've used in C++ before to Rust.

First I want to stream in a file doing buffered reads (no mmap, no loading the whole file into memory). No problem, Rust has std::io::BufRead and std::io::BufReader.

Alright but I'm reading text and need to match individual Unicode characters. No problem, the utf8_chars crate provides BufReadCharsExt which extends BufRead to have a chars iterator.

Now my problem: once my lexer finds a whole token, I want it to pass that token as a &str to my parser, where the &str refers directly to bytes in the IO buffer.

  1. chars makes an iterator that gives io::Result<char> which is a copy out of the buffer and into a char, which is not desired.

  2. Even if I ignore that copy, I don't see a way to get a &str to the underlying character(s). I want to check if I've hit whitespace and if I haven't keep expanding my &str until it contains a whole token.

  3. A token might be long (e.g. if the data were JSON a string might be one arbitrarily long token). This means the IO buffer may need to grow. I need a way to express "load more without actually counting the existing bytes as consumed."

Is there a good existing way to get this combo or do I need to roll my own?

I don't know how to square these two. That &str has to reference something, and it can't reference the buffer in the BufReader, because that buffer gets reused; the reuse is exactly what you're counting on in order to not "load the whole file into memory".

Or is the parser going to make its own copy of the &strs that you push into it, so the &str only needs to live for the duration of that push into the parser code?

Basically, this is easy if you just fs::read_to_string the whole thing, because then you can just store references into that buffer everywhere. But if you don't want to do that, it's unclear to me where you're planning on storing the data for an arbitrarily long token.


I should have mentioned I only need to keep a reference to one token at a time. So if the arbitrarily long str doesn't fit in the IO buffer, in that case only I would allocate a new bigger buffer with the copy of the str so far (basically a Vec capacity increase), and keep expanding the slice until I hit whitespace. Otherwise I never copy.

So the manual version of this is something like:

  • Stream IO_SIZE bytes into Vec<u8> which starts with at least IO_SIZE capacity
  • Parse as much utf8 out of the buffer as possible, getting a &str pointing into the buffer
  • Loop through the &str we got looking for whitespace. We start with an empty slice into it. For every non-whitespace character we get, expand our slice. When we hit whitespace, pass the current slice to the parser and reset the slice to empty.
  • When we reach the end of the &str mark all whitespace bytes and bytes we already passed to the parser as read. Remaining unparsed bytes get memmoved to the front of the buffer.
  • If we exhaust the &str without seeing any whitespace the token is too long to fit in the buffer. Grow the capacity of the Vec<u8> and start again.
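The steps above can be sketched with only std (a minimal sketch; `IO_SIZE`, `lex`, and the `sink` callback are names invented here for illustration, and invalid trailing bytes at EOF are silently dropped):

```rust
use std::io::Read;
use std::str;

const IO_SIZE: usize = 8; // tiny for demonstration; real code would use e.g. 64 KiB

/// Stream whitespace-separated tokens to `sink` as &str slices that borrow
/// the IO buffer directly. The buffer grows only when a token doesn't fit.
fn lex<R: Read>(mut reader: R, mut sink: impl FnMut(&str)) -> std::io::Result<()> {
    let mut buf: Vec<u8> = Vec::with_capacity(IO_SIZE);
    let mut len = 0; // bytes currently valid in `buf`
    loop {
        // Buffer full with no whitespace seen: token too long, grow capacity.
        if len == buf.capacity() {
            buf.reserve(buf.capacity().max(IO_SIZE));
        }
        buf.resize(buf.capacity(), 0);
        let n = reader.read(&mut buf[len..])?;
        let eof = n == 0;
        len += n;

        // Decode as much valid UTF-8 as possible; a char that is only
        // partially loaded stays in the buffer for the next round.
        let valid = match str::from_utf8(&buf[..len]) {
            Ok(s) => s,
            Err(e) => str::from_utf8(&buf[..e.valid_up_to()]).unwrap(),
        };

        // Emit every token followed by whitespace (or everything on EOF).
        let mut consumed = 0; // bytes safe to discard afterwards
        let mut start = None; // byte offset where the current token begins
        for (i, c) in valid.char_indices() {
            if c.is_whitespace() {
                if let Some(s) = start.take() {
                    sink(&valid[s..i]);
                }
                consumed = i + c.len_utf8();
            } else if start.is_none() {
                start = Some(i);
            }
        }
        if eof {
            if let Some(s) = start {
                sink(&valid[s..]);
            }
            return Ok(());
        }
        // Memmove the remaining unparsed bytes to the front of the buffer.
        buf.copy_within(consumed..len, 0);
        len -= consumed;
    }
}
```

The only copies are the occasional memmove of a small tail and the capacity growth for oversized tokens; every token handed to `sink` borrows the IO buffer.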

When you are reading the bytes of the underlying &str, you are also "copying" bytes out of it. This is just as much of a copy as using the chars() iterator (operating on the string slice directly needs to do the UTF-8 decoding as well, in order to determine code point boundaries). Anyway, since this copy is trivial, it will be optimized well, so you don't need to worry about it.

So you go to great lengths to avoid unnecessary copies, only to use an overlapping copy at the end? Looks quite weird to me.

I think the best performance can be only achieved by manually working with pages. It could look roughly like this:

  1. Allocate one or several RAM pages, save the starting address of the first page as base and the number of allocated pages as number_of_pages. Set head and tail variables to 0.
  2. If PAGE_SIZE * number_of_pages - tail is less than the threshold, allocate new pages at the end using mremap. Save the address of the new memory region into base and update number_of_pages.
  3. read data into the buffer base[tail..PAGE_SIZE * number_of_pages], and increment tail by the number of bytes read.
  4. Parse a &str inside the buffer base[head..tail]. If you got your token, yield it as &str; else go to step 2.
  5. After the token has been processed by the user and released, increment head by the number of bytes in the token.
  6. munmap all pages below head, if any. Decrement both head and tail by PAGE_SIZE * number_of_unmapped_pages. Increment base by the same amount. Decrement number_of_pages by number_of_unmapped_pages.
  7. Go to step 2.

Effectively, you manually implement a simplified version of mmap-ed file.
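A portable approximation of that bookkeeping, with a Vec standing in for the raw mmap/mremap/munmap region (the `PageBuffer` type and its methods are invented here for illustration; the `resize` and `drain` calls simulate the syscalls and, unlike real mremap/munmap, do copy):

```rust
use std::io::Read;

const PAGE_SIZE: usize = 4096;
const THRESHOLD: usize = 1024;

/// Simplified stand-in for the page scheme above: `buf` plays the role of
/// the mapped region, `head`/`tail` are byte offsets into it.
struct PageBuffer {
    buf: Vec<u8>,
    head: usize, // first unconsumed byte
    tail: usize, // one past the last byte read from IO
}

impl PageBuffer {
    fn new() -> Self {
        PageBuffer { buf: vec![0; PAGE_SIZE], head: 0, tail: 0 }
    }

    /// Steps 2-3: grow if free space is below the threshold, then read.
    fn fill<R: Read>(&mut self, r: &mut R) -> std::io::Result<usize> {
        if self.buf.len() - self.tail < THRESHOLD {
            self.buf.resize(self.buf.len() + PAGE_SIZE, 0); // mremap analogue
        }
        let n = r.read(&mut self.buf[self.tail..])?;
        self.tail += n;
        Ok(n)
    }

    /// Step 4: the window a lexer would scan for the next token.
    fn window(&self) -> &[u8] {
        &self.buf[self.head..self.tail]
    }

    /// Steps 5-6: release `n` consumed bytes; discard whole "pages" below head.
    fn consume(&mut self, n: usize) {
        self.head += n;
        let whole_pages = self.head / PAGE_SIZE;
        if whole_pages > 0 {
            let bytes = whole_pages * PAGE_SIZE;
            self.buf.drain(..bytes); // munmap analogue (this one does copy!)
            self.head -= bytes;
            self.tail -= bytes;
        }
    }
}
```

With real mremap the grow step is free of copies and munmap just drops the pages, which is the whole point of the scheme; the Vec version only shows the head/tail arithmetic.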

There's a difference:

  1. inefficient: copy bytes from the IO buffer into a separate token buffer, followed by movs that load the bytes being operated on from the token buffer into registers

  2. efficient: only the movs that load the bytes being operated on directly out of the IO buffer into registers

This is generally what people mean when they refer to "zero copy." Whichever buffer the data first arrives in from IO is the only memory area where it stays for the duration of processing. There are senses in which copies still happen, like the loads to get the data into registers and it's always possible in the generated code there is a register spill so some of that data even gets copied out of registers back into memory, but there are no explicit memcpy's.

You can't rely on the compiler to do this in general. Even if the optimizer were reliable, figuring out that a copy is unnecessary can require looking across compilation units. You have to explicitly design your API to pass slices/pointers.

I don't understand what you are getting at. There's no need to copy the string into a separate buffer in order to iterate over chars of it. The chars() iterator can (and likely does) eat the chars one-by-one, only consuming as many bytes as needed, from whatever buffer contains them.
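To make that concrete: char_indices() decodes in place and reports byte offsets as it goes, so token slices can borrow the original buffer with no copying at all (a minimal sketch; std's str::split_whitespace does essentially this):

```rust
/// Split a &str into whitespace-separated tokens as zero-copy subslices.
fn tokens(s: &str) -> Vec<&str> {
    let mut out = Vec::new();
    let mut start = None; // byte offset where the current token begins
    for (i, c) in s.char_indices() {
        match (c.is_whitespace(), start) {
            // Whitespace ends the current token: push the borrowed slice.
            (true, Some(st)) => { out.push(&s[st..i]); start = None; }
            // First non-whitespace char starts a new token.
            (false, None) => start = Some(i),
            _ => {}
        }
    }
    if let Some(st) = start {
        out.push(&s[st..]); // final token with no trailing whitespace
    }
    out
}
```

Each returned &str points into the same allocation as the input; nothing is decoded into a separate buffer.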

It's actually pretty efficient for most streaming IO cases I've dealt with in practice. Most of the time the number of unparsed bytes at the end is very small because the common case for there being any remaining is when a UTF-8 char is only partially loaded. Assuming downstream API needs a contiguous set of bytes, you either have to memmove bytes down to the beginning of the buffer or allocate a new bigger buffer and memcpy the unparsed bytes to the new buffer. So you're doing a tiny copy either way, may as well avoid also allocating.

There is a second, less common way this can happen: since we are trying to keep a whole token in memory at once, and tokens could be long, we could end up memmoving a lot of bytes down. However, if we are dealing with gigantic tokens like this, we will usually end up triggering a buffer resize, after which this case should become rare again.

But how do I recover a &str into the existing IO buffer for passing to consumers downstream? chars is giving me the characters by value. It would still be more work for me to copy those chars into a new separate token buffer just for sake of getting &str.

There are many versions of "small string" implementations out there. They use stack memory for small strings and heap memory for longer ones. This can improve cache locality. Crate smol-str looks interesting.

When I was implementing my own UTF-8 conversion crate (utf8conv), I noticed my iterators cannot produce a &char result, because conversion results are temporary values; I was not able to vend out a reference to a short-lived value. &str could have similar issues.

Note that I don't know any situation in which returning a &char is useful. It's a 4-byte Copy type; just return it by value. (No reason to copy a potentially 8-byte pointer to avoid copying 4 bytes.)


nom's approach is to return an Incomplete error carrying the minimum number of bytes needed.
Once the extra data has been read, the parser can retry.

BufReader has buffer() and fill_buf() methods that return the underlying byte slice, and there is char::len_utf8() for retrieving the number of bytes a char is represented with.
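Putting those together, one possible shape (`with_first_token` is a hypothetical helper invented here; it assumes the token starts at the head of the buffer and that a single fill_buf() call holds a whole token):

```rust
use std::io::{BufRead, BufReader, Read};
use std::str;

/// Borrow a &str straight out of BufReader's internal buffer via fill_buf(),
/// hand a zero-copy token subslice to `f`, then mark the bytes consumed.
fn with_first_token<R: Read, T>(
    reader: &mut BufReader<R>,
    f: impl FnOnce(&str) -> T,
) -> std::io::Result<T> {
    let bytes = reader.fill_buf()?; // &[u8] into the internal buffer
    let s = match str::from_utf8(bytes) {
        Ok(s) => s,
        // A char split across reads shows up as a trailing error;
        // the prefix up to valid_up_to() is still valid UTF-8.
        Err(e) => str::from_utf8(&bytes[..e.valid_up_to()]).unwrap(),
    };
    let token = s.split_whitespace().next().unwrap_or("");
    let used = (token.len() + 1).min(s.len()); // token plus one separator char
    let out = f(token); // `token` borrows the IO buffer: no copy
    reader.consume(used); // only now may the buffer be refilled/reused
    Ok(out)
}
```

The key constraint is visible in the signature: the callback shape exists because the &str cannot outlive the next consume()/fill_buf(), which is exactly the lifetime problem discussed above.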
