Performance reading file: Parse from `io::Read` vs from `&[u8]`

So, among others, the API of serde_json provides a function to parse from a &[u8] via from_slice and a function to parse directly from an io::Read via from_reader.

When one wants to parse a file (or any other I/O stream, such as an HTTP request body), it seems intuitive to do it like this:

use std::{fs::File, io::BufReader};

let reader = BufReader::new(File::open("myjson.json")?);
let my_thing: serde_json::Value = serde_json::from_reader(reader)?;

The other possibility is to first read the whole file into memory, then in a second step, parse it:

use std::{fs::File, io::{BufReader, Read}};

let mut reader = BufReader::new(File::open("myjson.json")?);
let mut bytes: Vec<u8> = Vec::new();
reader.read_to_end(&mut bytes)?;
let my_thing: serde_json::Value = serde_json::from_slice(&bytes)?;

The second approach consumes extra memory on the order of the file size while reading, so theoretically / intuitively it should also be slower, since more data is being allocated and copied.


The puzzling thing is the following comment in the documentation of from_reader:

Note that counter to intuition, this function is usually slower than reading a file completely into memory and then applying from_str or from_slice on it. See issue #160.

In fact, in the tests posted in that ticket, the from_reader approach is about 10x slower than the read_to_end + from_slice approach.

Can anyone explain why?

1 Like

After a short browse: the io::Read-based parser has to proceed byte by byte and keep a scratch buffer until it hits a point where it can parse a chunk of incoming JSON. Everything goes through the scratch buffer.

Whereas if it's all already in memory, the parser can scan arbitrarily far ahead and doesn't always need a scratch buffer.

E.g. the relevant comment for string parsing.
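
To make the contrast concrete, here's a minimal sketch (my own illustration, not serde_json's actual code) of per-byte consumption through io::Read next to a plain slice scan; the quote-counting just stands in for parsing work:

use std::io::{self, Read};

// Streaming: one read() call per byte, each going through the Read
// trait, checking for errors, and copying into a one-byte scratch buffer.
fn count_quotes_streaming<R: Read>(mut reader: R) -> io::Result<usize> {
    let mut count = 0;
    let mut byte = [0u8; 1];
    while reader.read(&mut byte)? == 1 {
        if byte[0] == b'"' {
            count += 1;
        }
    }
    Ok(count)
}

// In memory: a plain slice iteration the optimizer can unroll and
// vectorize, because the whole input is visible at once.
fn count_quotes_slice(bytes: &[u8]) -> usize {
    bytes.iter().filter(|&&b| b == b'"').count()
}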

1 Like

Allocating more data doesn't necessarily take more time; it's not that simple. The speed of an allocation doesn't depend only on its size, and it depends on a whole lot of other circumstances besides.

For example, most allocations are typically small, so allocators that manage separate freelists for small and big chunks may very well end up with a fragmented small freelist and an untouched large freelist. This in turn can cause big allocations to be faster than small ones.

Furthermore, copying linear chunks of memory at once is very cheap on modern machines. Due to caching, prefetching, and vectorization, a memcpy() is usually not something you'll want to worry about; it can actually be faster than the allocation itself. Parsing the JSON byte-by-byte can easily be at least 8 times slower than a vectorized memcpy that copies 8 bytes (assuming a 64-bit machine) per iteration and does nothing else. (Oh, I just saw your 10x experimental figure. Should I buy a lottery ticket?)
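
For a sense of scale, here's a rough way to see what one big linear copy costs (just a sketch; run with --release, and the exact number depends heavily on your machine and caches):

use std::time::Instant;

fn main() {
    let src = vec![0u8; 64 * 1024 * 1024]; // 64 MiB of data
    let mut dst = vec![0u8; src.len()];

    let t = Instant::now();
    dst.copy_from_slice(&src); // lowers to an optimized memcpy
    println!("copying 64 MiB took {:?}", t.elapsed());
}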

And as always, use the right tool for the job. Once your JSON files become so big that they cause your application to be slow, then you are at a point where you should probably abandon reading big JSONs, and use a real database for quickly accessing only the parts of the data you need.

2 Likes

Actually, I brought up serde_json as an example because the cited comment from the docs puzzled me, but I take it the speed difference doesn't have anything to do with parsing JSON in particular, right?
(I reckon now that this assumption may be wrong, as @quinedot mentioned something about scratch space which is specific to that implementation of the JSON parser.)

Also, however, as far as I can follow you, this explains only why parsing from a byte slice is faster than parsing from a file; but that byte slice must itself first be read from the file.


I understand the following has to be done in each case:

serde_json::from_reader using a BufReader:

  1. The JSON parser reads byte-by-byte from the BufReader
  2. The BufReader in turn reads in 8KB chunks (DEFAULT_BUF_SIZE) from the file

serde_json::from_slice using a BufReader:

  1. The current implementation of read_to_end reads as many bytes in one go from the BufReader as there is capacity in the supplied Vec<u8> buffer, and after that it apparently only advances in chunks of 32 bytes (as far as I understand the code)
  2. During reading the whole file, the Vec<u8> buffer's capacity must be doubled frequently (at least that's how it is in Java)
  3. The BufReader still reads in 8KB chunks (DEFAULT_BUF_SIZE) from the file
  4. Finally, the JSON parser parses the byte slice from the Vec<u8>

Now, I do understand that point 2 in particular is so fast on modern hardware that its impact on execution speed is negligible, according to @H2CO3, but this doesn't change the fact that the second approach still seems to be doing more than the first, so how can it be faster?
After all, byte-array-slice or not, the JSON parser still needs to consume the incoming data byte-by-byte (I guess) in either case.

I haven't benchmarked to see what difference it makes over the BufReader approach you wrote down, but if you were doing the "read entire file" approach today, you'd probably use fs::read, which preallocates based on the file size. That said, the cited issue predates fs::read, I think.
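
For reference, the fs::read version would look something like this (with serde_json::Value standing in for whatever the real target type is):

let bytes = std::fs::read("myjson.json")?; // preallocates from the file's metadata size
let my_thing: serde_json::Value = serde_json::from_slice(&bytes)?;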

Anyway, that goes something like

  • Allocate a big chunk of memory and copy file into it
    • Only as many syscalls as the OS needs to fill it
  • Parse slice...
    • ...mostly in place, which is CPU cache friendly

Versus

  • File is read in 8 KB chunks or so
    • Less memory use, but many syscalls to read
  • Copy bytes from the reader to local scratch space as you go
    • So the entire file isn't just read into a buffer; it's read into a buffer and then copied to another memory location (almost twice as much moving of bytes around)
    • Hopefully the function call to read byte by byte is optimized out, but...
    • ...working in two memory locations is still less CPU-cache friendly, and it's doing that a byte at a time

This guy got a 4-5x speedup by specializing on BufReader to parse everything in the buffer if possible, and if not, to copy a chunk of the buffer into the scratch space instead of copying byte by byte while parsing. That's with no improvements on the syscall front, so I think the byte-by-byte double-buffering impact is pretty significant.
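The core idea there, as I understand it, goes roughly like this sketch (my approximation, not the linked code): work directly on the chunks the BufReader already holds, via fill_buf()/consume(), instead of pulling bytes out one at a time. Again, quote-counting stands in for real parsing:

use std::io::{self, BufRead};

fn count_quotes_buffered<R: BufRead>(mut reader: R) -> io::Result<usize> {
    let mut count = 0;
    loop {
        let buf = reader.fill_buf()?; // borrow the buffered chunk in place
        if buf.is_empty() {
            break; // EOF
        }
        count += buf.iter().filter(|&&b| b == b'"').count();
        let len = buf.len();
        reader.consume(len); // mark the whole chunk as processed
    }
    Ok(count)
}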


But as I haven't actually benchmarked anything, this is all watercooler speculation talk.

5 Likes

That's quite nice. I also see that the read_to_end is done without a BufReader, which makes sense when one pre-allocates the Vec to the size of the file.
Too bad, though, that io::Read doesn't have a size hint like Iterator has, because of course fs::read is specific to files, while io::Read is generic and can be used with any source of data. I guess that, due to necessary backward compatibility, the Rust team cannot just add new methods to traits over time.
(Just a comment from a newbie in the language.)

You can, but only if you can also add a default body. There could be a size hint that defaults to None for example.
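
A sketch of what that could look like (hypothetical; this trait is not part of std):

use std::fs::File;
use std::io::Read;

// Hypothetical extension trait: the default body returns None, so
// adding the method would break no existing implementor.
trait ReadSizeHint: Read {
    fn size_hint(&self) -> Option<u64> {
        None
    }
}

// A File can offer its metadata length as a hint (which may still lie).
impl ReadSizeHint for File {
    fn size_hint(&self) -> Option<u64> {
        self.metadata().ok().map(|m| m.len())
    }
}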

It would almost always just be a hint; you don't actually know when you're going to get an EOF when reading a file for example (the file size can change while reading and metadata can lie on e.g. fusefs). So an ExactSizeReader for optimization is unlikely.

Why don't we have the size hint method? Too niche, I guess. Related things we do have are:

And looking at the implementations, BufReader defers to what it's wrapping for read_to_end, and File takes its size into account, so there's probably not much difference at all between that and "manually" doing what fs::read does.

2 Likes

I find this equivalence highly unlikely. I don't know how serde_json is actually implemented, but fast parsers usually perform all sorts of sneaky tricks in order to skip as many bytes as possible at once. Of course, this can only work (without the author losing all of their hair in a week) when the entire string is already in memory. There are several well-known, fast search algorithms for this purpose, for example Aho & Corasick 1975, which matches several substrings simultaneously, or Boyer & Moore 1977, which gets faster as the pattern gets longer (!).

When you get fed bytes from a reader, such skipping is not generally possible (or at least too hard to bother with). Thus, streaming parsers generally need to fall back to slower, more naïve string matching algorithms. If you want to find out whether this is actually the case, you can dig up the Read trait in serde_json and check what its users and implementors do.
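
For a taste of what such skipping looks like on an in-memory slice, here's a sketch using the memchr crate (my illustration; serde_json's actual string scanning takes its own lookup-table approach): jump straight to the next interesting byte instead of inspecting every one.

// Find the closing quote of a JSON string body by skipping ahead with a
// vectorized search for '"' or '\\' instead of a byte-at-a-time loop.
// (Simplified: real escape handling has more cases.)
fn find_string_end(input: &[u8]) -> Option<usize> {
    let mut pos = 0;
    loop {
        let rel = memchr::memchr2(b'"', b'\\', &input[pos..])?;
        let i = pos + rel;
        if input[i] == b'"' {
            return Some(i); // unescaped closing quote
        }
        pos = (i + 2).min(input.len()); // skip the escaped character
    }
}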

If only the US government followed this advice when it required all healthcare providers to provide pricing information for all the services they provide... The average size for these gzipped JSON files is measured in the terabytes and downloading the dataset will cost you tens of thousands of dollars in egress fees :upside_down_face:

4 Likes

Related: burntsushi's regex crate has been looking at this issue for a relatively long time.

2 Likes

:scream_cat: It never ceases to amaze me how much technical incompetence governments and related organizations manage to accumulate.

1 Like

Point 1 is a misunderstanding - there are two 32-byte things. The first is a probe to see if the buffer capacity happened to be exactly the same size as the file. If the read into a 32-byte probe buffer returns 0 bytes read, we know we hit the exact-size case and can exit without increasing the buffer capacity. If the read into the 32-byte probe buffer returns data, we append it to the buffer (causing a reallocation of capacity).

The other 32-byte thing is asking the buffer to expand its capacity by at least 32 bytes; in practice, Vec will double its capacity once the buffer's capacity is more than 32 bytes.
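
In sketch form, the shape described above is roughly this (a simplification of my own; the real std implementation avoids the zero-fill and handles more cases):

use std::io::{self, Read};

fn read_to_end_sketch<R: Read>(r: &mut R, buf: &mut Vec<u8>) -> io::Result<usize> {
    let start = buf.len();
    loop {
        if buf.len() == buf.capacity() {
            // Capacity exhausted: probe with a small stack buffer so the
            // Vec isn't grown if we're already exactly at EOF.
            let mut probe = [0u8; 32];
            let n = r.read(&mut probe)?;
            if n == 0 {
                return Ok(buf.len() - start);
            }
            buf.extend_from_slice(&probe[..n]); // grows (typically doubles) capacity
        } else {
            // Read directly into the Vec's spare capacity.
            let old_len = buf.len();
            let spare = buf.capacity() - old_len;
            buf.resize(old_len + spare, 0);
            let n = r.read(&mut buf[old_len..])?;
            buf.truncate(old_len + n);
            if n == 0 {
                return Ok(buf.len() - start);
            }
        }
    }
}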

Point 3 in this list is slightly off, because BufReader has an optimized implementation of read_to_end which drains the buffer, and then passes through to the underlying Read implementation, instead of refilling the buffer.

In turn, this can, depending on the underlying implementation, make point 2 false - for example File has a read_to_end implementation that resizes the buffer capacity to the correct size immediately via a private buffer_capacity_required method.

Thus, if you've built a BufReader over a File that's backed by a file that's not being written by another process, not used it (so the buffer is empty), and then call read_to_end, the resulting syscall behaviour is optimized. BufReader determines that its buffer is empty, and thus calls File's read_to_end to do the full read. File then checks the length of the file and sets that as the buffer capacity, before calling the default read_to_end with a suitably large buffer. The result is one syscall if your OS will handle big enough reads in a single syscall - or a loop reading as many bytes as the OS will let you read at a time.
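
One practical consequence, if the above is right: for a whole-file read there's no need for the BufReader at all; something like this takes the same optimized path as fs::read:

use std::fs::File;
use std::io::Read;

let mut file = File::open("myjson.json")?;
let mut bytes = Vec::new();
file.read_to_end(&mut bytes)?; // File sizes `bytes` from its metadata first
let my_thing: serde_json::Value = serde_json::from_slice(&bytes)?;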

3 Likes
