The API of serde_json provides, among others, a function to parse from a &[u8] via from_slice and a function to parse directly from an io::Read via from_reader.
When one wants to parse a file (or any other I/O stream, such as an HTTP request body), it seems intuitive to do it like this:
let reader = BufReader::new(File::open("myjson.json")?);
let my_thing = serde_json::from_reader(reader)?;
The other possibility is to first read the whole file into memory, then in a second step, parse it:
let mut reader = BufReader::new(File::open("myjson.json")?);
let mut bytes: Vec<u8> = Vec::new();
reader.read_to_end(&mut bytes)?;
let my_thing = serde_json::from_slice(bytes.as_slice())?;
The second approach should consume extra memory roughly equal to the size of the loaded file, and intuitively it should be slower too, since more data is allocated and copied.
The puzzling thing is the following comment in the documentation of from_reader:
Note that counter to intuition, this function is usually slower than reading a file completely into memory and then applying from_str or from_slice on it. See issue #160.
In fact, in the tests posted in that ticket, the from_reader approach is about 10x slower than the read_to_end + from_slice approach.
Allocating more data doesn't necessarily take more time; it's not that simple. The speed of an allocation doesn't depend primarily on its size; it depends on a whole lot of other circumstances as well.
For example, most allocations are typically small, so allocators that manage separate freelists for small and big chunks may very well end up with a fragmented small freelist and an untouched large freelist. This in turn can cause big allocations to be faster than small ones.
Furthermore, copying linear chunks of memory at once is very cheap on modern machines. Due to caching, prefetching, and vectorization, a memcpy() is usually not something you'll want to worry about; it can actually be faster than the allocation itself. Parsing the JSON byte-by-byte can easily be at least 8 times slower than a vectorized memcpy that copies 8 bytes (assuming a 64-bit machine) per iteration and does nothing else. (Oh, I just saw your 10x experimental figure. Should I buy a lottery ticket?)
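To make the memcpy-vs-byte-loop point concrete, here is a rough micro-benchmark sketch. It is not rigorous (timings vary wildly across machines and optimization levels), and the per-byte loop is only a stand-in for a parser that must branch on each byte:

```rust
use std::hint::black_box;
use std::time::Instant;

fn main() {
    let src = vec![0xABu8; 16 * 1024 * 1024]; // 16 MiB of data
    let mut dst = vec![0u8; src.len()];

    // Bulk copy: compiles down to an optimized memcpy.
    let t = Instant::now();
    dst.copy_from_slice(&src);
    let bulk = t.elapsed();

    // Byte-at-a-time loop: touches every byte individually. black_box
    // keeps the compiler from vectorizing the loop away, mimicking a
    // parser that has to inspect each byte.
    let t = Instant::now();
    let mut acc: u64 = 0;
    for &b in &src {
        acc = acc.wrapping_add(black_box(b) as u64);
    }
    let per_byte = t.elapsed();

    assert_eq!(dst, src);
    println!("bulk copy: {bulk:?}, per-byte scan: {per_byte:?}, acc = {acc}");
}
```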
And as always, use the right tool for the job. Once your JSON files become so big that they cause your application to be slow, then you are at a point where you should probably abandon reading big JSONs, and use a real database for quickly accessing only the parts of the data you need.
Actually, I brought up serde_json as an example because the cited comment from the docs puzzled me, but I take it the speed difference doesn't have anything to do with parsing JSON in particular, right?
(I reckon now that this assumption may be wrong, as @quinedot mentioned something about scratch space which is specific to that implementation of the JSON parser)
Also, as far as I can follow you, this only explains why parsing from a byte slice is faster than parsing from a file; but that byte slice also must first be read from the file.
I understand the following has to be done in each case:
serde_json::from_reader using a BufReader:

1. The JSON parser reads byte-by-byte from the BufReader
2. The BufReader in turn reads in 8KB chunks (DEFAULT_BUF_SIZE) from the file

serde_json::from_slice using a BufReader:

1. The current implementation of read_to_end reads as many bytes in one go from the BufReader as there is capacity in the supplied Vec<u8> buffer, and after that apparently only advances in chunks of 32 bytes (as far as I understand the code)
2. While reading the whole file, the Vec<u8> buffer's capacity must be doubled frequently (at least that's how it is in Java)
3. The BufReader still reads in 8KB chunks (DEFAULT_BUF_SIZE) from the file
4. Finally, the JSON parser parses the byte slice from the Vec<u8>
Now, I do understand that point 2 in particular is so fast on modern hardware that its impact on execution speed is negligible, per @H2CO3, but that doesn't change the fact that the second approach still seems to do more than the first one, so how can it be faster?
After all, byte-array-slice or not, the JSON parser still needs to consume the incoming data byte-by-byte (I guess) in either case.
I haven't benchmarked to see what difference it makes over the BufReader approach you wrote down, but if you were doing the "read entire file" approach today you'd probably use fs::read, which preallocates based on the file size. That said, the cited issue predates fs::read I think.
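For reference, the fs::read variant would look something like this (a temp file is written first so the snippet is self-contained; in real code you'd pass the bytes on to serde_json::from_slice):

```rust
fn main() -> std::io::Result<()> {
    // Self-contained setup: write a small file first
    // ("myjson_example.json" is just an illustrative name).
    let path = std::env::temp_dir().join("myjson_example.json");
    std::fs::write(&path, b"{\"answer\": 42}")?;

    // fs::read stats the file and preallocates a Vec of the right size,
    // so the buffer normally never has to grow while reading.
    let bytes = std::fs::read(&path)?;
    assert_eq!(bytes.len(), 14);
    // ...then: let my_thing = serde_json::from_slice(&bytes)?;

    std::fs::remove_file(&path)?;
    Ok(())
}
```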
Anyway, that goes something like:

Read entire file, then from_slice:

- Allocate a big chunk of memory and copy the file into it
  - Only as many syscalls as the OS needs to fill it
- Parsing then happens mostly in place, which is CPU cache friendly

from_reader:

- The file is read in 8k chunks or so
  - Less memory use, but many syscalls to read
- Bytes are copied from the reader to local scratch space as you go
  - So the entire file isn't just read into a buffer; it's read into a buffer and copied to another memory location (almost twice as much moving of bytes around)
- Hopefully the function call to read byte by byte is optimized out, but...
- ...working in two memory locations is still less CPU cache friendly, and it's doing that a byte at a time
This guy got a 4-5x speedup by specializing on BufReader to parse everything in the buffer if possible, and if not, to copy a chunk of the buffer into the scratch space instead of copying byte by byte while parsing. That's with no improvements on the syscall front, so I think the byte-by-byte double-buffering impact is pretty significant.
But as I haven't actually benchmarked anything, this is all watercooler speculation talk.
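The chunk-at-a-time idea can be sketched with BufRead's fill_buf/consume pair; here a trivial quote-counting scan stands in for actual JSON parsing (this is not serde_json's or the linked PR's implementation, just an illustration of the technique):

```rust
use std::io::{BufRead, BufReader, Cursor};

// Instead of pulling one byte at a time, borrow the whole buffered
// chunk via fill_buf(), process it in place, then consume it.
fn count_quotes(mut reader: impl BufRead) -> std::io::Result<usize> {
    let mut quotes = 0;
    loop {
        let chunk = reader.fill_buf()?; // borrow the internal buffer
        if chunk.is_empty() {
            break; // EOF
        }
        // "Parse" the whole chunk at once (here: a stand-in scan).
        quotes += chunk.iter().filter(|&&b| b == b'"').count();
        let n = chunk.len();
        reader.consume(n); // mark the chunk as processed
    }
    Ok(quotes)
}

fn main() -> std::io::Result<()> {
    let reader = BufReader::new(Cursor::new(b"{\"answer\": 42}".to_vec()));
    println!("{}", count_quotes(reader)?); // 2 quote bytes
    Ok(())
}
```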
That's quite nice. I also see that the read to end is done without a BufRead, which makes sense when one pre-allocates the Vec to the size of the file.
Too bad, though, that io::Read doesn't have a size hint like Iterator does; fs::read is of course specific to files, while io::Read is generic and can be used with any source of data. I guess that due to backward compatibility, the Rust team cannot just add new methods to traits over time.
(Just a comment from a newbie in the language.)
You can, but only if you can also add a default body. There could be a size hint that defaults to None for example.
It would almost always just be a hint; you don't actually know when you're going to get an EOF when reading a file for example (the file size can change while reading and metadata can lie on e.g. fusefs). So an ExactSizeReader for optimization is unlikely.
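A sketch of what such a defaulted method could look like; ReadSizeHint is a made-up name for illustration, not a std trait:

```rust
use std::io::Read;

// Hypothetical extension trait: a size hint with a default body, so
// existing implementors keep compiling unchanged.
trait ReadSizeHint: Read {
    /// A guess at the remaining bytes, if known. Only ever a hint:
    /// the source may still end early or keep going.
    fn size_hint(&self) -> Option<u64> {
        None
    }
}

// A reader that doesn't know its size just inherits the default.
struct Unknown;
impl Read for Unknown {
    fn read(&mut self, _buf: &mut [u8]) -> std::io::Result<usize> {
        Ok(0)
    }
}
impl ReadSizeHint for Unknown {}

fn main() {
    assert_eq!(Unknown.size_hint(), None);
}
```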
Why don't we have the size hint method? Too niche, I guess. Related things we do have are
And looking at the implementations, BufReader defers to what it's wrapping for read_to_end, and File takes its size into account, so there's probably not much difference at all between that and "manually" doing fs::read.
I find this equivalence highly unlikely. I don't know how serde_json is actually implemented, but fast parsers usually perform all sorts of sneaky tricks in order to skip as many bytes as possible at once. Of course, this can only work (without the author losing all of their hair in a week) when the entire string is already in memory. There are several well-known, fast search algorithms for this purpose, for example Aho & Corasick 1975, which matches several substrings simultaneously, or Boyer & Moore 1977, which gets faster as the pattern gets longer (!).
When you get fed bytes from a reader, such skipping is not generally possible (or at least too hard to bother with). Thus, streaming parsers generally need to fall back to slower, more naïve string matching algorithms. If you want to find out whether this is actually the case, you can dig up the Read trait in serde_json and check what its users and implementors do.
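As a toy illustration of slice-based skipping: with the whole input in memory, a parser can jump to the next interesting byte with a single search over the slice, which is what memchr-style scanning (and the fancier multi-pattern algorithms above) build on:

```rust
fn main() {
    let json = br#"{"name": "a very long string we can skip over", "n": 1}"#;

    // Jump straight to the opening quote, then to the matching close,
    // without "parsing" any of the bytes in between one at a time.
    let start = json.iter().position(|&b| b == b'"').unwrap();
    let rel = json[start + 1..].iter().position(|&b| b == b'"').unwrap();
    let end = start + 1 + rel;

    assert_eq!(&json[start..=end], br#""name""#);
    println!("first string spans bytes {start}..={end}");
}
```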
If only the US government had followed this advice when it required all healthcare providers to publish pricing information for all the services they provide... The average size of these gzipped JSON files is measured in terabytes, and downloading the dataset will cost you tens of thousands of dollars in egress fees.
Point 1 is a misunderstanding - there are two 32-byte things. The first is a probe to see if the buffer capacity happened to be exactly the same size as the file. If the read into a 32-byte probe buffer returns 0 bytes read, we know we hit the exact-size case and can exit without increasing the buffer capacity. If the read into the 32-byte probe buffer returns data, we append it to the buffer (causing a reallocation of capacity).
The other 32 byte thing is asking the buffer to expand its capacity by at least 32 bytes; in practice, Vec will double its capacity once the buffer's capacity is more than 32 bytes.
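The amortized-doubling behaviour is easy to observe; note the exact growth factor is an implementation detail of Vec, not a guarantee:

```rust
fn main() {
    let mut v: Vec<u8> = Vec::new();
    let mut reallocations = 0;
    let mut last_cap = v.capacity();

    for _ in 0..1_000_000 {
        v.push(0);
        if v.capacity() != last_cap {
            reallocations += 1;
            last_cap = v.capacity();
        }
    }

    // Geometric growth means roughly log2(1_000_000) ≈ 20 capacity
    // changes, not a million of them.
    assert_eq!(v.len(), 1_000_000);
    assert!(reallocations < 64);
    println!("{} pushes, {} capacity changes", v.len(), reallocations);
}
```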
Thus, if you've built a BufReader over a File that's backed by a file that's not being written by another process, not used it (so the buffer is empty), and then call read_to_end, the resulting syscall behaviour is optimized. BufReader determines that its buffer is empty, and thus calls File's read_to_end to do the full read. File then checks the length of the file and sets that as the buffer capacity, before calling the default read_to_end with a suitably large buffer. The result is one syscall if your OS will handle big enough reads in a single syscall - or a loop reading as many bytes as the OS will let you read at a time.