Understanding claim in rust-cli-app guide

I'm reading the neat rust-cli guide for making cli apps with clap.

This is almost certainly a misunderstanding on my part, but here is what I find confusing:

The guide contains this paragraph:

Exercise for the reader: This is not the best implementation as it will read the whole file into memory, no matter how large the file may be. Find a way to optimize it! (One idea might be to use a BufReader instead of read_to_string().)

The BufReader docs say:

BufReader<R> can improve the speed of programs that make small and repeated read calls to the same file or network socket. It does not help when reading very large amounts at once, or reading just one or a few times. It also provides no advantage when reading from a source that is already in memory, like a Vec<u8>.

So I reasoned this way:

  1. If a file is loaded with read_to_string or read, then we have the file in memory as String or Vec<u8> respectively.
  2. Then using BufReader won't help, as the BufReader docs say in the paragraph quoted above.

Probably 1. is not a single syscall, but none of the docs for read_to_string or read state anything related to performance.

While writing this, I thought that maybe a trait they implement (like Read) documents this. It says:

Please note that each call to read() may involve a system call, and therefore, using something that implements BufRead, such as BufReader, will be more efficient.

But aren't we calling .read/read_to_string just once? Possibly not, but I am unsure. I believe a part of my confusion arises from:

  1. There is the trait Read and its method read
  2. There are the implementors, which can use it to read any size.
  3. This means the implementors may choose to read a few bytes and then we need many syscalls to read an entire file with fs::read_to_string or fs::read.

At this point I asked some chatbots, and the answer seems to be that fs::read does perform many small syscalls in a loop.

So I assume the guide is correct, and by using BufRead we guarantee using an implementation of Read that has a larger chunk size and hence performs fewer syscalls.

So is the guide's claim correct? Any conceptual gap you'd fill in, in my description?

Maybe I should look at the source code myself to see why fs::read may do many syscalls, but I'm hesitant since it may be quite hard.

The choice between reading directly from File and reading through BufReader is not about the performance of loading the whole file content into memory. Instead, it is the lines() call in the example of your linked article that makes BufReader the better choice. Imagine that you have a machine with 16GB of RAM and the file is about 1TB. If you read the whole content into memory, you cannot even proceed to the next step, because you run out of memory. With BufReader, however, you only read a small chunk of data at a time, so memory use stays bounded by the buffer and the current line rather than by the size of the file.
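To make that concrete, here is a minimal sketch of a line-by-line search. The function name find_matches is mine and not from the guide, and I'm assuming the guide's task of printing lines that contain a pattern. It accepts anything that implements BufRead, so a BufReader wrapped around a File works, and only one line is held in memory at a time (plus whatever matches you decide to keep).

```rust
use std::io::{self, BufRead};

/// Returns the lines of `reader` that contain `pattern`.
/// Lines are pulled one at a time, so the whole input is never in memory at once.
/// (`find_matches` is a hypothetical helper name, not part of the guide.)
fn find_matches<R: BufRead>(reader: R, pattern: &str) -> io::Result<Vec<String>> {
    let mut matches = Vec::new();
    for line in reader.lines() {
        // Each item is an io::Result<String> because reading can fail mid-stream.
        let line = line?;
        if line.contains(pattern) {
            matches.push(line);
        }
    }
    Ok(matches)
}
```

For a real file you would pass something like BufReader::new(File::open(path)?) as the reader. If you only want to print the matches, you can print inside the loop instead of collecting them, and then nothing accumulates at all.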

P.S. Using mmap (for example via the memmap2 crate) can also help when dealing with large files, if you understand how Linux works under the hood.

P.P.S. There is a small difference between std::fs::read and std::fs::read_to_string: the latter also performs UTF-8 validation to make sure the file content is valid for a Rust String.
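A small sketch of that difference, in case it helps. The helper name read_raw_and_text is made up for illustration; it just reads the same path with both functions so you can compare the results.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Reads the same file twice: once as raw bytes, once as UTF-8 text.
/// (`read_raw_and_text` is a hypothetical helper for illustration.)
fn read_raw_and_text(path: &Path) -> (io::Result<Vec<u8>>, io::Result<String>) {
    // fs::read accepts any byte sequence.
    let raw = fs::read(path);
    // fs::read_to_string additionally requires the bytes to be valid UTF-8
    // and returns an error otherwise.
    let text = fs::read_to_string(path);
    (raw, text)
}
```

On a file containing bytes that are not valid UTF-8, the first result is Ok and the second is an Err whose kind is ErrorKind::InvalidData.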


I see; so read_to_string would be like a BufRead that also collect()s into a String.

This in turn puts the whole file in memory, but we could instead iterate over .lines() while avoiding a collect (i.e., we use BufRead).

This seems to imply there are no relevant differences that would alter performance with respect to the number of syscalls?

PS: I should've also read read_to_end which is used by read_to_string:

This function will continuously call read() to append more data to buf until read() returns either Ok(0) or an error of non-ErrorKind::Interrupted kind.

(Even though this may not be the main issue as you suggest.)
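One way to check the read_to_end behaviour without digging through the std source is to wrap a reader and count how often read() gets called on it. This is only a sketch: CountingReader and count_read_calls are names I made up, and an in-memory byte slice stands in for a File, so it counts calls to read() rather than real syscalls.

```rust
use std::io::{self, Read};

/// Wraps a reader and counts how many times `read` is called on it.
/// (Hypothetical helper type, not part of std.)
struct CountingReader<R> {
    inner: R,
    calls: usize,
}

impl<R: Read> Read for CountingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        self.calls += 1;
        self.inner.read(buf)
    }
}

/// Drains `data` with read_to_end and returns (bytes read, number of read() calls).
fn count_read_calls(data: &[u8]) -> io::Result<(usize, usize)> {
    let mut source = CountingReader { inner: data, calls: 0 };
    let mut buf = Vec::new();
    let n = source.read_to_end(&mut buf)?;
    Ok((n, source.calls))
}
```

There are always at least two calls for non-empty input, because the loop only stops once a read() returns Ok(0). The exact count depends on how the standard library grows the buffer, so treat the number as illustrative rather than something to rely on.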

I think the difference is that, with the right setup, BufReader can load the file gradually as you request new lines from the iterator, instead of loading it all at once at the start. That means you might get better performance, because you can do work in between the loads.
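Here is a rough sketch of how you could observe that gradual loading. TrackingReader and first_line_cost are names I invented for this, and a byte slice stands in for the file. It reads just the first line through a BufReader and reports how many bytes were actually pulled from the underlying source.

```rust
use std::io::{self, BufRead, BufReader, Read};

/// Wraps a reader and records how many bytes have been pulled from it so far.
/// (Hypothetical helper type, not part of std.)
struct TrackingReader<R> {
    inner: R,
    bytes_read: usize,
}

impl<R: Read> Read for TrackingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.inner.read(buf)?;
        self.bytes_read += n;
        Ok(n)
    }
}

/// Reads only the first line through a BufReader of the given capacity and
/// returns that line together with the bytes fetched from the source so far.
fn first_line_cost(data: &[u8], capacity: usize) -> io::Result<(String, usize)> {
    let source = TrackingReader { inner: data, bytes_read: 0 };
    let mut reader = BufReader::with_capacity(capacity, source);
    let mut line = String::new();
    // read_line keeps the trailing newline in `line`.
    reader.read_line(&mut line)?;
    Ok((line, reader.get_ref().bytes_read))
}
```

With a short first line and a small capacity, the number of fetched bytes stays at or below the buffer capacity no matter how large the rest of the input is, which is the "load as you go" behaviour described above.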
