My gripes with BufReader and BufWriter

  • Almost everyone forgets or doesn't know they should use them.
  • It's not clear/obvious when or how to use them.
  • Some things in std come prebuffered, e.g. Stdout, but others do not, e.g. File.

Maybe the BufRead and BufWrite traits are lifesavers here, because they let you know whether the thing you have is already benefitting from buffering or not. Perhaps library authors have an incentive to accept Read/Write because it's more permissive, but should application developers annotate their functions with reader: impl BufRead and writer: impl BufWrite to signal that adding buffering is the caller's job? My reasoning is not that those traits offer some functionality my function can't live without, but that not buffering often leaves a lot of performance on the table.
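A minimal sketch of what that signature-as-documentation idea could look like on the read side (the function name and file path are made up for illustration; note that std only has a BufRead trait, which comes up later in the thread):

```rust
use std::io::{self, BufRead};

// By asking for `impl BufRead` instead of `impl Read`, the signature tells
// the caller that small, repeated reads are coming and that wrapping e.g. a
// `File` in a `BufReader` is their job.
fn count_lines(reader: impl BufRead) -> io::Result<usize> {
    let mut count = 0;
    for line in reader.lines() {
        line?; // propagate I/O errors
        count += 1;
    }
    Ok(count)
}

fn main() -> io::Result<()> {
    let file = std::fs::File::open("Cargo.toml")?;
    // The caller decides to add the buffer; the signature makes that explicit.
    let n = count_lines(io::BufReader::new(file))?;
    println!("{n} lines");
    Ok(())
}
```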

But again, I don't really know. I'm not sure what the best practices / idiomatic usage is here, perhaps due in part to my lacking the low-level understanding of what goes on between a read/write and the operating system. Perhaps this entire thread is coming from a place of ignorance.

Does anyone have the magic words that will put me at ease?

Buffering isn't free, and depending on your application, buffering can be a net cost, not a win. Given that there's no "right" answer, Rust leaves you to pick the right answer.

Fundamentally, buffering costs you memory and code complexity (the code doing the buffering) in return for reducing the number of accesses you have to make to the underlying stream. It does this on the read side by making big read requests to fill the buffer, and then copying data from the buffer instead of making more read requests until the buffer is empty, and on the write side by copying your data into the buffer, then writing the buffer out when it's full or it's flushed.
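A small sketch of both sides of that trade, using hypothetical file names (std's BufReader and BufWriter currently default to an 8 KiB buffer):

```rust
use std::fs::File;
use std::io::{self, BufReader, BufWriter, Read, Write};

fn main() -> io::Result<()> {
    // Read side: BufReader fills its buffer with one large read, then serves
    // small reads from memory until the buffer is empty.
    let mut reader = BufReader::new(File::open("input.bin")?);
    let mut byte = [0u8; 1];
    reader.read_exact(&mut byte)?; // served from the buffer, not a fresh read call

    // Write side: BufWriter copies small writes into its buffer and only
    // writes to the file once the buffer fills up or is flushed/dropped.
    let mut writer = BufWriter::new(File::create("output.bin")?);
    writer.write_all(&byte)?;
    writer.flush()?; // push the buffered bytes down to the OS
    Ok(())
}
```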

If you're always working in large chunks, or if you need to flush after every operation so that the data gets out, you don't benefit from buffering, but you do pay a cost for having the buffer in place. For example, if you're implementing something like std::io::copy, buffering is going to cost you without helping: you're already working in chunks that are as large as possible, so all reads are as big as or bigger than the read buffer (⇒ a read buffer doesn't help), and all writes are either too big to buffer, or flushed immediately after writing (⇒ a write buffer doesn't help).
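For instance, copying one file to another with io::copy needs no wrapper types at all (file names invented for the example):

```rust
use std::fs::File;
use std::io;

fn main() -> io::Result<()> {
    // io::copy already moves data in large chunks, so wrapping these files in
    // BufReader/BufWriter would only add an extra copy per byte.
    let mut src = File::open("big-input.dat")?;
    let mut dst = File::create("big-output.dat")?;
    io::copy(&mut src, &mut dst)?;
    Ok(())
}
```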

The same can apply even with very small chunks; "what is your name?" is a short string to write, but having written it out, if I'm going to wait for the user to supply input, I'm going to want to flush immediately. That makes the buffer useless even if I did the write in one call, since I'm then immediately paying the price of flushing the buffer - whereas with an unbuffered write, I only need to make one write call and it's done.
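A concrete version of that prompt-then-flush pattern might look like this (std's Stdout is line-buffered, so with no trailing newline the explicit flush is what gets the prompt out):

```rust
use std::io::{self, Write};

fn main() -> io::Result<()> {
    let mut out = io::stdout();
    // The prompt has no trailing newline, so the line buffer won't drain on
    // its own: flush right away so the user actually sees the question.
    write!(out, "what is your name? ")?;
    out.flush()?;

    let mut name = String::new();
    io::stdin().read_line(&mut name)?;
    writeln!(out, "hello, {}", name.trim())?;
    Ok(())
}
```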

So, given that there are cases where I lose out if the write or read is buffered, the default can't be buffered, since that makes everyone pay the cost of buffering, even when it's not needed.

However, against that, there are plenty of cases where buffering does help; if you're reading byte-by-byte, for example, it's cheaper to do a single big read into a buffer, and then copy bytes out of it than to do reads of a single byte at a time from the underlying source. Similarly, if you're writing 1 to 16 byte chunks from calling to_le_bytes on an integer or floating point number, having the buffer turn that into a small number of big writes is helpful.
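A sketch of the write side of that: a million small to_le_bytes writes go through one BufWriter and come out as a handful of large writes (output file name is made up):

```rust
use std::fs::File;
use std::io::{self, BufWriter, Write};

fn main() -> io::Result<()> {
    let file = File::create("numbers.bin")?;
    let mut writer = BufWriter::new(file);
    for i in 0u64..1_000_000 {
        // Each call copies 8 bytes into the buffer; the file only sees a
        // write when the buffer fills up.
        writer.write_all(&i.to_le_bytes())?;
    }
    writer.flush()?; // make sure the last partial buffer reaches the file
    Ok(())
}
```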

So, there needs to be an easy way to add buffering if you know it's needed - but that needs to be used only where appropriate, since otherwise you pay the cost of multiple buffers in a stack, rather than just a single buffer.
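For example, a sketch of the stacking pitfall (file name invented): wrapping an already-buffered reader in a second BufReader works, but every byte now travels through two buffers.

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader};

fn main() -> io::Result<()> {
    // One layer of buffering is enough for line-by-line reading.
    let good = BufReader::new(File::open("input.txt")?);

    // This also compiles, but the inner BufReader fills its buffer from the
    // file and the outer one then copies from the inner buffer into its own.
    let wasteful = BufReader::new(BufReader::new(File::open("input.txt")?));

    // Both behave the same from the caller's point of view.
    println!("{} vs {}", good.lines().count(), wasteful.lines().count());
    Ok(())
}
```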

10 Likes

@farnz Coming in again, gifting all of us with his wisdom!!

I have to re-read this a few more times, I think, before it will completely click. But I thank you so much for sharing.

2 Likes

There is no BufWrite trait, and wrapping Vec and similar in BufWriter adds unnecessary copying.
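For example, a Vec<u8> is already "all buffer", so it can be written to directly:

```rust
use std::io::{self, Write};

fn main() -> io::Result<()> {
    // Vec<u8> implements Write; wrapping it in BufWriter would just copy
    // every byte twice (once into the BufWriter, once into the Vec).
    let mut buf: Vec<u8> = Vec::new();
    buf.write_all(b"no BufWriter needed here")?;
    assert_eq!(buf.len(), 24);
    Ok(())
}
```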

BufRead is a bit better since it is implemented directly for in-memory sources like byte slices, but OTOH libraries may still want to avoid the extra copy and try to issue fewer, larger reads themselves instead.
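A small illustration (example data invented): borrowing a Vec<u8> as a slice gives a BufRead without any wrapper or extra copy.

```rust
use std::io::BufRead;

fn main() {
    let data: Vec<u8> = b"one\ntwo\nthree\n".to_vec();
    // BufRead is implemented for &[u8], so the slice itself acts as the
    // "buffer"; no BufReader is needed for in-memory data.
    let lines: Vec<String> = data.as_slice().lines().map(Result::unwrap).collect();
    assert_eq!(lines, ["one", "two", "three"]);
}
```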

I think the problem really is that Rust lacks specialization, so libraries can't choose an optimal strategy based on the reader/writer they're given.

5 Likes

Specialisation means knowing the exact type you've been given and being able to write different code paths depending on the type?

1 Like

Surprisingly enough you have already said that word: magic.

That's something Rust hates with passion.

Or, more specifically, it hates fallible magic at runtime. It's Rust's solution to a well-known phenomenon.

Leaky abstractions are convenient yet cause tons of grief. Every single system that included automatic buffering had to also include a way to disable it (usually called Direct I/O) which makes the whole construct more complicated and baroque than needed.

Rust does include a fair amount of magic (as you noted, Stdout is buffered, after all), but it tries to bring it in only where the lack of it would be too confusing for newcomers and would make people stop using the language before giving it a serious try.

Ultimately this is the key reason:

It's not clear whether you want buffering or not even on systems where magical auto-buffering is enabled by default; the difference is that there you get code that is fast yet [often] incorrect, while with Rust you usually get code that is correct yet [often] slow.

There is just no silver bullet; that's the root cause of all this mess. As we all know, there are two hard problems in computer science: cache invalidation, naming things, and off-by-one errors… and buffering is intrinsically tied to the first, which is why there is no silver bullet for buffering.

5 Likes

This might be of interest: in my database software I have a struct that buffers small reads, and it uses not one buffer but many. It works in conjunction with a struct that keeps a map of writes, meaning that a write doesn't require the read buffers to be adjusted (eventually they are all discarded when a commit occurs).

The same module also has write buffering (but that works a bit differently: the writes are all sorted by file address, so I only need one (large!) write buffer).
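Purely to make the shape of that design concrete, here is a very rough sketch with invented names (not the poster's actual code): several read buffers keyed by block offset, plus a map of pending writes that is checked first so writes never need to patch the read buffers.

```rust
use std::collections::HashMap;

struct PagedReader {
    block_size: u64,
    blocks: HashMap<u64, Vec<u8>>,  // block-aligned offset -> cached bytes
    pending: HashMap<u64, Vec<u8>>, // exact offset -> bytes queued for commit
}

impl PagedReader {
    fn read_byte(&self, offset: u64) -> Option<u8> {
        // A real implementation would handle overlapping pending writes; this
        // only checks exact-offset hits to keep the sketch short.
        if let Some(bytes) = self.pending.get(&offset) {
            return bytes.first().copied();
        }
        let block_start = offset - offset % self.block_size;
        self.blocks
            .get(&block_start)
            .and_then(|block| block.get((offset - block_start) as usize).copied())
    }

    fn commit(&mut self) {
        // Once the pending writes hit the file, the cached blocks are stale,
        // so they are all discarded rather than patched.
        self.pending.clear();
        self.blocks.clear();
    }
}

fn main() {
    let mut reader = PagedReader {
        block_size: 4096,
        blocks: HashMap::from([(0, vec![0xAB; 4096])]),
        pending: HashMap::new(),
    };
    assert_eq!(reader.read_byte(10), Some(0xAB));
    reader.commit();
    assert_eq!(reader.read_byte(10), None);
}
```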

Yes, so that you could add a BufWriter if necessary, and/or have a fast path for appending to a Vec directly (avoiding code bloat from io::Error).
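A sketch of what such a dual API could look like for a library type (Record, write_to, and append_to_vec are all invented names):

```rust
use std::io::{self, Write};

struct Record {
    id: u32,
    flags: u16,
}

impl Record {
    // General path: works with any writer (the caller may add a BufWriter),
    // but has to surface io::Error.
    fn write_to(&self, mut w: impl Write) -> io::Result<()> {
        w.write_all(&self.id.to_le_bytes())?;
        w.write_all(&self.flags.to_le_bytes())
    }

    // Fast path: appending to a Vec can't fail, so there's no io::Error
    // plumbing and no intermediate buffer at all.
    fn append_to_vec(&self, out: &mut Vec<u8>) {
        out.extend_from_slice(&self.id.to_le_bytes());
        out.extend_from_slice(&self.flags.to_le_bytes());
    }
}

fn main() -> io::Result<()> {
    let rec = Record { id: 7, flags: 0b10 };
    let mut bytes = Vec::new();
    rec.append_to_vec(&mut bytes);
    rec.write_to(io::sink())?; // any Write works for the general path
    assert_eq!(bytes.len(), 6);
    Ok(())
}
```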

2 Likes