BufReader::read_until without memcpy?

I'm trying to optimise a loop that reads lines from compressed tar archives. Currently I'm using flate2 (with zlib-rs) and the tar crate.

  • I then use a BufReader on top and use BufRead::read_until to read lines (unlike read_line this doesn't need UTF8). I reuse a single allocation buffer between loop iterations to avoid allocations.
  • I then apply a filter that quickly discards about 96% of the lines, so they need no further processing (they contain things I'm not interested in).
  • Only then do I do UTF8 validation and copying into other data structures on the remaining ~4% of lines.
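For reference, the loop described above could be sketched roughly like this. It is std-only and generic over any BufRead, so the flate2/tar plumbing is left out. The function name collect_interesting, the keep predicate, and the choice to silently skip lines that are not valid UTF-8 are assumptions made for illustration.

```rust
use std::io::{self, BufRead};

/// Reads `\n`-terminated lines, filters on raw bytes first, and only
/// validates UTF-8 for the lines that survive the filter.
pub fn collect_interesting<R: BufRead>(
    mut reader: R,
    keep: impl Fn(&[u8]) -> bool,
) -> io::Result<Vec<String>> {
    let mut line = Vec::new(); // one allocation, reused for every line
    let mut out = Vec::new();
    loop {
        line.clear();
        // read_until copies the line out of the reader's internal buffer.
        if reader.read_until(b'\n', &mut line)? == 0 {
            break; // end of input
        }
        if !keep(&line) {
            continue; // cheap rejection before any UTF-8 work
        }
        // Assumption: lines that are not valid UTF-8 are skipped.
        if let Ok(s) = std::str::from_utf8(&line) {
            out.push(s.trim_end_matches('\n').to_owned());
        }
    }
    Ok(out)
}
```

Every line, including the rejected ones, goes through the copy inside read_until; that copy is the cost point 2 below is about.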

I have found two inefficiencies when profiling:

  1. I cannot reuse the allocation of the internal buffer in BufReader between entries in the tar archives. There are many small files (plus a few huge ones), so this matters (though much less so than the next point).
  2. read_until (as well as read_line) copies data to the buffer I provide. This dominates my processing (apart from decompression, and there is little I can do about that other than using the best implementation and build flags possible).

There is the lower-level fill_buf API, but then you give up a lot of nice things (handling lines longer than the buffer, handling lines crossing the edge of the current buffer, etc). My lines tend to be short, so lines longer than the buffer wouldn't be a big concern (I could fall back on a slow path), but lines crossing the buffer edge are. Also, it doesn't solve 1.

My search for a crate that solves this without having to roll my own buf reader from scratch and deal with all the fiddly low level details myself has been fruitless.

Ideally I'd like a fallible read_until (one that fails if the line won't fit in the buffer) that returns a borrow of the buffer. If the line happens to straddle the end of the buffer, it would copy the partial line to the beginning and then refill the buffer after it.

It should ideally also be possible to reset the buffer such that the allocation can be reused between archive entries.
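A minimal sketch of what such a reader might look like, using only std. The type LineBuf and its methods with_capacity, reset and next_line are hypothetical names invented here, not an existing crate API. It returns a borrow into its own buffer, moves a straddling partial line to the front before refilling, and fails if a line does not fit.

```rust
use std::io::{self, Read};

/// Hypothetical fixed-capacity line reader that hands out borrowed lines.
pub struct LineBuf<R> {
    inner: R,
    buf: Vec<u8>,
    start: usize, // first unread byte
    end: usize,   // one past the last valid byte
    eof: bool,
}

impl<R: Read> LineBuf<R> {
    pub fn with_capacity(cap: usize, inner: R) -> Self {
        LineBuf { inner, buf: vec![0; cap], start: 0, end: 0, eof: false }
    }

    /// Swaps in a new source and drops buffered data, keeping the allocation.
    pub fn reset(&mut self, inner: R) -> R {
        self.start = 0;
        self.end = 0;
        self.eof = false;
        std::mem::replace(&mut self.inner, inner)
    }

    /// Returns the next line (delimiter included) as a borrow of the buffer,
    /// Ok(None) at end of input, or an error if the line does not fit.
    pub fn next_line(&mut self, delim: u8) -> io::Result<Option<&[u8]>> {
        let mut scanned = self.start; // bytes before this hold no delimiter
        let line_end = loop {
            if let Some(i) = self.buf[scanned..self.end].iter().position(|&b| b == delim) {
                break scanned + i + 1;
            }
            if self.eof {
                if self.start == self.end {
                    return Ok(None);
                }
                break self.end; // final line without a delimiter
            }
            if self.end == self.buf.len() {
                if self.start == 0 {
                    return Err(io::Error::new(
                        io::ErrorKind::InvalidData,
                        "line longer than buffer",
                    ));
                }
                // Move the partial line to the front to make room.
                self.buf.copy_within(self.start..self.end, 0);
                self.end -= self.start;
                self.start = 0;
            }
            scanned = self.end;
            let n = loop {
                match self.inner.read(&mut self.buf[self.end..]) {
                    Ok(n) => break n,
                    Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
                    Err(e) => return Err(e),
                }
            };
            if n == 0 {
                self.eof = true;
            }
            self.end += n;
        };
        let line = &self.buf[self.start..line_end];
        self.start = line_end;
        Ok(Some(line))
    }
}
```

The only copy left is the copy_within for a line that straddles the end of the buffer. A caller that hits the "line longer than buffer" error would have to fall back on a slower path.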

I would absolutely consider mmap for this normally, but that won't help since the data I'm reading is compressed.

BufReader provides mutable access to its inner reader (get_mut), so you can just overwrite it (make sure nothing is left in the buffer first).

Ah, I guess I could mem::swap in the new underlying reader; I didn't think of that. Thanks!

That solves point 1, though the more pressing point 2 remains.
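For point 1, the swap could look roughly like this. The helper name swap_reader is made up for illustration. It assumes both sources have the same type R, and it throws away any bytes still buffered from the previous source.

```rust
use std::io::{BufRead, BufReader, Read};

/// Replaces the inner reader of `reader` with `next`, keeping the BufReader's
/// buffer allocation. Returns the previous inner reader.
pub fn swap_reader<R: Read>(reader: &mut BufReader<R>, next: R) -> R {
    // Discard bytes still buffered from the old source so they are not
    // mistaken for data from the new one.
    let leftover = reader.buffer().len();
    reader.consume(leftover);
    std::mem::replace(reader.get_mut(), next)
}
```

Whether this fits the tar crate's entry types and lifetimes is something I have not checked.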

I don't think you will be able to solve it without using fill_buf().

Seems I found something myself, thanks to an offhand side note in the README of a crate that didn't quite do what I need (linereader). Apparently bstr has a solution for this. The closure-based API is not ideal for me, but I could probably make it work. I will give it a go tomorrow.
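To show the general shape of a closure-based approach built on fill_buf, here is a std-only sketch. It is not bstr's code. The function for_each_line is a hypothetical name, and the convention that the closure returns Ok(false) to stop early is an assumption chosen for this example. Lines that sit entirely inside the reader's buffer are passed to the closure as borrowed slices, and only a line that crosses a buffer edge is copied into a spill Vec.

```rust
use std::io::{self, BufRead};

/// Calls `f` with every line (delimiter included). Return Ok(false) from `f`
/// to stop early.
pub fn for_each_line<R, F>(mut reader: R, delim: u8, mut f: F) -> io::Result<()>
where
    R: BufRead,
    F: FnMut(&[u8]) -> io::Result<bool>,
{
    let mut spill = Vec::new(); // only used for lines crossing a buffer edge
    loop {
        let mut consumed = 0;
        let mut stop = false;
        {
            let mut buf = reader.fill_buf()?;
            // Fast path: complete lines are handed out without copying.
            while let Some(i) = buf.iter().position(|&b| b == delim) {
                let (line, rest) = buf.split_at(i + 1);
                consumed += line.len();
                buf = rest;
                if !f(line)? {
                    stop = true;
                    break;
                }
            }
            if !stop {
                // Whatever is left is the start of a line that continues
                // past the end of the current buffer (or is empty).
                spill.extend_from_slice(buf);
                consumed += buf.len();
            }
        }
        reader.consume(consumed);
        if stop {
            return Ok(());
        }
        // Slow path: finish the partial line with a copying read_until.
        reader.read_until(delim, &mut spill)?;
        if spill.is_empty() {
            return Ok(()); // end of input
        }
        if !f(&spill)? {
            return Ok(());
        }
        spill.clear();
    }
}
```

With short lines and a reasonably large buffer, most lines should take the fast path. The byte-by-byte position search is just the simplest thing that works in std; a SIMD search such as memchr would likely be faster.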

(Thanks BurntSushi for all your amazing crates. If only discoverability were better, but that is mostly a Rustdoc problem, along with the generally poor usability of Google and similar search engines these days.)