I'm trying to optimise a loop that reads lines from compressed tar archives. Currently I'm using `flate2` (with `zlib-rs`) and the `tar` crate.
- I then put a `BufReader` on top and use `BufRead::read_until` to read lines (unlike `read_line`, this doesn't require UTF-8). I reuse a single allocated buffer between loop iterations to avoid allocations. (See the sketch after this list.)
- I then apply a filter that quickly discards about 96% of the lines from further processing (they contain things I'm not interested in).
- Only then do I do UTF-8 validation and copying into other data structures, on the remaining ~4% of lines.
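For concreteness, here's roughly what the current loop looks like (a minimal sketch; `looks_interesting` and `process` are stand-ins for my real filter and downstream handling, and I'm assuming gzip compression here):

```rust
use std::io::{BufRead, BufReader, Read};

use flate2::read::GzDecoder;
use tar::Archive;

// Rough shape of the current pipeline: decompress, walk the tar entries,
// read lines into one reused Vec, filter cheaply, validate UTF-8 last.
fn scan(compressed: impl Read) -> std::io::Result<()> {
    let mut archive = Archive::new(GzDecoder::new(compressed));
    let mut line = Vec::new(); // reused across iterations and entries
    for entry in archive.entries()? {
        // A fresh BufReader per entry: its internal buffer is a new
        // allocation each time, which is inefficiency 1 below.
        let mut reader = BufReader::new(entry?);
        loop {
            line.clear();
            // read_until copies from the internal buffer into `line`,
            // which is inefficiency 2 below.
            if reader.read_until(b'\n', &mut line)? == 0 {
                break; // end of this entry
            }
            // Cheap filter discards ~96% of lines.
            if !looks_interesting(&line) {
                continue;
            }
            // Only the surviving ~4% pay for UTF-8 validation and copying.
            if let Ok(text) = std::str::from_utf8(&line) {
                process(text);
            }
        }
    }
    Ok(())
}

fn looks_interesting(line: &[u8]) -> bool {
    // placeholder for the real byte-level filter
    line.starts_with(b"interesting")
}

fn process(_text: &str) {
    // placeholder for the real handling
}
```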
I have found two inefficiencies when profiling:
1. I cannot reuse the allocation of `BufReader`'s internal buffer between entries in the tar archives. There are many small files (plus a few huge ones), so this matters (though much less so than the next point).
2. `read_until` (as well as `read_line`) copies data into the buffer I provide. This copying dominates my processing (apart from decompression, and there is little I can do about that other than using the best implementation and build flags possible).
There is the lower-level `fill_buf` API, but with it you give up a lot of nice things (handling lines longer than the buffer, handling lines crossing the edge of the current buffer, etc.). My lines tend to be short, so lines longer than the buffer wouldn't be a big concern (I could fall back to a slow path), but lines crossing the buffer edge are. It also doesn't solve point 1.
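For what it's worth, the edge-straddling case can be handled on top of `fill_buf` by inverting control: borrow lines straight out of the internal buffer when they fit, and copy into a scratch `Vec` only when a line crosses a refill boundary. A rough, untested sketch of the idea (the callback sidesteps the borrow-checker trouble of returning borrows from inside the refill loop):

```rust
use std::io::{self, BufRead};

// Zero-copy for lines fully inside the buffer; copies into `scratch` only
// when a line straddles a refill boundary. `scratch` can be reused across
// entries, but the BufReader's own buffer still can't (point 1 stands).
fn for_each_line<R: BufRead>(
    reader: &mut R,
    scratch: &mut Vec<u8>,
    mut f: impl FnMut(&[u8]) -> io::Result<()>,
) -> io::Result<()> {
    scratch.clear();
    loop {
        let consumed = {
            let buf = reader.fill_buf()?;
            if buf.is_empty() {
                // EOF: flush a trailing unterminated line, if any.
                if !scratch.is_empty() {
                    f(scratch.as_slice())?;
                    scratch.clear();
                }
                return Ok(());
            }
            // (The memchr crate would do this scan faster.)
            match buf.iter().position(|&b| b == b'\n') {
                Some(pos) => {
                    if scratch.is_empty() {
                        // Fast path: whole line is in view, no copy.
                        f(&buf[..pos])?;
                    } else {
                        // Slow path: finish a line that crossed the edge.
                        scratch.extend_from_slice(&buf[..pos]);
                        f(scratch.as_slice())?;
                        scratch.clear();
                    }
                    pos + 1
                }
                None => {
                    // No newline in view: stash the partial line, refill.
                    scratch.extend_from_slice(buf);
                    buf.len()
                }
            }
        };
        reader.consume(consumed);
    }
}
```

The callback style is clunkier than an iterator, though, and it still doesn't address point 1.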
My search for a crate that solves this, without my having to roll a buffered reader from scratch and deal with all the fiddly low-level details myself, has been fruitless.
Ideally I'd like a fallible `read_until` (one that fails if a line won't fit in the buffer) that returns a borrow of the buffer; if a line happens to straddle the end of the buffer, it would copy the partial line to the beginning and then refill the rest of the buffer after it.
It should ideally also be possible to reset the buffer such that the allocation can be reused between archive entries.
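To make that concrete, the shape I have in mind is roughly the following (a hand-rolled, untested sketch, not from any crate; `LineBuf` is a made-up name and the error handling is crude):

```rust
use std::io::{self, Read};

// Sketch of the reader I wish existed: lines come back as borrows of the
// internal buffer, a line straddling the end is shifted to the front before
// refilling, and reset() lets one allocation serve many archive entries.
struct LineBuf {
    buf: Vec<u8>, // fixed-capacity storage
    start: usize, // start of unconsumed data
    end: usize,   // end of valid data
}

impl LineBuf {
    fn with_capacity(cap: usize) -> Self {
        LineBuf { buf: vec![0; cap], start: 0, end: 0 }
    }

    /// Drop buffered data so the allocation can be reused for the next entry.
    fn reset(&mut self) {
        self.start = 0;
        self.end = 0;
    }

    /// Next line as a borrow of the internal buffer, without the trailing
    /// `\n`. Ok(None) at EOF; fails if a line is longer than the buffer.
    fn next_line(&mut self, mut reader: impl Read) -> io::Result<Option<&[u8]>> {
        let mut searched = 0; // bytes already scanned for '\n'
        loop {
            if let Some(i) = self.buf[self.start + searched..self.end]
                .iter()
                .position(|&b| b == b'\n')
            {
                let line_start = self.start;
                let line_end = self.start + searched + i;
                self.start = line_end + 1;
                return Ok(Some(&self.buf[line_start..line_end]));
            }
            searched = self.end - self.start;

            if self.start > 0 {
                // Shift the partial line to the front to make room.
                self.buf.copy_within(self.start..self.end, 0);
                self.end -= self.start;
                self.start = 0;
            } else if self.end == self.buf.len() {
                // Partial line already fills the whole buffer: the fallible part.
                return Err(io::Error::new(
                    io::ErrorKind::InvalidData,
                    "line longer than buffer",
                ));
            }

            // Refill the free tail of the buffer.
            let n = reader.read(&mut self.buf[self.end..])?;
            if n == 0 {
                // EOF: hand back a trailing unterminated line, if any.
                if self.start == self.end {
                    return Ok(None);
                }
                let line = &self.buf[self.start..self.end];
                self.start = self.end;
                return Ok(Some(line));
            }
            self.end += n;
        }
    }
}
```

Per entry it would be used as `lines.reset();` followed by `while let Some(line) = lines.next_line(&mut entry)? { ... }`, with the filter and UTF-8 validation running on the borrowed `line` directly.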
I would absolutely consider `mmap` for this normally, but that won't help since the data I'm reading is compressed.