Seek for deflate and bzip

Is there a library for gzip and bzip2 that supports Seek operations?

I understand that seeking to an arbitrary position in a compressed stream is not trivial (in the sense of O(1)).
My use case is that I want to read something similar to a tgz archive,
and I want to jump to the start of an arbitrary file entry in the archive
without re-inflating all the contents that precede the file I want to seek to.

One possible implementation, in pseudocode (read_entry and skip_contents are hypothetical helpers):

let mut tar = GzipReader::new(&file);
let mut entries = vec![];
for _ in 0..num_files {
    // Read the entry header; the stream is now positioned at its contents.
    let mut entry = read_entry(&mut tar);
    // Snapshot the raw file offset and clone the decoder state right here,
    // so decompression can later resume at this entry's contents.
    entry.seek = file.seek(SeekFrom::Current(0))?;
    entry.read_state = tar.clone();
    skip_contents(&mut tar, entry.file_size); // advance to the next header
    entries.push(entry);
}

fn read_file(index: usize) -> impl Read {
    let entry = &entries[index];
    // Rewind the raw file to where this entry's data begins,
    // then resume from the cloned decompressor state.
    file.seek(SeekFrom::Start(entry.seek))?;
    entry.read_state.clone().take(entry.file_size)
}

read_file then effectively seeks, below the gzip layer, to the offset where the entry starts, and resumes decompression from the saved state.

However, it appears that the current compression crates (flate2 and bzip2) do not support Clone on their decoder types like this; and the shared nature of std::fs::File (reading through a &File) is not compatible with all Read implementations either.

Is there a more idiomatic way to solve this problem? Are there compression libraries dedicated to solving this?

It's not entirely clear to me what you're trying to do.

In a gzip stream you have to inflate everything from the start to the point you want to seek to. You could cache already-inflated data in order to seek quickly to it next time, but you can't avoid inflating from the start at least once.

If you have enough memory to store the whole archive in memory, the easiest solution is to ungzip the whole thing to a Vec, and then use io::Cursor to have a seekable and readable container.
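
For illustration, a minimal sketch of that approach using flate2's GzDecoder (the archive path and the seek target are placeholders):

use std::fs::File;
use std::io::{self, Cursor, Read, Seek, SeekFrom};
use flate2::read::GzDecoder;

fn main() -> io::Result<()> {
    // Inflate the whole stream into memory once...
    let mut decoder = GzDecoder::new(File::open("archive.tgz")?);
    let mut data = Vec::new();
    decoder.read_to_end(&mut data)?;

    // ...then seek freely over the uncompressed bytes.
    let mut cursor = Cursor::new(data);
    cursor.seek(SeekFrom::Start(512))?; // e.g. skip one tar header block
    Ok(())
}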

If you want an archive with the ability to quickly decompress arbitrary files, use zip or a zip-like approach where every file is compressed individually.
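
For instance, with the zip crate (the archive path and entry name here are made up), reading one entry doesn't require inflating the others:

use std::fs::File;
use std::io::Read;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut archive = zip::ZipArchive::new(File::open("archive.zip")?)?;
    // Every entry is compressed independently, so this jumps straight to
    // `bar`'s data without inflating anything stored before it.
    let mut entry = archive.by_name("bar")?;
    let mut contents = Vec::new();
    entry.read_to_end(&mut contents)?;
    Ok(())
}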

I understand that I have to inflate everything at least once. But I don't want to inflate everything O(n) times.
Take a tar archive, for example. I can scan the whole archive once to construct a name => start-position map. Then I might start using the data in a file called foo, which tells me to load data from another file called bar. The tar might be laid out like this:

{bar header}
{foo header}
{some unused stuff here}
{bar contents}
{foo contents}

I would first read everything, then I would seek to {foo contents}, then I would seek to {bar contents}.

I don't want every such seek to inflate everything again. In particular, {some unused stuff here} should not be inflated every time I want to seek to another file.

Use read_to_end() to consume everything from the gzip reader into a Vec, discard the gzip reader, and from then on use only the uncompressed Vec (wrapped in io::Cursor) as your only data source.

Alternatively, decompress the whole tar.gz in one go, and keep the uncompressed files somewhere (in a Vec or HashMap) for "seeking" them later.
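
A sketch of that one-pass approach, assuming the tar and flate2 crates (load_all and the path are illustrative):

use std::collections::HashMap;
use std::fs::File;
use std::io::{self, Read};
use flate2::read::GzDecoder;
use tar::Archive;

// One full pass over the .tgz; afterwards every file is addressable by name.
fn load_all(path: &str) -> io::Result<HashMap<String, Vec<u8>>> {
    let mut archive = Archive::new(GzDecoder::new(File::open(path)?));
    let mut files = HashMap::new();
    for entry in archive.entries()? {
        let mut entry = entry?;
        let name = entry.path()?.to_string_lossy().into_owned();
        let mut contents = Vec::new();
        entry.read_to_end(&mut contents)?;
        files.insert(name, contents);
    }
    Ok(files)
}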

Basically the solution to lack of seekability in gzip is to completely get rid of gzip.

If you want to do anything smarter than a naive full decompression that keeps all the uncompressed data around, then I'm afraid the high-level crates can't do it.

bzip2 divides the stream into ~900 KB blocks, so perhaps these could be skippable. I don't know enough about this format to advise.

gzip has only a 32 KB back-reference window, but that's 32 KB of decoded data that could have been copied from an earlier 32 KB window, which could have been copied from an earlier 32 KB, and so on. In theory you could try writing a custom gzip decoder that finds the nearest block to decode and then works backwards to find how many previous blocks it needs to decode to reconstruct the correct data. I don't know if such a decoder even exists.

We have this problem at work. If you can spend the time to scan the files first and save extra information, it's sort-of possible.

For GZIP there isn't a block structure (or at least not one that's exposed to the caller), but you can save the 32 KB window that @kornel mentioned along with the byte offsets into the compressed and uncompressed data. Then you can seek to the right position in the compressed file, restore the 32 KB window, and begin decompression from there. The limit of this technique is that you have to decompress the whole file once to save checkpoints at the offsets you want to seek to.
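
For reference, a rough sketch of what each saved checkpoint would have to carry (field names are illustrative):

/// Illustrative sketch: one random-access checkpoint into a gzip stream.
struct Checkpoint {
    /// Offset into the compressed file where decoding resumes.
    compressed_offset: u64,
    /// Bit position (0-7) within that byte, since deflate blocks
    /// need not start on a byte boundary.
    bit_offset: u8,
    /// Corresponding position in the uncompressed stream.
    uncompressed_offset: u64,
    /// The last 32 KB of uncompressed output at that point,
    /// used to re-seed the back-reference window before resuming.
    window: [u8; 32 * 1024],
}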

For bz2 it's harder because bz2 doesn't expose the window the way gzip does. The block boundaries also don't have to be aligned to whole bytes; they can be offset 0-7 bits from the byte boundary. On top of that there's a running checksum over the whole stream that will raise an error if you try to skip data. The only way I was able to make it work was to peek at the decompressor's private state (which is easier to do in C) and save that at each block boundary.

In either case you'll have to implement a bunch of custom code to make it work; arbitrary seeking is not something that most compression algorithms are designed to provide.


Unfortunately that's the file format I need to deal with. Not something I could change.

That allocates the whole buffer in memory. That's something I would want to avoid. I might store it in a temp file instead.

Perhaps a better solution would be to store a 32 KB window snapshot for every few megabytes of the file or so?
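
Something along these lines might work for the scanning pass with flate2 (SPAN and the path are placeholders); note that actually resuming decompression at a snapshot would additionally need the compressed-stream bit position, which flate2 doesn't expose, so this sketch only covers the recording side:

use std::fs::File;
use std::io::{self, Read};
use flate2::read::GzDecoder;

const WINDOW: usize = 32 * 1024;   // deflate back-reference window
const SPAN: u64 = 4 * 1024 * 1024; // hypothetical snapshot spacing

struct Snapshot {
    uncompressed_offset: u64, // where in the decoded stream this was taken
    window: Vec<u8>,          // last 32 KB of decoded output at that point
}

// Single full decompression pass that records a window snapshot every SPAN bytes.
fn build_index(path: &str) -> io::Result<Vec<Snapshot>> {
    let mut decoder = GzDecoder::new(File::open(path)?);
    let mut index = Vec::new();
    let mut history: Vec<u8> = Vec::new(); // trailing decoded bytes, capped at WINDOW
    let mut total: u64 = 0;
    let mut next_snapshot = SPAN;
    let mut buf = [0u8; 8192];
    loop {
        let n = decoder.read(&mut buf)?;
        if n == 0 {
            break;
        }
        total += n as u64;
        history.extend_from_slice(&buf[..n]);
        if history.len() > WINDOW {
            let excess = history.len() - WINDOW;
            history.drain(..excess);
        }
        if total >= next_snapshot {
            index.push(Snapshot {
                uncompressed_offset: total,
                window: history.clone(),
            });
            next_snapshot += SPAN;
        }
    }
    Ok(index)
}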

Hi, maybe this can help: https://rushter.com/blog/gzip-indexing/

They built an index with a checkpoint for each chunk. It's implemented in Python.
