Clarification of std::io::Read (with multi-stream gzip files and flate2)

Hello,

I'm a beginner with Rust and I'm trying to understand how to properly use the fn read(&mut self, buf: &mut [u8]) -> Result<usize> method of io::Read.

For what it's worth, in my program, this method is implemented by flate2::bufread::GzDecoder. I'm reading large gz archives, streaming lines one at a time. There are cases when the method returns Ok(0) before the entire file has been read. I'm having some trouble understanding the contract:

If the return value of this method is Ok(n), then it must be guaranteed that 0 <= n <= buf.len(). A nonzero n value indicates that the buffer buf has been filled in with n bytes of data from this source. If n is 0, then it can indicate one of two scenarios:

  1. This reader has reached its "end of file" and will likely no longer be able to produce bytes. Note that this does not mean that the reader will always no longer be able to produce bytes.
  2. The buffer specified was 0 bytes in length.

In the first case, the documentation says the reader could later be able to produce bytes again. How can you tell when that time comes? Do we need to implement some kind of polling? And how can you tell when the reader is no longer able to produce bytes, "for real" this time?
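For context, here's roughly how I'm consuming the stream (a minimal sketch; the file name is just a placeholder for my actual dump):

```rust
use std::fs::File;
use std::io::{BufRead, BufReader};

use flate2::bufread::GzDecoder;

fn main() -> std::io::Result<()> {
    // Placeholder name for the real archive.
    let file = File::open("dump.json.gz")?;
    let decoder = GzDecoder::new(BufReader::new(file));

    let mut count = 0u64;
    for line in BufReader::new(decoder).lines() {
        let _line = line?; // one decompressed line at a time
        count += 1;
    }
    // The loop stops as soon as read returns Ok(0) -- sometimes before the
    // whole archive appears to have been consumed.
    println!("read {} lines", count);
    Ok(())
}
```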

Thank you!

Cheers,
Radu

The implementor of Read is free to do whatever it pleases, so, in general, you can't tell. I assume std::fs::File implements Read::read with a POSIX read on Linux or Mac, so if you read a file from the start it will only return 0 once you reach the end of the file, and from then on it will keep returning 0 as long as reading is all you are doing. So it's always "for real". Unless, that is, you later do something like repositioning the file cursor with File::seek, after which read may again be able to read from the file.
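A small sketch of that last point (the path is a placeholder):

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

fn main() -> std::io::Result<()> {
    // Placeholder path to an ordinary, non-empty file.
    let mut file = File::open("example.txt")?;
    let mut buf = [0u8; 4096];

    // Drain the file: read returns Ok(0) once we reach the end.
    while file.read(&mut buf)? > 0 {}

    // Further reads keep returning Ok(0)...
    assert_eq!(file.read(&mut buf)?, 0);

    // ...until we move the cursor back, after which read can
    // produce bytes again.
    file.seek(SeekFrom::Start(0))?;
    let n = file.read(&mut buf)?;
    println!("read {} bytes after seeking back", n);
    Ok(())
}
```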

Getting notified when a file becomes ready for reading is async I/O territory, which the Rust standard library doesn't cover, so it's not something that should concern you when you're just reading a file with the standard library.

Alright, thanks for the clarification!

There are cases when the method returns Ok(0) before the entire file has been read.

Hmm. I think that sounds like a serious bug in flate2. Do you have example input that definitely causes this behavior? If so, please consider filing an issue.

/cc: @alexcrichton

Update: There's a comment in the flate2 source that seems to agree that this would be a bug. I've reviewed the code and I honestly can't see (yet) how it could be wrong. A test case would help.

(Bugs like this are not super rare. read seems straightforward, but it's deceptively tricky to use. Python's stdlib had a funny read bug.)

I wanted to be sure that I properly understand the API before drawing any conclusions. I also think that the problem is elsewhere.

@jorendorff Thanks for the link to the source code comment. I'll try to set up a very simple test case to reproduce this and file an issue. Unfortunately, in my own playground program, I'm only seeing this bug for the largest input file (~ 5.8 GB).

Cheers!

Oh dear, this does indeed sound like a bug! I'd definitely appreciate a bug report 🙂

In general, though, some I/O objects can hit EOF multiple times, but AFAIK that's limited to terminals. Each time you hit Ctrl-D at a terminal, any currently blocking read will return with a value of 0 bytes. You can, however, keep reading input from the terminal afterwards if desired.
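For illustration, a tiny interactive sketch of that terminal behavior (run it in a terminal and press Ctrl-D a few times):

```rust
use std::io::Read;

fn main() -> std::io::Result<()> {
    let mut stdin = std::io::stdin();
    let mut buf = [0u8; 1024];
    let mut eofs = 0;

    // Ctrl-D on an empty line makes the blocking read return Ok(0), but
    // you can type more input afterwards and the next read succeeds again.
    while eofs < 3 {
        let n = stdin.read(&mut buf)?;
        if n == 0 {
            eofs += 1;
            println!("0-byte read #{eofs} (Ctrl-D); reading again...");
        } else {
            println!("read {n} bytes");
        }
    }
    Ok(())
}
```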

For something like flate2, though, it's definitely a bug to hit EOF before the actual end of the file! Note, though, that gzip has done tricky things in the past which aren't necessarily a bug in flate2. You can literally concatenate a bunch of *.gz files into another *.gz file, and when you run it through gunzip it'll decompress each stream separately. By default flate2 expects the entire stream to be one gz stream, so if the file you're decompressing is a bunch of smaller files concatenated, that would at least explain why it looks like it's hitting EOF early. flate2 should have facilities for handling this; it just needs to be handled explicitly (see the sketch below).
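For reference, a minimal sketch of what handling that could look like with flate2's MultiGzDecoder (the file name is a placeholder):

```rust
use std::fs::File;
use std::io::{BufRead, BufReader};

use flate2::bufread::MultiGzDecoder;

fn main() -> std::io::Result<()> {
    // Placeholder name for a possibly multi-member archive.
    let file = File::open("dump.json.gz")?;
    // MultiGzDecoder keeps decoding across concatenated gzip members
    // instead of stopping after the first one.
    let decoder = MultiGzDecoder::new(BufReader::new(file));

    let mut count = 0u64;
    for line in BufReader::new(decoder).lines() {
        let _line = line?;
        count += 1;
    }
    println!("read {} lines across all gzip members", count);
    Ok(())
}
```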

Hi!

Thanks, I'll definitely file a bug report with a reproducible test case! The file giving me the issue is a Wikipedia dump: https://dumps.wikimedia.org/wikidatawiki/entities/20161212/
Unfortunately, I downloaded the file in January and I guess it's no longer available on the site. I have to see if I can reproduce it with the latest dump.

By the way, how can I see if a given gz file contains multiple streams?

Cheers,
Radu

IIRC wikis were one of those sources of concatenated gzip streams, so that may actually be what's happening here. In theory you can detect it because flate2 will report EOF while the underlying reader isn't actually at EOF yet (a rough sketch of that check is below). You can find some more information about handling it in this issue: Premature EOF on WARC files · Issue #41 · rust-lang/flate2-rs · GitHub
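Roughly, that check might look like this (only a sketch; it assumes using bufread::GzDecoder's get_mut accessor to peek at the inner reader once the decoder reports EOF):

```rust
use std::fs::File;
use std::io::{BufRead, BufReader, Read};

use flate2::bufread::GzDecoder;

fn main() -> std::io::Result<()> {
    // Placeholder path to a possibly multi-member gzip file.
    let file = File::open("dump.json.gz")?;
    let mut decoder = GzDecoder::new(BufReader::new(file));

    // Drain the first gzip member.
    let mut sink = Vec::new();
    decoder.read_to_end(&mut sink)?;

    // The decoder has reported EOF for this member; now peek at the inner
    // BufRead. If bytes remain, the file most likely contains further
    // concatenated gzip members.
    let leftover = !decoder.get_mut().fill_buf()?.is_empty();
    if leftover {
        println!("decoder hit EOF but the file has more data: multi-member gzip?");
    } else {
        println!("decoder EOF coincides with the real end of the file");
    }
    Ok(())
}
```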

I guess this explains it.

When I extract a subset of lines from the big file and compress it again as gz, everything works fine.

Well, this seems a bit harsh. RFC 1952 describes the GZIP file format, and in section 2.2 it clearly states:

  A gzip file consists of a series of "members" (compressed data sets). [...] The members simply appear one after another in the file, with no additional information before, between, or after them.

So a multi-member gzip file is not a trick but a valid alternative to a single stream.
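To close the loop, here's a small self-contained check I'd base a test case on (a sketch using flate2's GzEncoder, GzDecoder and MultiGzDecoder; if I read the docs right, the plain decoder stops after the first member while the multi-member decoder reads both):

```rust
use std::io::{Read, Write};

use flate2::read::{GzDecoder, MultiGzDecoder};
use flate2::write::GzEncoder;
use flate2::Compression;

// Compress a byte slice into a single gzip member.
fn gz(data: &[u8]) -> Vec<u8> {
    let mut enc = GzEncoder::new(Vec::new(), Compression::default());
    enc.write_all(data).unwrap();
    enc.finish().unwrap()
}

fn main() {
    // Two independent gzip members, simply concatenated (as RFC 1952 allows).
    let mut archive = gz(b"hello ");
    archive.extend_from_slice(&gz(b"world"));

    // Plain GzDecoder: expected to stop at the end of the first member.
    let mut first_only = String::new();
    GzDecoder::new(&archive[..])
        .read_to_string(&mut first_only)
        .unwrap();
    println!("GzDecoder:      {:?}", first_only);

    // MultiGzDecoder: expected to keep going across members.
    let mut all = String::new();
    MultiGzDecoder::new(&archive[..])
        .read_to_string(&mut all)
        .unwrap();
    println!("MultiGzDecoder: {:?}", all);
}
```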