Clarification of std::io::Read (with multi-stream gzip files and flate2)

Hello,

I'm a beginner with Rust and I'm trying to understand how to properly use the fn read(&mut self, buf: &mut [u8]) -> Result<usize> method of io::Read.

For what it's worth, in my program, this method is implemented by flate2::bufread::GzDecoder. I'm reading large gz archives, streaming lines one at a time. There are cases when the method returns Ok(0) before the entire file has been read. I'm having some trouble understanding the contract:

If the return value of this method is Ok(n), then it must be guaranteed that 0 <= n <= buf.len(). A nonzero n value indicates that the buffer buf has been filled in with n bytes of data from this source. If n is 0, then it can indicate one of two scenarios:

  1. This reader has reached its "end of file" and will likely no longer be able to produce bytes. Note that this does not mean that the reader will always no longer be able to produce bytes.
  2. The buffer specified was 0 bytes in length.

In the first case, the documentation says the reader could later be able to produce bytes again. How can you tell when that time comes? Do we need to implement some kind of polling? And how can you tell when the reader is no longer able to produce bytes, "for real" this time?
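For context, here's roughly how I'm consuming the stream (a minimal sketch; the file name is just a placeholder for my actual dump):

```rust
use std::fs::File;
use std::io::{BufRead, BufReader};

use flate2::bufread::GzDecoder;

fn main() -> std::io::Result<()> {
    // Placeholder name for the real archive.
    let file = File::open("dump.json.gz")?;
    let decoder = GzDecoder::new(BufReader::new(file));

    let mut count = 0u64;
    for line in BufReader::new(decoder).lines() {
        let _line = line?; // one decompressed line at a time
        count += 1;
    }
    // The loop stops as soon as read returns Ok(0) -- sometimes before the
    // whole archive appears to have been consumed.
    println!("read {} lines", count);
    Ok(())
}
```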

Thank you!

Cheers,
Radu

The implementor of Read is free to do whatever it pleases, so, in general, you can't tell. I assume std::fs::File implements Read::read with a POSIX read on Linux or Mac, so if you read a file from the start it will only return 0 once you reach the end of the file, and from then on it will keep returning 0 as long as reading is all you are doing. So it's always "for real". Unless, that is, you later do something like repositioning the file cursor with File::seek, after which read may again be able to read from the file.
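A small sketch of that last point (the path is a placeholder):

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

fn main() -> std::io::Result<()> {
    // Placeholder path to an ordinary, non-empty file.
    let mut file = File::open("example.txt")?;
    let mut buf = [0u8; 4096];

    // Drain the file: read returns Ok(0) once we reach the end.
    while file.read(&mut buf)? > 0 {}

    // Further reads keep returning Ok(0)...
    assert_eq!(file.read(&mut buf)?, 0);

    // ...until we move the cursor back, after which read can
    // produce bytes again.
    file.seek(SeekFrom::Start(0))?;
    let n = file.read(&mut buf)?;
    println!("read {} bytes after seeking back", n);
    Ok(())
}
```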

Getting notified when a file becomes ready for reading is async I/O territory, which the Rust standard library doesn't cover, so it's not something that should concern you when you're just reading a file with the standard library.

Alright, thanks for the clarification!

There are cases when the method returns Ok(0) before the entire file has been read.

Hmm. I think that sounds like a serious bug in flate2. Do you have example input that definitely causes this behavior? If so, please consider filing an issue.

/cc: @alexcrichton

Update: There's a comment in the flate2 source that seems to agree that this would be a bug. I've reviewed the code and I honestly can't see (yet) how it could be wrong. A test case would help.

(Bugs like this are not super rare. read seems straightforward, but it's deceptively tricky to use. Python's stdlib had a funny read bug.)

I wanted to be sure that I properly understand the API before drawing any conclusions. I also think that the problem is elsewhere.

@jorendorff Thanks for the link to the source code comment. I'll try to set up a very simple test case to reproduce this and file an issue. Unfortunately, in my own playground program, I'm only seeing this bug for the largest input file (~ 5.8 GB).

Cheers!

Oh dear, this does indeed sound like a bug! I'd definitely appreciate a bug report 🙂

In general, though, some I/O objects can hit EOF multiple times, but AFAIK that's limited to terminals. Each time you hit Ctrl-D at a terminal, any currently blocking read will return with a value of 0 bytes. You can, however, keep reading input from the terminal afterwards if desired.
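For illustration, a tiny interactive sketch of that terminal behavior (run it in a terminal and press Ctrl-D a few times):

```rust
use std::io::Read;

fn main() -> std::io::Result<()> {
    let mut stdin = std::io::stdin();
    let mut buf = [0u8; 1024];
    let mut eofs = 0;

    // Ctrl-D on an empty line makes the blocking read return Ok(0), but
    // you can type more input afterwards and the next read succeeds again.
    while eofs < 3 {
        let n = stdin.read(&mut buf)?;
        if n == 0 {
            eofs += 1;
            println!("0-byte read #{eofs} (Ctrl-D); reading again...");
        } else {
            println!("read {n} bytes");
        }
    }
    Ok(())
}
```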

For something like flate2, though, it's definitely a bug to hit EOF before the actual end of the file! Note, though, that gzip has done tricky things in the past which aren't necessarily a bug in flate2. You can literally concatenate a bunch of *.gz files into another *.gz file, and when you run it through gunzip it'll decompress each stream separately. By default flate2 expects the entire stream to be one gz stream, so if the file you're decompressing is a bunch of smaller files concatenated, that would at least explain why it looks like it's hitting EOF early. flate2 should have facilities for handling this; it just needs to be handled explicitly (see the sketch below).
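For reference, a minimal sketch of what handling that could look like with flate2's MultiGzDecoder (the file name is a placeholder):

```rust
use std::fs::File;
use std::io::{BufRead, BufReader};

use flate2::bufread::MultiGzDecoder;

fn main() -> std::io::Result<()> {
    // Placeholder name for a possibly multi-member archive.
    let file = File::open("dump.json.gz")?;
    // MultiGzDecoder keeps decoding across concatenated gzip members
    // instead of stopping after the first one.
    let decoder = MultiGzDecoder::new(BufReader::new(file));

    let mut count = 0u64;
    for line in BufReader::new(decoder).lines() {
        let _line = line?;
        count += 1;
    }
    println!("read {} lines across all gzip members", count);
    Ok(())
}
```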

Hi!

Thanks, I'll definitely file a bug report with a reproducible test case! The file giving me the issue is a Wikipedia dump: https://dumps.wikimedia.org/wikidatawiki/entities/20161212/
Unfortunately, I downloaded the file in January and I guess it's no longer available on the site. I have to see if I can reproduce it with the latest dump.

By the way, how can I see if a given gz file contains multiple streams?

Cheers,
Radu

IIRC wikis were one of those sources of concatenated gzip streams, so that may actually be what's happening here. In theory you can detect it because flate2 will report EOF while the underlying reader isn't actually at EOF yet (a rough sketch of that check is below). You can find some more information about handling it in this issue: Premature EOF on WARC files · Issue #41 · rust-lang/flate2-rs · GitHub
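Roughly, that check might look like this (only a sketch; it assumes using bufread::GzDecoder's get_mut accessor to peek at the inner reader once the decoder reports EOF):

```rust
use std::fs::File;
use std::io::{BufRead, BufReader, Read};

use flate2::bufread::GzDecoder;

fn main() -> std::io::Result<()> {
    // Placeholder path to a possibly multi-member gzip file.
    let file = File::open("dump.json.gz")?;
    let mut decoder = GzDecoder::new(BufReader::new(file));

    // Drain the first gzip member.
    let mut sink = Vec::new();
    decoder.read_to_end(&mut sink)?;

    // The decoder has reported EOF for this member; now peek at the inner
    // BufRead. If bytes remain, the file most likely contains further
    // concatenated gzip members.
    let leftover = !decoder.get_mut().fill_buf()?.is_empty();
    if leftover {
        println!("decoder hit EOF but the file has more data: multi-member gzip?");
    } else {
        println!("decoder EOF coincides with the real end of the file");
    }
    Ok(())
}
```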

I guess this explains it.

When I extract a subset of lines from the big file and compress it again as gz, everything works fine.

Well, this seems a bit harsh. RFC 1952 describes the GZIP file format, and in section 2.2 it clearly states:

  A gzip file consists of a series of "members" (compressed data sets). [...] The members simply appear one after another in the file, with no additional information before, between, or after them.

So a multi-member gzip file is not a trick but a valid alternative to a single stream.
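To close the loop, here's a small self-contained check I'd base a test case on (a sketch using flate2's GzEncoder, GzDecoder and MultiGzDecoder; if I read the docs right, the plain decoder stops after the first member while the multi-member decoder reads both):

```rust
use std::io::{Read, Write};

use flate2::read::{GzDecoder, MultiGzDecoder};
use flate2::write::GzEncoder;
use flate2::Compression;

// Compress a byte slice into a single gzip member.
fn gz(data: &[u8]) -> Vec<u8> {
    let mut enc = GzEncoder::new(Vec::new(), Compression::default());
    enc.write_all(data).unwrap();
    enc.finish().unwrap()
}

fn main() {
    // Two independent gzip members, simply concatenated (as RFC 1952 allows).
    let mut archive = gz(b"hello ");
    archive.extend_from_slice(&gz(b"world"));

    // Plain GzDecoder: expected to stop at the end of the first member.
    let mut first_only = String::new();
    GzDecoder::new(&archive[..])
        .read_to_string(&mut first_only)
        .unwrap();
    println!("GzDecoder:      {:?}", first_only);

    // MultiGzDecoder: expected to keep going across members.
    let mut all = String::new();
    MultiGzDecoder::new(&archive[..])
        .read_to_string(&mut all)
        .unwrap();
    println!("MultiGzDecoder: {:?}", all);
}
```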