The best interface for transforming byte streams?

Hi all, Rust newcomer here. I am trying to find the best interface for defining composable transformations on byte streams. A decoder of a hex-encoded byte stream is a good example because it is simple but has the following features in common with more involved examples:

  • A possibility of invalid input (such as characters other than 0-9,a-f or streams of odd length) necessitates non-trivial error handling.
  • It is not a simple byte-for-byte mapping: several bytes (2 in this case) must be grouped together to produce the next output byte, and whitespace characters may optionally be skipped.

So far I have come up with two approaches: implement the std::io::Read trait or the Iterator trait. I'll discuss readers first.

While implementing the hex decoder as a reader adapter, I encountered the following passage in the documentation:

If an error is returned then it must be guaranteed that no bytes were read.

That passage is a bit unclear to me. Does it apply only to transient errors? What if my reader adapter calls its underlying reader multiple times, generates some output, and then encounters an error? Should it discard the error and just return the number of bytes read so far, or is such behavior prohibited? Note that, in contrast, Go readers may return some bytes together with an error.

If a reader adapter conforms to the "no partial failures" rule, it should issue no more than one call to the underlying reader. Then the following problem appears: what if only 1 byte is read? The adapter must not return Ok(0), as that would signify EOF. One solution that I devised was to return an io::Error with std::io::ErrorKind::Interrupted, but that feels kind of hacky.

Implementing the hex decoder as an iterator brings its own set of questions. First, to retain meaningful error information I must set Item = Result<u8, E> for some E. But then composing iterators becomes rather cumbersome. Also, there is no easy way for a caller to distinguish recoverable errors, after which iteration can continue, from unrecoverable ones. One solution that I can think of is to set a flag after an unrecoverable error and return None on subsequent calls to .next(). Is there a better approach?
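
For illustration, the flag idea could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a definitive design: `HexDecode` and `HexError` are hypothetical names, the input is assumed to be a plain `Iterator<Item = u8>`, and upper-case digits are accepted in addition to 0-9 and a-f.

```rust
/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum HexError {
    /// A byte that is neither a hex digit nor ASCII whitespace.
    InvalidByte(u8),
    /// The input ended after the first digit of a pair.
    OddLength,
}

/// Decodes pairs of hex digits from a byte iterator, skipping ASCII whitespace.
pub struct HexDecode<I> {
    inner: I,
    /// Set after end of input or after any error; `next()` then returns `None`.
    done: bool,
}

impl<I: Iterator<Item = u8>> HexDecode<I> {
    pub fn new(inner: I) -> Self {
        HexDecode { inner, done: false }
    }

    /// Returns the value (0..=15) of the next hex digit, skipping whitespace.
    fn next_nibble(&mut self) -> Option<Result<u8, HexError>> {
        loop {
            let b = self.inner.next()?;
            if b.is_ascii_whitespace() {
                continue;
            }
            return Some(match (b as char).to_digit(16) {
                Some(d) => Ok(d as u8),
                None => Err(HexError::InvalidByte(b)),
            });
        }
    }
}

impl<I: Iterator<Item = u8>> Iterator for HexDecode<I> {
    type Item = Result<u8, HexError>;

    fn next(&mut self) -> Option<Self::Item> {
        if self.done {
            return None;
        }
        let result = match self.next_nibble() {
            None => {
                self.done = true;
                return None;
            }
            Some(Err(e)) => Err(e),
            Some(Ok(hi)) => match self.next_nibble() {
                None => Err(HexError::OddLength),
                Some(Err(e)) => Err(e),
                Some(Ok(lo)) => Ok(hi << 4 | lo),
            },
        };
        // Every error is treated as unrecoverable in this sketch.
        if result.is_err() {
            self.done = true;
        }
        Some(result)
    }
}
```

With this sketch, `HexDecode::new("48 65".bytes())` yields `Ok(0x48)` and `Ok(0x65)`, while an invalid digit yields one `Err` followed by `None`.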

My final question concerns interoperability between these two approaches. A reader can be transformed into a byte stream iterator with a call to .bytes(). Why isn't the reverse transformation (i.e. implementing std::io::Read for byte stream iterators) defined in the standard library? It seems straightforward and fairly useful, but I can't do it myself since neither the trait nor the type is defined by me.
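
One way around the "neither the trait nor the type is mine" restriction is a local wrapper type. The following is only a sketch: `IterReader` is a hypothetical name, and the choice to leave an error in the iterator until the next call (via `Peekable`) is an assumption made here so that bytes already copied are not lost.

```rust
use std::io::{self, Read};
use std::iter::Peekable;

/// Hypothetical newtype: because it is defined locally, implementing
/// `Read` for it does not violate the orphan rule.
pub struct IterReader<I: Iterator> {
    iter: Peekable<I>,
}

impl<I: Iterator<Item = io::Result<u8>>> IterReader<I> {
    pub fn new(iter: I) -> Self {
        IterReader { iter: iter.peekable() }
    }
}

impl<I: Iterator<Item = io::Result<u8>>> Read for IterReader<I> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let mut n = 0;
        while n < buf.len() {
            // If some bytes were already copied and the next item is an
            // error, stop and leave the error for the following call.
            if n > 0 && matches!(self.iter.peek(), Some(Err(_))) {
                break;
            }
            match self.iter.next() {
                Some(Ok(b)) => {
                    buf[n] = b;
                    n += 1;
                }
                Some(Err(e)) => return Err(e),
                None => break, // end of iterator maps to EOF
            }
        }
        Ok(n)
    }
}
```

Pulling one item per byte is slow for large inputs, so this is meant to show the shape of the adapter rather than an efficient implementation.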

Yes. If it's a transient error, no harm, no foul; if it's a permanent error, calling read again should just return the error again. If this makes you uncomfortable (or you have some odd situation where the error would be ignored indefinitely), you could always just stash the error and yield it on the next call to read, but that's not really necessary.
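
To make the "stash the error" option concrete, here is a rough sketch. `FillReader` is a hypothetical adapter that calls the inner reader repeatedly to fill the caller's buffer; retrying on `ErrorKind::Interrupted` is left out to keep it short.

```rust
use std::io::{self, Read};

/// Hypothetical adapter that tries to fill the caller's buffer using
/// several reads from the inner reader.
pub struct FillReader<R> {
    inner: R,
    /// Error seen after some bytes were already produced; it is
    /// returned by the next call to `read`.
    pending: Option<io::Error>,
}

impl<R: Read> FillReader<R> {
    pub fn new(inner: R) -> Self {
        FillReader { inner, pending: None }
    }
}

impl<R: Read> Read for FillReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        // Report a stashed error before touching the inner reader.
        if let Some(e) = self.pending.take() {
            return Err(e);
        }
        let mut filled = 0;
        while filled < buf.len() {
            match self.inner.read(&mut buf[filled..]) {
                Ok(0) => break, // EOF
                Ok(n) => filled += n,
                // Nothing produced yet: pass the error straight through.
                Err(e) if filled == 0 => return Err(e),
                // Some bytes already produced: keep them, defer the error.
                Err(e) => {
                    self.pending = Some(e);
                    break;
                }
            }
        }
        Ok(filled)
    }
}
```

This way a call either returns bytes or an error, never both, and no error is silently dropped.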

Personally, I'd either block (I don't know if this would work with your interface) or return WouldBlock.

This is actually not uncommon. See std::fs::ReadDir.

I haven't noticed this being cumbersome in practice. However, I have to admit that I have wanted Iterator::map_ok at times.
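
As a small illustration using only the standard library, `.map(|r| r.map(f))` can stand in for a `map_ok`-style adapter, and collecting into a `Result` stops at the first error. The function name `double_all` is made up for this sketch.

```rust
/// Doubles every successfully produced byte and stops at the first error.
pub fn double_all<I, E>(items: I) -> Result<Vec<u16>, E>
where
    I: Iterator<Item = Result<u8, E>>,
{
    items
        // Transform only the `Ok` values; errors pass through untouched.
        .map(|r| r.map(|b| u16::from(b) * 2))
        // Collecting into `Result<Vec<_>, E>` short-circuits on the first `Err`.
        .collect()
}
```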

If you want to use the reader, just don't convert it to an iterator. I can't think of any case where someone might convert a reader to an iterator and then want to convert it back. If you just want to borrow the reader as an iterator, Read is implemented on &mut R where R: Read so you can just call:

    let mut bytes = (&mut my_reader).bytes();
    /* do stuff */
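
A slightly fuller sketch of the same borrowing pattern, assuming a made-up format in which the first byte is a tag and the rest is payload (`split_tag` is a hypothetical helper):

```rust
use std::io::{self, Read};

/// Reads one tag byte through a borrowed `Bytes` iterator, then keeps
/// using the same reader directly for the remaining data.
pub fn split_tag<R: Read>(mut reader: R) -> io::Result<(Option<u8>, Vec<u8>)> {
    // `by_ref()` is the method form of `(&mut reader)`; the borrow ends
    // at the end of this statement.
    let tag = reader.by_ref().bytes().next().transpose()?;
    let mut rest = Vec::new();
    reader.read_to_end(&mut rest)?;
    Ok((tag, rest))
}
```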

Custom adapters can help a lot there. For example, Itertools adds a fold_results method to all iterators.

I've needed this recently and, as a result, published the iter-read crate.


Out of curiosity, could you tell me what your use case was?

For my serde-pickle library I wanted to provide approximately the same interface as serde_json, but I built the deserializer upon Read and not Iterator<Result<u8>>. To get a from_iter I needed to have an Iterator -> Read adapter.

Ah. Now that makes sense.

Well I have finally produced the code that is kind of satisfactory to me. Here it is in case anyone is interested: Rust Playground

HexDecoder implements std::io::Read. It does buffering similarly to std::io::BufReader but with a twist: if after a read from the inner reader there are still not enough bytes to proceed, it will try to read more. For example if the inner reader is a slow TcpStream with bytes trickling one by one the read will block until there are 2 bytes available.

No bytes are consumed if the underlying reader reports an error. If there is an invalid byte in the stream, it will read up to that byte and then report the error. Unexpected EOFs are reported too.
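
The code behind the Playground link is not reproduced here. As a rough, independent sketch of the "keep reading until two bytes are available" idea, something like the following could work; `PairBuf` and `next_pair` are hypothetical names, and the hex decoding of the pair is left out.

```rust
use std::io::{self, ErrorKind, Read};

/// Hypothetical helper that hands out input bytes two at a time.
pub struct PairBuf<R> {
    inner: R,
    pair: [u8; 2],
    /// Number of bytes of `pair` filled so far (0, 1 or 2).
    have: usize,
}

impl<R: Read> PairBuf<R> {
    pub fn new(inner: R) -> Self {
        PairBuf { inner, pair: [0; 2], have: 0 }
    }

    /// Reads from the inner reader until two bytes are buffered.
    /// Returns `Ok(None)` on a clean EOF and `UnexpectedEof` if the
    /// input ends after a single byte.
    pub fn next_pair(&mut self) -> io::Result<Option<[u8; 2]>> {
        while self.have < 2 {
            match self.inner.read(&mut self.pair[self.have..]) {
                Ok(0) if self.have == 0 => return Ok(None),
                Ok(0) => {
                    return Err(io::Error::new(
                        ErrorKind::UnexpectedEof,
                        "input ended in the middle of a pair",
                    ))
                }
                Ok(n) => self.have += n,
                Err(e) if e.kind() == ErrorKind::Interrupted => continue,
                // `have` is kept, so a byte read earlier is not lost
                // and the caller may retry after a transient error.
                Err(e) => return Err(e),
            }
        }
        self.have = 0;
        Ok(Some(self.pair))
    }
}
```

With a reader that delivers one byte per call, `next_pair` simply calls it twice before returning.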

I have already implemented that feature plus plenty more in GitHub - abonander/buf_redux: A drop-in replacement for Rust's std::io::BufReader, with extra features. Have a look!

Okay, so I don't implement that feature exactly as it's a little too specific for a general buffer API, but .available() and .read_into_buf() make it pretty easy to implement yourself with a lot more control over the retry attempts than an opaque method could provide.

When specialization stabilizes I plan to use it to optimize buffer allocation by skipping zeroing for trusted reader types, like the ones from the stdlib.

There's a trait (BufReadGrow) for this in my netio crate.
It provides exactly the same semantics that you mentioned. The entire crate is designed not to consume or lose any data on error.

The crate depends on @DroidLogician's excellent buf_redux crate, providing some useful functions on top of it. But it's not limited to the BufReader from buf_redux, it also implements that functionality for useful adapters like Take.

@troplin @DroidLogician Thanks guys! Your code is very instructive. It is much easier to implement correct readers using these helpers than with the bare standard library (in fact, I was a bit surprised that something like them is not present in the stdlib).

With being so easy to use I think it's reasonable to keep libstd small and only add the most important traits (for interoperability). It's easy to add something to the standard library, but almost impossible to remove or correct something.

BTW, I hope we'll see your code on too, one day. :wink: