Streaming character codecs


#1

Continuing the discussion with @jan_hudec about my project for streamed character codecs here on the users forum, as the crate project has little to do with Rust internals.

I certainly consider iconv as one possible implementation of Decode and a future Encode.
A goal of the current design is to allow largely non-copying decoders when the source encoding is a subset of UTF-8, such as UTF-8 itself or ASCII. In these cases, the decoder’s only job is to validate the byte stream. To consume a buffered reader without extra allocations, though, the decoder occasionally has to save a partial UTF-8 sequence in its state, so that it can assemble the complete sequence across adjacent buffered reads.
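To make the partial-sequence point concrete, here is a minimal sketch of such a validating decoder. It is not the code from my crate: `Utf8Validator`, `InvalidSequence`, `feed` and `finish` are names made up for this post. It only relies on `std::str::from_utf8` and on `Utf8Error::error_len()` returning `None` when the input ends in the middle of a sequence.

```rust
use std::str;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
struct InvalidSequence;

/// Hypothetical streaming UTF-8 validator. The only state kept between
/// calls is the prefix of one incomplete sequence (at most 3 bytes).
#[derive(Default)]
struct Utf8Validator {
    pending: [u8; 4],
    pending_len: usize,
}

impl Utf8Validator {
    /// Validates `input` and hands every valid piece to `sink`.
    /// Pieces are borrowed from `input`, except for a sequence that was
    /// split across two calls, which is assembled in `pending`.
    fn feed<F: FnMut(&str)>(
        &mut self,
        mut input: &[u8],
        mut sink: F,
    ) -> Result<(), InvalidSequence> {
        // Complete the sequence left over from the previous chunk, if any.
        while self.pending_len > 0 && !input.is_empty() {
            self.pending[self.pending_len] = input[0];
            self.pending_len += 1;
            input = &input[1..];
            match str::from_utf8(&self.pending[..self.pending_len]) {
                Ok(s) => {
                    sink(s);
                    self.pending_len = 0;
                }
                // Still a valid prefix of a longer sequence: keep going.
                Err(e) if e.error_len().is_none() => {}
                Err(_) => {
                    self.pending_len = 0;
                    return Err(InvalidSequence);
                }
            }
        }
        match str::from_utf8(input) {
            Ok(s) => {
                if !s.is_empty() {
                    sink(s);
                }
                Ok(())
            }
            Err(e) => {
                let (valid, rest) = input.split_at(e.valid_up_to());
                if !valid.is_empty() {
                    // This prefix was already checked above; a real
                    // implementation would avoid the second scan.
                    sink(str::from_utf8(valid).expect("prefix was validated"));
                }
                if e.error_len().is_some() {
                    return Err(InvalidSequence);
                }
                // The input ended mid-sequence: save the 1 to 3 tail bytes.
                self.pending[..rest.len()].copy_from_slice(rest);
                self.pending_len = rest.len();
                Ok(())
            }
        }
    }

    /// Call at end of input; a dangling partial sequence is an error.
    fn finish(&mut self) -> Result<(), InvalidSequence> {
        let dangling = self.pending_len > 0;
        self.pending_len = 0;
        if dangling { Err(InvalidSequence) } else { Ok(()) }
    }
}
```

Each chunk returned by a buffered reader would be passed to `feed`; only a sequence that straddles two chunks ever gets copied.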

The code currently in master is already outdated; please bear with me, as I’m sitting on some pending changes that are blocked on a few borrow checker issues.


#2

I wonder whether it might be better to represent these as iterator adapters. I also wonder about cases where a codec represents as a single code point something that takes multiple Unicode code points to form the grapheme.


#3

I find the iterator-based design problematic: it’s not possible to iterate through a reader without dragging an io::Result everywhere. This raises performance concerns and also makes the decoder trait dependent on std::io. In my current implementation, by contrast, the decoder trait works with slices and internal buffers, provides a specialized error type, and can be used outside of the I/O framework.
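Roughly the shape I mean, as a sketch only. The trait and error below are simplified stand-ins written for this post (the method signature, `DecodeError::InvalidByte` and `AsciiDecoder` are assumptions, not what is in master); the point is that nothing in them mentions std::io.

```rust
use std::fmt;

/// Hypothetical codec-specific error; no io::Error involved.
#[derive(Debug, PartialEq)]
pub enum DecodeError {
    /// The byte at `offset` in the input slice is not valid in this encoding.
    InvalidByte { offset: usize },
}

impl fmt::Display for DecodeError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            DecodeError::InvalidByte { offset } => {
                write!(f, "invalid byte at offset {}", offset)
            }
        }
    }
}

/// Hypothetical slice-based decoder trait: bytes in, text appended to `out`.
pub trait Decode {
    fn decode(&mut self, input: &[u8], out: &mut String) -> Result<(), DecodeError>;
}

/// ASCII needs no state between calls, so the struct is empty.
pub struct AsciiDecoder;

impl Decode for AsciiDecoder {
    fn decode(&mut self, input: &[u8], out: &mut String) -> Result<(), DecodeError> {
        match input.iter().position(|b| !b.is_ascii()) {
            // ASCII is a subset of UTF-8, so the bytes can be used as-is.
            None => {
                out.push_str(std::str::from_utf8(input).expect("ASCII is valid UTF-8"));
                Ok(())
            }
            Some(offset) => Err(DecodeError::InvalidByte { offset }),
        }
    }
}
```

A reader-driven loop can then live in a separate adapter, so io::Result only shows up at that outer layer.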

Emitting multiple code points is totally fine with internal buffering; it’s even possible to emit on end of input.
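A toy illustration of both points, assuming a made-up single-byte-plus-escape encoding (the byte values and mappings below are invented for this post and do not correspond to any real code page): one input byte expands to two code points, and a lead byte left dangling at end of input produces a replacement character from `finish`.

```rust
/// Toy decoder for an invented encoding; all mappings are hypothetical.
#[derive(Default)]
struct ToyDecoder {
    /// True when the lead byte 0x8E was seen and its trail byte
    /// has not arrived yet.
    awaiting_trail: bool,
}

impl ToyDecoder {
    fn feed(&mut self, input: &[u8], out: &mut String) {
        for &b in input {
            if self.awaiting_trail {
                self.awaiting_trail = false;
                // Invented mapping: trail byte selects a char in U+0400..=U+04FF.
                out.push(char::from_u32(0x0400 + u32::from(b)).unwrap_or('\u{FFFD}'));
                continue;
            }
            match b {
                0x00..=0x7F => out.push(char::from(b)),
                // One byte, two code points: base letter plus combining mark.
                0x80 => out.push_str("\u{00CA}\u{0304}"),
                0x8E => self.awaiting_trail = true,
                _ => out.push('\u{FFFD}'),
            }
        }
    }

    /// Emits on end of input: a dangling lead byte becomes U+FFFD.
    fn finish(&mut self, out: &mut String) {
        if std::mem::take(&mut self.awaiting_trail) {
            out.push('\u{FFFD}');
        }
    }
}
```

Because output goes into a caller-supplied buffer rather than through `Iterator::next`, neither case needs special handling on the consumer side.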

For the writer side, I want to support formatted writing, which is also not iterator-oriented.


#4

Also, the WHATWG encoding spec requires lookahead-y algorithms for error recovery modes, which practically requires buffering.


#5

I believe all such cases either need only a bounded (small) amount of look-ahead (so you can have a fixed-size buffer of a few bytes) or can be rewritten as something equivalent that does not do any look-ahead.
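As a sketch of the bounded case, here is a streaming UTF-16LE decoder with replacement-style error recovery. The names (`Utf16LeDecoder`, `feed`, `finish`) are hypothetical and the recovery behaviour is my reading of the usual approach, not a verified port of the spec. Its whole "look-ahead" state is one pending byte plus one pending high surrogate; when the surrogate turns out to be unpaired, it emits U+FFFD and re-examines the current code unit instead of peeking ahead.

```rust
/// Hypothetical streaming UTF-16LE decoder with fixed-size state.
#[derive(Default)]
struct Utf16LeDecoder {
    /// First byte of a code unit split across two chunks.
    lead_byte: Option<u8>,
    /// High surrogate waiting for its low surrogate.
    high_surrogate: Option<u16>,
}

impl Utf16LeDecoder {
    fn feed(&mut self, input: &[u8], out: &mut String) {
        for &b in input {
            match self.lead_byte.take() {
                None => self.lead_byte = Some(b),
                Some(lo) => self.push_unit(u16::from_le_bytes([lo, b]), out),
            }
        }
    }

    fn push_unit(&mut self, unit: u16, out: &mut String) {
        if let Some(hi) = self.high_surrogate.take() {
            if (0xDC00..=0xDFFF).contains(&unit) {
                let c = 0x10000
                    + ((u32::from(hi) - 0xD800) << 10)
                    + (u32::from(unit) - 0xDC00);
                // A valid surrogate pair always yields a scalar value.
                out.push(char::from_u32(c).expect("valid surrogate pair"));
                return;
            }
            // Unpaired high surrogate: replace it, then fall through and
            // process `unit` again as if nothing had preceded it.
            out.push('\u{FFFD}');
        }
        match unit {
            0xD800..=0xDBFF => self.high_surrogate = Some(unit),
            0xDC00..=0xDFFF => out.push('\u{FFFD}'),
            _ => out.push(char::from_u32(u32::from(unit)).expect("not a surrogate")),
        }
    }

    /// End of input: any leftover state becomes a single U+FFFD.
    fn finish(&mut self, out: &mut String) {
        let dangling_byte = self.lead_byte.take().is_some();
        let dangling_surrogate = self.high_surrogate.take().is_some();
        if dangling_byte || dangling_surrogate {
            out.push('\u{FFFD}');
        }
    }
}
```

The state is three bytes' worth of information regardless of input size, so no general-purpose buffer is needed for this encoding.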