Idiomatic decoder as iterator

I am trying to write a decoder whose number of consumed bytes is unknown until the values are decoded.

Assume that we have bytes: &[u8] in which an unknown number of initial bytes encode several u32 values (under some encoding) and the remaining bytes should be interpreted differently. A simple example is when the initial bytes are ULEB128-encoded numbers.

I would like to write a decoder that implements Iterator<Item = u32> such that, once it has been consumed (e.g. via decoder.collect::<Vec<u32>>()), decoder.consumed_bytes() yields the number of bytes consumed. I can't compute this number without decoding the bytes.

My problem at the moment is that .collect() takes ownership of the decoder, so I can't call decoder.consumed_bytes() after decoder.collect::<Vec<u32>>().

Is there an idiom that enables this behavior?

I could make the iterator's item a tuple carrying the consumed byte count, i.e. Iterator<Item = (u32, usize)>, but I am trying to see if there is another way, as that is less convenient for consumers that do not need the consumed bytes.
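For reference, the tuple-item alternative could look roughly like the sketch below. The function name uleb128_with_offsets is hypothetical, and the decoding is a bare-bones ULEB128 reader with no error reporting; it only illustrates that consumers who do not care about the offset have to strip it with an extra map.

```rust
/// Hypothetical helper: yields each decoded value together with the total
/// number of bytes consumed so far. Stops at the end of the input or on a
/// run of more than 5 continuation bytes (too long for a u32).
fn uleb128_with_offsets(bytes: &[u8]) -> impl Iterator<Item = (u32, usize)> + '_ {
    let mut offset = 0usize;
    std::iter::from_fn(move || {
        let mut value = 0u32;
        for (i, &byte) in bytes[offset..].iter().enumerate().take(5) {
            // Each byte contributes its low 7 bits, least significant group first.
            value |= u32::from(byte & 0x7f) << (7 * i as u32);
            if byte & 0x80 == 0 {
                // High bit clear: this was the last byte of the value.
                offset += i + 1;
                return Some((value, offset));
            }
        }
        None
    })
}

/// Consumers that only want the values must discard the offsets themselves.
fn values_only(bytes: &[u8]) -> Vec<u32> {
    uleb128_with_offsets(bytes).map(|(value, _)| value).collect()
}
```

The offset of the last yielded item is the number of bytes consumed, but it is only reachable by inspecting every item, which is the inconvenience described above.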

The other alternative is to not use iterators at all and to use io::Read and io::Write instead.


Context:

In the Apache Parquet format, a popular format in analytics, the delta byte array encoding of &["Hello", "World"] is

let data = &[128u8, 1, 4, 2, 0, 0, 0, 0, 0, 0, 128, 1, 4, 2, 10, 0, 0, 0, 0, 0, 72, 101, 108, 108, 111, 87, 111, 114, 108, 100]

It is composed of 3 parts:

  • [128u8, 1, 4, 2, 0, 0, 0, 0, 0, 0] is the delta-encoding of [0,0], the length of the common prefix of "Hello" and "World" (no common prefixes => 0)
  • [128, 1, 4, 2, 10, 0, 0, 0, 0, 0] is the delta-encoding of the lengths, [5, 5]
  • [72, 101, 108, 108, 111, 87, 111, 114, 108, 100] is "HelloWorld".

I would like to offer an iterator over the values ("Hello", "World") in this case. Since we do not know the size of the first two parts without iterating over them, I must persist their decoded values somewhere. I already have an iterator that yields the 0, 0, but I need to augment it so that, once it has finished, it also reports the consumed bytes and I know where the 2nd part begins.

If you call decoder.by_ref().collect::<...>() instead, it won’t consume decoder, so it’s still available for calling consumed_bytes().
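A minimal sketch of that suggestion, assuming a ULEB128 decoder like the one in the simple example from the question. The names Uleb128Decoder and decode_prefix are made up for illustration, and the count of values is assumed to be known from elsewhere (it is passed in as a parameter here); the decoder itself does no error reporting.

```rust
/// Hypothetical ULEB128 decoder that tracks how many bytes it has read.
struct Uleb128Decoder<'a> {
    bytes: &'a [u8],
    consumed: usize,
}

impl<'a> Uleb128Decoder<'a> {
    fn new(bytes: &'a [u8]) -> Self {
        Self { bytes, consumed: 0 }
    }

    /// Number of bytes read by the values yielded so far.
    fn consumed_bytes(&self) -> usize {
        self.consumed
    }
}

impl<'a> Iterator for Uleb128Decoder<'a> {
    type Item = u32;

    fn next(&mut self) -> Option<u32> {
        let mut value = 0u32;
        // A u32 needs at most 5 ULEB128 bytes; give up on longer runs.
        for (i, &byte) in self.bytes[self.consumed..].iter().enumerate().take(5) {
            value |= u32::from(byte & 0x7f) << (7 * i as u32);
            if byte & 0x80 == 0 {
                self.consumed += i + 1;
                return Some(value);
            }
        }
        None
    }
}

/// Decodes `count` values, then reports where the remaining bytes begin.
fn decode_prefix(bytes: &[u8], count: usize) -> (Vec<u32>, usize) {
    let mut decoder = Uleb128Decoder::new(bytes);
    // by_ref() borrows the decoder mutably, so take() and collect() consume
    // the borrow rather than the decoder itself.
    let values: Vec<u32> = decoder.by_ref().take(count).collect();
    (values, decoder.consumed_bytes())
}
```

With the returned offset, the caller can slice &bytes[consumed..] to reach the part that has to be interpreted differently.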
