I have some data that is supposed to be UTF-8 encoded, available as chunks of bytes (for simplicity, let's say Iterator<Item=Vec<u8>>, although I actually have ByteStream from aws_sdk_s3).
I want to break it into lines available as an Iterator<Item=String>. How can I do that?
I'd expect some solution to already exist, but I could not find any. Does anyone know of one?
I could scan the data for line ends, break it into Vec<u8> pieces, and then use String::from_utf8 or String::from_utf8_lossy (see the sketch after these options). But that means scanning the data twice (once for line ends, and once for the UTF-8 check), which seems wasteful.
I could scan the data for UTF-8 code points and add them to a String until I find a line break. This scans the data only once, but copies it in small portions, which also seems wasteful.
I could scan the data for UTF-8 correctness until I find a line break, and then copy a whole chunk to a Vec<u8>, which I then unsafely transform into a String. That seems to be the most efficient, but at the expense of using unsafe.
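For concreteness, here's a rough sketch of the first option (the name LineIter and the overall shape are just made up for illustration; I'm using String::from_utf8_lossy to keep the error handling short):

```rust
// Rough sketch of the "scan for line ends, then validate" approach:
// buffer incoming chunks, look for b'\n' (first pass over the bytes),
// then validate each complete line as UTF-8 (second pass).
struct LineIter<I: Iterator<Item = Vec<u8>>> {
    chunks: I,
    buf: Vec<u8>,
    done: bool,
}

impl<I: Iterator<Item = Vec<u8>>> Iterator for LineIter<I> {
    type Item = String;

    fn next(&mut self) -> Option<String> {
        loop {
            // First scan: look for a line end in the buffered bytes.
            if let Some(pos) = self.buf.iter().position(|&b| b == b'\n') {
                let rest = self.buf.split_off(pos + 1);
                let mut line = std::mem::replace(&mut self.buf, rest);
                line.pop(); // drop the '\n'
                // Second scan: UTF-8 validation (lossy, for brevity).
                return Some(String::from_utf8_lossy(&line).into_owned());
            }
            if self.done {
                if self.buf.is_empty() {
                    return None;
                }
                // Emit a trailing line that has no final '\n'.
                let line = std::mem::take(&mut self.buf);
                return Some(String::from_utf8_lossy(&line).into_owned());
            }
            match self.chunks.next() {
                Some(chunk) => self.buf.extend_from_slice(&chunk),
                None => self.done = true,
            }
        }
    }
}

// Usage:
// let chunks = vec![b"hello\nwo".to_vec(), b"rld\n".to_vec()];
// let lines: Vec<String> =
//     LineIter { chunks: chunks.into_iter(), buf: Vec::new(), done: false }.collect();
// assert_eq!(lines, ["hello", "world"]);
```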
Even so, they do scan the contents twice, at least on a chunk-by-chunk basis.[1]
I’m not sure whether or not that’s actually optimal, but it seems:

- it can re-use the existing, likely highly-optimized UTF-8 validity checker function this way, and it’s presumably easier to do that than to write (and maintain) a forked version of the checker that also searches for newlines… just guessing at the motivation though
- the other scan, the one for the newline, can use the highly-optimized system function memchr, so it should be really efficient anyway
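To make the two scans concrete, here's roughly the shape of read_line as I understand it (paraphrased, not the literal std source; the real implementation appends into the String's internal buffer behind an unsafe guard):

```rust
use std::io::{self, BufRead};

// Paraphrase of the double scan in std's read_line:
// scan 1: read_until does a memchr-style search for b'\n' and copies raw bytes;
// scan 2: str::from_utf8 re-walks those bytes to validate UTF-8.
fn read_line_sketch<R: BufRead>(reader: &mut R, out: &mut String) -> io::Result<usize> {
    let mut bytes = Vec::new();
    let n = reader.read_until(b'\n', &mut bytes)?; // scan 1: newline search
    match std::str::from_utf8(&bytes) {
        // scan 2: UTF-8 validation of the whole line
        Ok(s) => {
            out.push_str(s);
            Ok(n)
        }
        Err(_) => Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "stream did not contain valid UTF-8",
        )),
    }
}
```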
Anyway, yeah, that validity checker doesn’t look like something a “search for newline” couldn’t be integrated into without too much trouble… actually, now I’m noticing the special ASCII-handling loop in that checker; perhaps that one would get worse with a newline check added? IDK. My best guess is that they thought it “fast enough”, especially under the assumption of handling one chunk at a time.
[1] Upon re-reading the source code, I’m now noticing that it’s actually a “line at a time”, not “chunk at a time” (which makes sense, because a str::from_utf8 check can more easily be applied to the whole line, as it won’t be cut up in the middle of any character).
Still, I would guess the working assumption here is that each line will often be rather short in practice; if that’s the case, scanning twice at least doesn’t waste too much memory bandwidth, since things should already be in cache.
Nonetheless: trying to learn from std here also leaves me with the conclusion that there may be room for improvement even compared to std’s approach for BufRead::Lines/BufRead::read_line, I guess. (Also, it seems unlikely this will be a bottleneck in practice, as usually you’re doing more with the String afterwards than just cutting the input up at the \n and verifying its UTF-8; so IMO, this also borders on premature optimization.)
Interesting follow-up exercise: look at what tokio::io::AsyncBufReadExt::lines does, for further comparison. That one also supports async, so with you mentioning aws_sdk_s3’s ByteStream, which seems to involve async anyway, it’s perhaps even adaptable for direct use here?
So with that, i.e. tokio’s lines method, and perhaps additionally wrapping with tokio_stream::wrappers::LinesStream::new, you can directly build a Stream<Item = io::Result<String>>, i.e. (modulo the error handling) the current best async equivalent of an Iterator<Item = String>.
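A rough sketch of that wiring (assuming you can first get some impl tokio::io::AsyncRead out of the ByteStream; recent SDK versions have an into_async_read adapter, but check what yours offers):

```rust
use tokio::io::{AsyncBufReadExt, AsyncRead, BufReader};
use tokio_stream::{wrappers::LinesStream, StreamExt};

// Turn any async byte reader into a stream of lines. `reader` stands in
// for whatever you get out of the ByteStream (e.g. an into_async_read-style
// adapter, depending on your SDK version).
async fn print_lines<R: AsyncRead + Unpin>(reader: R) -> std::io::Result<()> {
    let lines = BufReader::new(reader).lines();
    // LinesStream wraps `Lines` into a Stream<Item = io::Result<String>>.
    let mut stream = LinesStream::new(lines);
    while let Some(line) = stream.next().await {
        println!("{}", line?);
    }
    Ok(())
}
```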