Rust beginner notes & questions

First of all, I have to say that ripgrep is impressive work!! I've used it just recently because it smokes everything else if you need to trawl through gigabytes of data for a keyword.

The whole argument I've been trying to clumsily make above is that your hard work on things like BOM detection and encoding switching should have been built into Rust, not be part of the ripgrep codebase. At the end of the day, what you've written is a single-purpose tool, but large chunks of its codebase look general-purpose to me. That is the "code smell" that I'm concerned about. It indicates to me that Rust's standard library has too many gaps, and people have to reinvent wheels all over the place. Incompatible wheels.

If anything, your effort confirms my argument. E.g.:

https://github.com/BurntSushi/ripgrep/blob/b38b101c77003fb94aaaa8084fcb93b6862586eb/src/decoder.rs#L122-L126

If Read were a trait with a type parameter, this would not be an issue, because you could only ever read a whole number of u16 code units out of something like Read&lt;u16&gt;!
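
Concretely, I imagine something like this hypothetical shape (a sketch of my strawman, not anything in std):

```rust
use std::io;

// Hypothetical sketch: Read parameterized over the element type. A source
// implementing GenericRead<u16> can only ever hand back whole u16 code units,
// so the caller can never be left holding half of one.
trait GenericRead<T> {
    /// Fill `buf` with decoded items, returning how many T values were read.
    fn read(&mut self, buf: &mut [T]) -> io::Result<usize>;
}
```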

You had to write about 300 lines of fairly complex code, and I don't believe it's zero-copy. It looks like it makes 2-3 copies when processing UTF-16, and probably at least 1 or 2 even with UTF-8, though I'm not sure. The Read trait is inherently copy-based, so I don't think there's any way to avoid at least one copy.
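
To see why that last copy is unavoidable, look at the shape of the actual std API: the caller supplies the destination buffer, so the source has to copy into it even if it already holds the bytes somewhere.

```rust
use std::io::Read;

// Plain std: Read::read() moves bytes into a caller-supplied buffer, so the
// source must copy out of whatever storage it already has.
fn one_copy(mut src: impl Read) -> std::io::Result<Vec<u8>> {
    let mut buf = vec![0u8; 4096];
    let n = src.read(&mut buf)?; // the copy happens here
    buf.truncate(n);
    Ok(buf)
}
```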

In my imagination, an ideal API should support the most complex, worst-case scenario with the best possible performance. If it can do that, then everything simpler just "falls into place", and developers like you would not have to reinvent wheels such as BOM detection and encoding switching.

As a worst-case example, imagine that someone wants to decode something hideous, such as:

  • An XML stream that may be in a variety of encodings. The standards-compliant way of detecting the encoding can involve reading dozens of bytes ahead into the stream: see Extensible Markup Language (XML) 1.0 (Fifth Edition).
  • The source is a forward-only stream (e.g., an encrypted or compressed stream).
  • The source is being fed in by a user-mode network library, such as a high-performance RDMA network driver (common with InfiniBand or 40 Gbps Ethernet). To enable zero-copy, you can't provide a buffer during the read() call. Instead, a large pool of buffers must be registered with the network stack up-front, then consumed by your code and returned to the pool (see the sketch after this list).
  • The XML contains huge Base64-encoded binary blobs that are potentially too big to fit into memory, so you'd have to stream them out into a destination stream during decoding.
  • The rest of the XML contains millions of small strings (element names) and integer values (element contents) that you do not want to heap allocate during decoding. It's sufficient to simply compare the names against constant str values and decode the integers directly to i32 values. (e.g.: if xml.node_name == "foo" { ... } ).
  • You want to do all of this without reinventing the wheel at every step. E.g.: the base64 decoding for XML ought to be the same as base64 decoding used everywhere else.
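
Here's the buffer-lending shape I mean, as a rough sketch. It's essentially std's BufRead fill_buf()/consume() pair, except hypothetically generalized over the element type, so the source can lend pooled buffers of u16s, chars, or whatever:

```rust
use std::io;

// Hypothetical trait (not std, not tokio), sketching "lend, then consume":
// the source exposes its own buffer, which may come from a pre-registered
// pool (the RDMA case above), and nothing is consumed until the caller says so.
trait LendRead<T> {
    /// Expose the next run of items without consuming them.
    fn fill_buf(&mut self) -> io::Result<&[T]>;
    /// Mark `n` items as consumed; the remainder is seen again next time.
    fn consume(&mut self, n: usize);
}
```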

The new C# Pipelines API is targeted at exactly this kind of scenario. I looked at tokio as @vitalyd suggested, but it still makes permanently limiting assumptions, such as advancing the stream on read_buf() and assuming that the underlying streams are made up of bytes. Interestingly, they've gone halfway with the BufMut trait, but that's still very byte-centric and will likely not work well with things like text streams.

So for example, imagine you're flying along, decoding the Base64 data in nice 1 MB chunks or whatever, and you discover that the binary data ends 732 KB into the buffer you've just been given. The remaining 292 KB is XML. Now what? Stuff the unconsumed data back into the previous stream level?

This is why the C# Pipelines API doesn't consume buffers automatically: the Base64 decoder can simply mark 732 KB as consumed, mark itself as finished, and then the outer XML decoder can continue with the remaining 292 KB. This is both smoother for the developer and faster at runtime. You've already had to muck about with (thankfully small) buffers in ripgrep to do BOM detection. This gets much worse in more complex scenarios: think 5-7 layers of decoder nesting, not just 1-2.
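
For bytes, std's BufRead already has roughly the right fill_buf()/consume() shape, so here's a toy sketch of that layering against it (the check for '<' standing in for "end of the Base64 blob" is a deliberate simplification):

```rust
use std::io::{self, BufRead};

// Toy sketch of the consume model: the Base64 layer marks only what it used,
// and the unconsumed tail stays in the buffer for the outer XML layer.
fn drain_base64(src: &mut impl BufRead) -> io::Result<()> {
    loop {
        let buf = src.fill_buf()?;
        if buf.is_empty() {
            return Ok(()); // EOF
        }
        match buf.iter().position(|&b| b == b'<') {
            Some(end) => {
                // ... decode buf[..end] as Base64 ...
                src.consume(end); // e.g. 732 KB consumed, 292 KB left for XML
                return Ok(());
            }
            None => {
                let n = buf.len();
                // ... decode all n bytes as Base64 ...
                src.consume(n);
            }
        }
    }
}
```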

These tiny API design decisions can have huge ramifications down the track. Hence my disappointment with things like Read::read_to_string(). It shows that a very minor short-term convenience won out over a design that could last into the future.
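
The coupling is easy to demonstrate with std as it exists today: the trait itself decides that the output is UTF-8 and heap-allocated, whether the caller wanted that or not.

```rust
use std::io::Read;

// read_to_string() bakes UTF-8 validation and a heap-allocated String into
// the Read trait itself, instead of layering them on a generic reader.
fn slurp(mut r: impl Read) -> std::io::Result<String> {
    let mut s = String::new();
    r.read_to_string(&mut s)?; // encoding and allocation both decided here
    Ok(s)
}
```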

Before people chime in and complain that I'm just inventing unrealistic scenarios, imagine extending ripgrep to search through text in OOXML documents such as Word DOCX or Excel XLSX files. These are potentially very large (>1 GB), compressed via Zip, and can be encoded as either UTF-8 or UTF-16. Internally, the XML can be split into "parts", much like Zip archives split into multiple volumes. A compliant decoder has to be able to: append streams, decode forward-only, do XML encoding detection, and stitch the XML text fragments into a single "character stream" to match against.
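
The "append streams" step at least exists in std today via Read::chain(), but everything else in that pipeline has to be layered on top by hand (a sketch, with the OOXML-specific steps elided):

```rust
use std::io::Read;

// Plain std: chain() concatenates two forward-only sources, e.g. two OOXML
// "parts" of one logical XML stream. Encoding detection and decoding still
// have to be hand-rolled on top of the combined byte stream.
fn concat_parts(a: impl Read, b: impl Read) -> impl Read {
    a.chain(b)
}
```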

Now imagine writing a high-performance "virtual appliance" that does regular-expression-based "data loss prevention" scanning of documents passing through it at 40 Gbps. In principle, this is not all that different from the ripgrep use case, and the code ought to look similar.