Rust beginner notes & questions

It really doesn't. The transcoding is handled by a separate crate, and the shim isn't specific to ripgrep, so it could be lifted into a separate crate as well. Any enterprising individual could accomplish that. ripgrep used to be much more monolithic, and I've been steadily moving pieces out into separate crates. The UTF-16 shim is one such candidate for moving into a separate crate, but nobody has put in the work to do it.

That's false. UTF-16 is a variable width encoding (not all Unicode codepoints are representable via a single UTF-16 code unit), and I still need to transcode it to UTF-8 in order to search it. The regex engine could natively support UTF-16, but that has nothing to do with the definition of the Read trait and is a huge complication for very little gain. It's much simpler to just transcode.
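To make the variable-width point concrete, here is a minimal sketch using only the standard library. The helper names `utf16_units_per_char` and `utf16_to_utf8` are made up for illustration and are not part of ripgrep or any crate.

```rust
/// For each codepoint in `s`, report how many 16-bit code units UTF-16
/// needs to represent it. Codepoints outside the Basic Multilingual
/// Plane need two (a surrogate pair).
fn utf16_units_per_char(s: &str) -> Vec<(char, usize)> {
    s.chars().map(|c| (c, c.len_utf16())).collect()
}

/// Transcode a complete slice of UTF-16 code units to UTF-8, replacing
/// unpaired surrogates with U+FFFD.
fn utf16_to_utf8(units: &[u16]) -> String {
    String::from_utf16_lossy(units)
}
```

For example, U+1D11E (musical symbol G clef) takes two code units, so a string holding `a` followed by that codepoint is three code units long in UTF-16, not two.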

Which, again, could be shared with some effort. This is the premise of the Rust ecosystem: a small std library with a very low barrier to using crates in the ecosystem.

No. The shim is doing buffered reading. Specifically, if the shim is wrapped around an fs::File, then:

  1. UTF-16 encoded bytes are copied to an internal buffer directly from a read syscall (kernel to user).
  2. Transcoding is performed from the bytes in the internal buffer to the caller's buffer directly.

A perusal of the code makes it look like an additional copy is happening, but in practice, this copy is just rolling a small number of bytes from the end of the buffer to the beginning of the buffer that either couldn't fit in the caller's buffer or represent an incomplete UTF-16 sequence.
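The two steps and the rolling copy can be sketched roughly as follows. This is not ripgrep's actual shim: `Utf16LeReader` and its fields are hypothetical, it assumes little-endian input with no BOM sniffing, and it hand-rolls the decoding with the standard library only. Unlike the description above, this sketch refills only once its buffer is drained, so the roll in `fill` only ever moves an incomplete sequence (at most 3 bytes). The `spill` field is an extra assumption to cope with caller buffers smaller than 4 bytes.

```rust
use std::io::{self, Read};

/// Hypothetical shim (not ripgrep's code): reads UTF-16LE, yields UTF-8.
struct Utf16LeReader<R> {
    rdr: R,
    buf: Vec<u8>,   // filled directly by the inner reader
    pos: usize,     // start of bytes not yet transcoded
    end: usize,     // end of valid bytes in `buf`
    eof: bool,
    spill: [u8; 4], // UTF-8 bytes of one codepoint split across tiny reads
    spill_pos: usize,
    spill_len: usize,
}

impl<R: Read> Utf16LeReader<R> {
    fn with_capacity(rdr: R, cap: usize) -> Self {
        assert!(cap >= 4, "buffer must hold at least one surrogate pair");
        Utf16LeReader {
            rdr, buf: vec![0; cap], pos: 0, end: 0, eof: false,
            spill: [0; 4], spill_pos: 0, spill_len: 0,
        }
    }

    /// Roll the few unconsumed bytes to the front, then refill from `rdr`.
    fn fill(&mut self) -> io::Result<()> {
        self.buf.copy_within(self.pos..self.end, 0);
        self.end -= self.pos;
        self.pos = 0;
        let n = self.rdr.read(&mut self.buf[self.end..])?;
        self.eof = n == 0;
        self.end += n;
        Ok(())
    }

    /// Decode one codepoint at `pos`; `None` means more input is needed.
    fn next_char(&self) -> Option<(char, usize)> {
        let b = &self.buf[self.pos..self.end];
        if b.len() < 2 {
            // A dangling odd byte at EOF becomes U+FFFD.
            return if self.eof && b.len() == 1 { Some(('\u{FFFD}', 1)) } else { None };
        }
        let u1 = u16::from_le_bytes([b[0], b[1]]) as u32;
        if !(0xD800..0xDC00).contains(&u1) {
            // Single code unit (a lone low surrogate maps to U+FFFD).
            return Some((char::from_u32(u1).unwrap_or('\u{FFFD}'), 2));
        }
        if b.len() < 4 {
            // High surrogate whose partner has not arrived yet.
            return if self.eof { Some(('\u{FFFD}', 2)) } else { None };
        }
        let u2 = u16::from_le_bytes([b[2], b[3]]) as u32;
        if (0xDC00..0xE000).contains(&u2) {
            let cp = 0x10000 + ((u1 - 0xD800) << 10) + (u2 - 0xDC00);
            Some((char::from_u32(cp).unwrap_or('\u{FFFD}'), 4))
        } else {
            Some(('\u{FFFD}', 2))
        }
    }
}

impl<R: Read> Read for Utf16LeReader<R> {
    fn read(&mut self, out: &mut [u8]) -> io::Result<usize> {
        if out.is_empty() {
            return Ok(0);
        }
        // Finish handing out a codepoint that did not fit last time.
        if self.spill_pos < self.spill_len {
            let n = (self.spill_len - self.spill_pos).min(out.len());
            out[..n].copy_from_slice(&self.spill[self.spill_pos..self.spill_pos + n]);
            self.spill_pos += n;
            return Ok(n);
        }
        let mut written = 0;
        loop {
            while let Some((c, used)) = self.next_char() {
                let room = out.len() - written;
                if c.len_utf8() > room {
                    if written > 0 {
                        return Ok(written); // leave it buffered for the next call
                    }
                    // Caller buffer is smaller than this codepoint: stash it.
                    self.spill_len = c.encode_utf8(&mut self.spill).len();
                    self.spill_pos = room;
                    out.copy_from_slice(&self.spill[..room]);
                    self.pos += used;
                    return Ok(room);
                }
                // Transcode straight into the caller's buffer.
                written += c.encode_utf8(&mut out[written..]).len();
                self.pos += used;
            }
            if written > 0 || self.eof {
                return Ok(written);
            }
            self.fill()?;
        }
    }
}
```

Wrapping a `File` in this type and calling `read_to_string` on it would yield UTF-8, with malformed input replaced by U+FFFD rather than reported as an error. That replacement policy is a choice made for this sketch.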

No. The Read trait is just an OS-independent interface that loosely describes how to read data. For example, when reading from a File, the buffer provided to the read method is going to be written to directly by the OS. That's as little copying as is possible through this interface. To do better, you need to go into kernel land or use memory maps.
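As a small illustration of how little the trait asks for, here is a sketch of a consumer that works with any `Read` implementation. The function name `count_bytes` and the 8 KiB buffer size are arbitrary choices for the example.

```rust
use std::io::{self, Read};

/// Count the bytes produced by any reader. The caller owns `buf`; when
/// `rdr` is a `File`, each `read` call hands that buffer to the OS to
/// fill, with no intermediate copy in user space.
fn count_bytes<R: Read>(mut rdr: R) -> io::Result<u64> {
    let mut buf = [0u8; 8 * 1024];
    let mut total = 0u64;
    loop {
        match rdr.read(&mut buf) {
            Ok(0) => return Ok(total), // end of input
            Ok(n) => total += n as u64,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
}
```

The same function accepts a `File`, a `&[u8]`, stdin, or any shim wrapped around one of those.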

You're conflating concepts here. The additional copying is only necessary because I'm doing transcoding and because I wanted buffered reading. The extra copy from the transcoding could be avoided if the regex engine supported searching UTF-16 encoded bytes directly, but it doesn't. And again, this has nothing at all to do with the Read trait and everything to do with implementation details of how the regex engine was built.

(The extra copy here is also a red herring. The transcoding itself is the bottleneck.)

But ripgrep already does this, because Read implementations are composable:

$ cat sherlock
For the Doctor Watsons of this world, as opposed to the Sherlock
Holmeses, success in the province of detective work must always
be, to a very large extent, the result of luck. Sherlock Holmes
can extract a clew from a wisp of straw or a flake of cigar ash;
but Doctor Watson has to have it taken out for him and dusted,
and exhibited clearly, with a label attached.
$ iconv -f UTF-8 -t UTF-16 sherlock > sherlock-utf16

$ rg Watson sherlock
1:For the Doctor Watsons of this world, as opposed to the Sherlock
5:but Doctor Watson has to have it taken out for him and dusted,

$ rg Watson sherlock-utf16
1:For the Doctor Watsons of this world, as opposed to the Sherlock
5:but Doctor Watson has to have it taken out for him and dusted,

$ gzip sherlock-utf16
$ rg -z Watson sherlock-utf16.gz
1:For the Doctor Watsons of this world, as opposed to the Sherlock
5:but Doctor Watson has to have it taken out for him and dusted,

How do you think this works? There's a shim for doing gzip decompression, just like for UTF-16 transcoding. These shims don't know about each other but compose perfectly fine. This is the first time I've even bothered to try searching gzip compressed UTF-16, and it "just worked."
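A toy sketch of that composition, using only the standard library. The two shims here are deliberately trivial stand-ins and entirely hypothetical: `XorDecode` plays the role of the decompression layer and `AsciiLower` plays the role of the transcoding layer. Neither knows the other exists; each only requires that whatever it wraps implements `Read`. `matching_lines` is likewise a made-up stand-in for the searcher.

```rust
use std::io::{self, Read};

/// Toy stand-in for a decompression shim: undoes a one-byte XOR "encoding".
struct XorDecode<R> {
    rdr: R,
    key: u8,
}

impl<R: Read> Read for XorDecode<R> {
    fn read(&mut self, out: &mut [u8]) -> io::Result<usize> {
        let n = self.rdr.read(out)?;
        for b in &mut out[..n] {
            *b ^= self.key;
        }
        Ok(n)
    }
}

/// Toy stand-in for a transcoding shim: lower-cases ASCII bytes.
struct AsciiLower<R> {
    rdr: R,
}

impl<R: Read> Read for AsciiLower<R> {
    fn read(&mut self, out: &mut [u8]) -> io::Result<usize> {
        let n = self.rdr.read(out)?;
        out[..n].make_ascii_lowercase();
        Ok(n)
    }
}

/// Stand-in for the searcher: returns the 1-based numbers of the lines
/// containing `needle`. It sees only a `Read`, never the shims beneath.
fn matching_lines<R: Read>(mut rdr: R, needle: &str) -> io::Result<Vec<usize>> {
    let mut text = String::new();
    rdr.read_to_string(&mut text)?;
    Ok(text
        .lines()
        .enumerate()
        .filter(|(_, line)| line.contains(needle))
        .map(|(i, _)| i + 1)
        .collect())
}
```

Stacking them is just nesting constructors, e.g. `AsciiLower { rdr: XorDecode { rdr: file, key } }`, and the searcher is unchanged whether it is given the stack or the plain file.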

Yes, ripgrep contains these shims, but that's just because nobody has productionized them. This doesn't mean Rust's standard library has to do it, or even that the Read trait needs to change for this to happen. Somebody just needs to put in the work, and that's true regardless of whether it lives in std or in a crate.

I don't see any reason why the presence of read_to_string prevents the use cases you're talking about.

There are certainly a lot of moving parts here, but I don't see any reason why the Rust ecosystem isn't well suited to solve a problem like this. The interesting bits are building compliant decoders and supporting routines that can search character streams (which in the general case is always going to be slow). The Read trait isn't going to prevent you from doing that.
