Rust beginner notes & questions

ripgrep supports searching either UTF-8 or UTF-16 seamlessly, via BOM sniffing. My Windows users appreciate this. The search implementation itself only cares about getting something that implements io::Read. UTF-16 handling works by implementing a shim for io::Read that transcodes UTF-16 to UTF-8. I did this in less than a day's worth of work, and it was well worth it.
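
To make the shape of that shim concrete, here is a minimal sketch of the idea, not ripgrep's actual code (which uses the encoding_rs crate and handles chunk boundaries more carefully): a wrapper that owns an inner reader and exposes the transcoded UTF-8 through the ordinary io::Read interface.

use std::io::{self, Read};

/// Sketch of a UTF-16LE -> UTF-8 transcoding shim. It buffers decoded UTF-8
/// internally and hands it out through `io::Read` like any other reader.
struct Utf16LeToUtf8<R> {
    inner: R,
    pending: Vec<u8>, // transcoded UTF-8 not yet handed to the caller
}

impl<R: Read> Utf16LeToUtf8<R> {
    fn new(inner: R) -> Self {
        Self { inner, pending: Vec::new() }
    }

    fn refill(&mut self) -> io::Result<()> {
        // Pull a chunk of raw UTF-16LE bytes from the inner reader.
        let mut raw = [0u8; 4096];
        let mut n = self.inner.read(&mut raw)?;
        if n == 0 {
            return Ok(());
        }
        if n % 2 == 1 {
            // Simplification: complete the trailing code unit with one more byte.
            self.inner.read_exact(&mut raw[n..n + 1])?;
            n += 1;
        }
        let units: Vec<u16> = raw[..n]
            .chunks_exact(2)
            .map(|b| u16::from_le_bytes([b[0], b[1]]))
            .collect();
        // Simplification: a surrogate pair split across two chunks gets replaced
        // here; a real shim would keep the leftover unit for the next call.
        for res in std::char::decode_utf16(units) {
            let ch = res.unwrap_or(std::char::REPLACEMENT_CHARACTER);
            let mut utf8 = [0u8; 4];
            self.pending.extend_from_slice(ch.encode_utf8(&mut utf8).as_bytes());
        }
        Ok(())
    }
}

impl<R: Read> Read for Utf16LeToUtf8<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        if self.pending.is_empty() {
            self.refill()?;
        }
        let n = self.pending.len().min(buf.len());
        buf[..n].copy_from_slice(&self.pending[..n]);
        self.pending.drain(..n);
        Ok(n)
    }
}

Because the searcher only ever sees io::Read, it cannot tell whether it is reading a plain file or a shim like this.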

19 Likes

Ironically, I had the opposite experience when I did something similar a couple of years ago. What I looked up was the implementation for Box, expecting to see something quite simple and similar to C++ auto_ptr or unique_ptr. Perhaps if I looked again now that I have more experience, I would feel differently but at the time I felt that a significant amount of magic was being performed for what in C++ was quite straightforward.

1 Like

I suspect much of that magic is linked to support for Box<Trait>.

1 Like

Small off-topic question: why was it decided to join Box<SizedType>, Box<[T]> and Box<Trait> under the single Box roof instead of using three separate types for each use case, e.g. something like Dyn<Trait> for trait objects?

This sounds like the core of your concerns, and to be honest, I think it is definitely an area Rust could do better at. Maybe next year we need to have a Windows domain working group to put some serious effort behind it. I think the challenge is that it seems like Rust / open source developers gravitate more towards Mac / Linux than your average desktop user so the numbers are skewed.

Who knows, maybe one day Microsoft themselves might help out.

2 Likes

First of all, I have to say that ripgrep is impressive work!! I've used it just recently because it smokes everything else if you need to trawl through gigabytes of data for a keyword.

The whole argument I've been trying to clumsily make above is that your hard work for things like BOM detection and encoding switching should have been built into Rust and not be a part of the ripgrep codebase. At the end of the day, what you've written is a single-purpose tool, but large chunks of its codebase look general-purpose to me. That is the "code smell" that I'm concerned about. It indicates to me that the Rust library has too many gaps, and people have to reinvent wheels all over the place. Incompatible wheels.

If anything, your effort confirms my argument. E.g.:

https://github.com/BurntSushi/ripgrep/blob/b38b101c77003fb94aaaa8084fcb93b6862586eb/src/decoder.rs#L122-L126

If Read was a trait with a type parameter, this would not be an issue, because you could only ever read a whole number of u16 UCS codepoints out of something like Read<u16>!
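
Something like the following hypothetical trait is what I have in mind (it is not an existing std trait, just an illustration of the shape):

use std::io;

// Hypothetical only: a type-parameterised reader, so that a UTF-16 source
// hands out whole u16 code units instead of raw bytes.
trait GenericRead<T> {
    fn read(&mut self, buf: &mut [T]) -> io::Result<usize>;
}

// A UTF-16 source would implement `GenericRead<u16>`, so a caller could never
// observe half a code unit.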

You had to write about 300 lines of fairly complex code which I don't believe is zero-copy. It looks like it's making 2-3 copies when processing UTF-16, and probably at least 1 or 2 even with UTF-8, but I'm not sure. The Read trait is inherently copy-based, so I don't think there's any way to avoid at least 1 copy.

In my imagination, an ideal API should support the most complex, worst-case scenario with the best possible performance. If it can do that, then everything simpler should just "fall into place", and developers like you would not have to reinvent wheels such as BOM detection and encoding switching.

As a worst-case example, imagine that someone wants to decode something hideous, such as:

  • An XML stream that may be in a variety of encodings. The standards-compliant way of doing this can involve reading dozens of bytes into the stream; see Extensible Markup Language (XML) 1.0 (Fifth Edition).
  • The source is a forward-only stream (e.g.: an encrypted or compressed stream).
  • The source is being fed in by a user-mode network library, such as from a high-performance RDMA network driver (common with Infiniband or 40 Gbps Ethernet). To enable zero-copy, you can't provide a buffer during the read() call. Instead, a large pool of buffers must be registered for use by the network stack up-front and then consumed by your code and returned to the pool.
  • The XML contains huge chunks of Base64 encoded binary blobs that are potentially too big to fit into memory. You'd have to stream these out into a destination stream during decoding.
  • The rest of the XML contains millions of small strings (element names) and integer values (element contents) that you do not want to heap allocate during decoding. It's sufficient to simply compare the names against constant str values and decode the integers directly to i32 values (e.g.: if xml.node_name == "foo" { ... }); a sketch follows after this list.
  • You want to do all of this without reinventing the wheel at every step. E.g.: the base64 decoding for XML ought to be the same as base64 decoding used everywhere else.
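
As a sketch of the zero-allocation style that the element-name point above asks for (hypothetical types, not any particular XML crate): events borrow their names and text straight out of the decoder's buffer, so matching and integer parsing never allocate.

/// Hypothetical pull-parser events: names and text are borrowed from the
/// decoder's internal buffer rather than heap-allocated per element.
enum XmlEvent<'buf> {
    StartElement { name: &'buf str },
    Text(&'buf str),
    EndElement { name: &'buf str },
}

/// Returns the integer content of a `<foo>` element, allocating nothing.
fn handle(event: XmlEvent<'_>, in_foo: &mut bool) -> Option<i32> {
    match event {
        // Comparing against a constant str needs no allocation...
        XmlEvent::StartElement { name } => { *in_foo = name == "foo"; None }
        // ...and integers can be parsed directly out of the borrowed text.
        XmlEvent::Text(text) if *in_foo => text.trim().parse::<i32>().ok(),
        _ => None,
    }
}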

The new C# Pipelines API is targeted at this kind of scenario. I looked at tokio as @vitalyd suggested, but it's still doing permanently limiting things, such as advancing the stream on read_buf() and assuming that the underlying streams are made up of bytes. Interestingly, they've gone half-way with the BufMut trait, but that's still very byte-centric and will likely not work well with things like text streams.

So for example, imagine you're flying along, decoding the base64 data in nice 1MB buffer chunks or whatever, and you discover that 732KB into the buffer you've just been given is the end of the binary data. The remaining 292KB is XML. Now what? Stuff the unconsumed data back into the previous stream level?

This is why the C# Pipelines API doesn't consume buffers automatically: the base64 decoder can simply mark 732KB as consumed, mark itself as finished, and then the outer XML decoder can continue with the remaining 292KB. This is both smoother for the developer and faster at runtime. You've already had to muck about with (thankfully small) buffers in ripgrep to do BOM detection. This can get much worse in more complex scenarios. Think 5-7 layers of decoder nesting, not just 1-2.

These tiny API design decisions can have huge ramifications down the track. Hence my disappointment with things like Read::read_to_string(). It shows that very minor short-term convenience won out over design that can last into the future.

Before people chime in and complain that I'm just inventing unrealistic scenarios, imagine trying to extend ripgrep to support searching through text in OOXML documents such as Word DOCX or Excel XLSX documents. These are potentially very large (>1GB), compressed via Zip, and can be encoded with either UTF-8 or UTF-16. Internally, the XML files can be split into "parts", which are like Zip files split into multiple archives. A compliant decoder has to be able to: append streams, decode forward-only, do XML encoding detection, and stitch together XML text fragments into a single "character stream" to do matching on.

Now imagine writing a high-performance "virtual appliance" that does regular-expression based "data loss prevention" scanning of documents passing through it at 40 Gbps. In principle, this is not all that different to the ripgrep use-case, and the code ought to look similar.

1 Like

It really doesn't. The transcoding is itself handled by a separate crate, and the shim itself isn't specific to ripgrep and could be lifted into a separate crate. Any enterprising individual could accomplish that. ripgrep used to be much more monolithic, and I've been steadily moving pieces out into separate crates. The UTF-16 shim is one such candidate for moving into a separate crate, but nobody has put in the work to do it.

That's false. UTF-16 is a variable width encoding (not all Unicode codepoints are representable via a single UTF-16 code unit), and I still need to transcode it to UTF-8 in order to search it. The regex engine could natively support UTF-16, but that has nothing to do with the definition of the Read trait and is a huge complication for very little gain. It's much simpler to just transcode.

Which, again, could be shared with some effort. This is the premise of the Rust ecosystem: a small std library with a very low barrier to using crates in the ecosystem.

No. The shim is doing buffered reading. Specifically, if the shim is wrapped around a fs::File, then:

  1. UTF-16 encoded bytes are copied to an internal buffer directly from a read syscall (kernel to user).
  2. Transcoding is performed from the bytes in the internal buffer to the caller's buffer directly.

A perusal of the code makes it look like an additional copy is happening, but in practice, this copy is just rolling a small number of bytes from the end of the buffer to the beginning of the buffer that either couldn't fit in the caller's buffer or represent an incomplete UTF-16 sequence.

No. The Read trait is just an OS-independent interface that loosely describes how to read data. For example, when reading from a File, the buffer provided to the read method is going to be written to directly by the OS. That's as little copying as you can possibly do. To do better, you need to go into kernel land or use memory maps.

You're conflating concepts here. The additional copying is only necessary because I'm doing transcoding and because I wanted buffered reading. The extra copy from the transcoding could be avoided if the regex engine supported searching UTF-16 encoded bytes directly, but it doesn't. And again, this has nothing at all to do with the Read trait and everything to do with implementation details of how the regex engine was built.

(The extra copy here is also a red herring. The transcoding itself is the bottleneck.)

But ripgrep already does this, because Read implementations are composable:

$ cat sherlock
For the Doctor Watsons of this world, as opposed to the Sherlock
Holmeses, success in the province of detective work must always
be, to a very large extent, the result of luck. Sherlock Holmes
can extract a clew from a wisp of straw or a flake of cigar ash;
but Doctor Watson has to have it taken out for him and dusted,
and exhibited clearly, with a label attached.
$ iconv -f UTF-8 -t UTF-16 sherlock > sherlock-utf16

$ rg Watson sherlock
1:For the Doctor Watsons of this world, as opposed to the Sherlock
5:but Doctor Watson has to have it taken out for him and dusted,

$ rg Watson sherlock-utf16
1:For the Doctor Watsons of this world, as opposed to the Sherlock
5:but Doctor Watson has to have it taken out for him and dusted,

$ gzip sherlock-utf16
$ rg -z Watson sherlock-utf16.gz
1:For the Doctor Watsons of this world, as opposed to the Sherlock
5:but Doctor Watson has to have it taken out for him and dusted,

How do you think this works? There's a shim for doing gzip decompression, just like for UTF-16 transcoding. These shims don't know about each other but compose perfectly fine. This is the first time I've even bothered to try searching gzip compressed UTF-16, and it "just worked."
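
In code, the composition looks roughly like this (a sketch: flate2's GzDecoder is a real crate type, while Utf16LeToUtf8 stands in for the kind of transcoding shim sketched earlier in the thread):

use std::fs::File;
use std::io::{self, Read};

use flate2::read::GzDecoder; // assumes the `flate2` crate is a dependency

// Layer the readers exactly like the shell example: gzip decompression on the
// outside, then the UTF-16 transcoding shim, and any `io::Read`-based searcher
// can consume the result without knowing either layer exists.
fn open_utf16_gz(path: &str) -> io::Result<impl Read> {
    let file = File::open(path)?;
    let decompressed = GzDecoder::new(file);
    Ok(Utf16LeToUtf8::new(decompressed))
}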

Yes, ripgrep contains these shims, but that's just because nobody has productionized them. This doesn't mean Rust's standard library has to do it, or even that the Read trait needs to change for this to happen. Somebody just needs to put in the work, and that's true regardless of whether it lives in std or in a crate.

I don't see any reason why the presence of read_to_string prevents the use cases you're talking about.

There are certainly a lot of moving parts here, but I don't see any reason why the Rust ecosystem isn't well suited to solve a problem like this. The interesting bits are building compliant decoders and supporting routines that can search character streams (which in the general case is always going to be slow). The Read trait isn't going to prevent you from doing that.

13 Likes

How would you implement this for a Read over f32? What if you're trying to treat incoming measurement data as a stream of numbers, e.g.: for DSP-style programming? You can always use unimplemented!() or panic!(), but that's really icky because then libraries all over the place will have to include code that can crash the process.

Or, you could always reinvent the general concept of streams for your special case.

Either way, eww...

There are certainly a lot of moving parts here...

There doesn't have to be!

Your 300 lines of BOM-peeker code never needed to exist in the first place. If Read were designed more like System.IO.Pipelines and separated the "get a buffer" and "consume input items" concepts, then the BOM detection code would be hilariously trivial.

It would look vaguely like the following:

fn bom_detection<R>(mut source: R) -> Box<dyn Read2<Data = u8, Error = R::Error>>
where
    R: Read2<Data = u8> + 'static,
{
    // The reader provides the buffer, and we specify the *minimum* required elements (bytes).
    // Copy the few peeked bytes out so the borrow ends before we consume.
    let bom: [u8; 3] = match source.peek(3) {
        Ok(&[a, b, c, ..]) => [a, b, c],
        _ => return Box::new(source),
    };
    match bom {
        // A variable number of bytes can be consumed!
        [0xEF, 0xBB, 0xBF] => { let _ = source.consume(3); Box::new(source) }
        [0xFE, 0xFF, _] => { let _ = source.consume(2); Box::new(UCS16BigEndianConverter::new(source)) }
        [0xFF, 0xFE, _] => { let _ = source.consume(2); Box::new(UCS16LittleEndianConverter::new(source)) }
        // Pass-through works trivially because we're not forced to consume any of the bytes!
        _ => Box::new(source),
    }
}

Note the similarity with PEG-based parsers such as the nom crate, which use a similar "peek-and-consume-if-matched" pattern.

Similarly, zero-copy I/O doesn't have to be complicated, but the only way to do that is for read() to provide the buffer to the consumer instead of the consumer passing in a buffer to be filled. These are fundamentally opposite concepts, and the latter can never support the zero-copy scenario in the general case.

Your ripgrep utility gets to cheat with the special case of memory-mapping files, but this just doesn't work for network streams.

Think about it: where is the network stack going to put the data in between read() calls if read() is providing it the buffers... one at a time? The only way to do this is to give the network stack a bunch of buffers that it can fill itself. The user-mode code can then consume some of the buffers, process them, and return them to the pool while the rest of the buffers are being filled behind the scenes by RDMA.
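
A minimal sketch of that buffer-pool idea (purely illustrative, not any real RDMA or Network Direct binding): buffers are allocated and registered up front, the producer fills whichever free buffer it grabs, and the consumer returns each buffer once it has processed the contents.

use std::collections::VecDeque;

/// Illustrative buffer pool: no per-read allocation, no copying into a
/// caller-supplied buffer. Ownership of whole buffers moves back and forth.
struct BufferPool {
    free: VecDeque<Vec<u8>>,
}

impl BufferPool {
    fn with_buffers(count: usize, size: usize) -> Self {
        Self { free: (0..count).map(|_| vec![0u8; size]).collect() }
    }

    /// The producer (e.g. the network stack) takes an empty buffer to fill.
    fn take_free(&mut self) -> Option<Vec<u8>> {
        self.free.pop_front()
    }

    /// The consumer hands a buffer back after processing its contents.
    fn put_back(&mut self, buf: Vec<u8>) {
        self.free.push_back(buf);
    }
}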

It's a push-vs-pull API difference that can never be reconciled. You have to do one or the other. This should have been foreseen, but wasn't. The Windows Network Direct API has been around since 2010, and IIRC Linux actually beat them to it by several years because of the pervasive use of this type of programming in HPC clusters. Both revolve around providing a pool of buffers up front.

I'm not saying that the Rust team needed to implement HPC RDMA I/O from day #1, but a tiny bit of foresight is all it takes. The difference is literally one fn vs. two in the Read trait.

So all I'm saying is that the design of "all streams can be thought of as copying into a client-provided byte buffer, which is basically UTF-8 enough of the time for this convenience method to be present" is just wrong. No amount of wishful thinking will ever make this the general case. Meanwhile, the general case supports that scenario smoothly, along with integer token streams being produced by lexers, f32 streams being passed to DSPs, RDMA at 100 Gbps, and so on, and so forth...

I think one key point of the discussion here is that Rust provides a thin runtime which is close to the lowest-common-denominator OS API (in this case POSIX's byte-oriented IO), whereas C# provides a thick runtime which is close to programmer use cases (in this case structured IO). As much as I dislike this dichotomy, this is the textbook difference between a low-level programming language (annoying but predictable) and a high-level programming language (comfy but uncontrollable).

Now, of course, we could dream of an ideal world in which our programmer comfort would not be disturbed by crufty API design from the 70s. But if we cannot get that, the next best choice is to cater to both application devs and system devs using different tools. I would say that this is why having both Rust and C# is a good thing.

10 Likes

None of my examples with ripgrep used memory maps, so I don't know why you're bringing that up. My shim for transcoding doesn't assume the presence of a caller-provided buffer, but it could, and the code would be simpler, though it would make more assumptions.

This conversation is going in circles and there is too much certainty in your comments for my taste. Our lack of shared experience is preventing us from communicating productively, and in particular, it's pretty hard for me to grok everything you're saying. I personally don't have any experience with C#'s pipeline concept, so I can't really keep up. I suspect the reverse is true as well.

Usually the thing that helps move discussions like this forward is code, but building a prototype of your ideas in Rust is probably a lot of work. So I don't know how to continue. Sorry.

9 Likes

With composed readers. ByteReader (lowest-level provided by std) -> BufferedReader (provided by std) -> FloatReader (provided by a crate/make it). Same as anywhere else.
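
A sketch of what that "FloatReader" layer could look like (none of these types exist in std; this just shows the composition): it wraps any byte-oriented Read and hands out whole f32 values.

use std::io::{self, Read};

/// Hypothetical "FloatReader": turns any byte-oriented `Read` (ideally an
/// already-buffered one) into a reader of little-endian f32 values.
struct FloatReader<R> {
    inner: R,
}

impl<R: Read> FloatReader<R> {
    fn new(inner: R) -> Self {
        Self { inner }
    }

    /// Fills `out` with as many whole f32 values as are available; returns
    /// how many were read (0 at end of stream).
    fn read_f32(&mut self, out: &mut [f32]) -> io::Result<usize> {
        let mut raw = [0u8; 4];
        let mut n = 0;
        for slot in out.iter_mut() {
            match self.inner.read_exact(&mut raw) {
                Ok(()) => { *slot = f32::from_le_bytes(raw); n += 1; }
                // Simplification: treat end-of-stream as "no more values".
                Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => break,
                Err(e) => return Err(e),
            }
        }
        Ok(n)
    }
}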

My proposal is a lower "denominator" than the current Read trait. In fact, now that I think about it, I was wrong in my earlier statement that it can't be retrofitted into Rust because it's inherently incompatible with what's already there.

The exact opposite is true: It is a strict superset of std::io::Read, allowing it to implement the Read trait for the special case of u8. Meanwhile, the Read trait cannot implement the more elegant zero-copy trait, because:

  • It cannot read without consuming bytes.
  • It cannot read non-copy types even if generalised to a template trait with a default u8 parameter.
  • It breaks the performance contract of zero copy.

Let's call my proposal Read2:

trait Read2 {
    type Data;  // = u8; // with associated type defaults.
    type Error; // = (); // with associated type defaults.

    /// Returns a view of at least `items` items; `items` can be 0 for best-effort.
    fn peek(&mut self, items: usize) -> Result<&[Self::Data], Self::Error>;

    /// Can consume any number of items, acting much like `skip()`.
    fn consume(&mut self, items: usize) -> Result<(), Self::Error>;
}

// Ta-da: backwards-compatibility!
impl std::io::Read for dyn Read2<Data = u8, Error = std::io::Error> {
    fn read(&mut self, buf: &mut [u8]) -> Result<usize, std::io::Error> {
        let temp = self.peek(0)?;
        // THIS is the unavoidable copy inherent in all implementors
        // of std::io::Read.
        let read_items = temp.len().min(buf.len());
        buf[..read_items].copy_from_slice(&temp[..read_items]);
        self.consume(read_items)?;
        Ok(read_items)
    }

    fn read_exact(&mut self, buf: &mut [u8]) -> Result<(), std::io::Error> {
        let request_items = buf.len();
        // `peek` promises *at least* `request_items`, so take exactly that many.
        buf.copy_from_slice(&self.peek(request_items)?[..request_items]);
        self.consume(request_items)?;
        Ok(())
    }
}

Now sit down for a second and picture how elegant it would be to implement memory-mapped files using Read2 compared to Read. For example, in ripgrep, @BurntSushi had to write two complete implementations of the "searcher" struct, because std::io::Read would have been inefficient when using mmap.

The elegance of the System.IO.Pipelines model of "you get a reference to a buffer with at least 'x' items to peek", instead of "fill a buffer and now it's your problem", would mean that it's likely the entirety of the search_buffer.rs file from ripgrep could be deleted (another 400 lines, alongside the BOM/UTF-16 code which could be simplified using the peek API model). On top of that, I suspect that this chunk of rather complex code would also simplify massively, because you no longer have to worry about rolling over partially consumed buffers yourself.

I'm more impressed now at @BurntSushi's work, but I shouldn't be. He's reinventing wheels and necessarily duplicating code that ought to be reusing the same abstract trait for all implementations...

Please convince yourself that an implementation of Read2 for std::fs::File cannot have fewer copies than what the Read trait does. The type signatures alone already tell me that, and indeed, they directly imply that any implementation of Read2 for std::fs::File that uses standard read calls must necessarily maintain an internal buffer. This is what std::io::BufReader does for any implementation of Read, but Read does not require the use of an internal buffer and is thus more flexible.
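
To make that concrete, a sketch (hypothetical, written against the Read2 definition above) of what an implementation for a File is forced to look like: returning a borrowed slice from peek means owning and maintaining a buffer, which is exactly what BufReader already provides on top of Read.

use std::fs::File;
use std::io::Read;

/// Hypothetical: `Read2` for a file. The signature of `peek` forces an
/// internal buffer, i.e. this is a `BufReader` in disguise.
struct FileRead2 {
    file: File,
    buf: Vec<u8>,   // the internal buffer the trait's signature demands
    filled: usize,  // how many valid bytes are currently in `buf`
}

impl Read2 for FileRead2 {
    type Data = u8;
    type Error = std::io::Error;

    fn peek(&mut self, items: usize) -> Result<&[u8], Self::Error> {
        while self.filled < items {
            if self.buf.len() < items.max(8 * 1024) {
                self.buf.resize(items.max(8 * 1024), 0);
            }
            let n = self.file.read(&mut self.buf[self.filled..])?;
            if n == 0 { break; } // EOF: return whatever we have
            self.filled += n;
        }
        Ok(&self.buf[..self.filled])
    }

    fn consume(&mut self, items: usize) -> Result<(), Self::Error> {
        let items = items.min(self.filled);
        // Roll any leftover bytes to the front of the buffer.
        self.buf.copy_within(items..self.filled, 0);
        self.filled -= items;
        Ok(())
    }
}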

You fundamentally misunderstand the purpose of the buffer rolling. It has to do with line oriented searching, context handling and limiting the use of heap memory. In ripgrep's implementation, there are exactly as many copies as would be done if it used your Read2 implementation. Moreover, ripgrep's implementation permits the amortization of allocation, which is critical, and it's not obvious to me how that would be done with your Read2 trait.

More generally, your Read2 trait assumes the use of an internal buffer. ripgrep's searching requires not only the use of an internal buffer, but one that can be extended dynamically based on the size of the largest line. It's not obvious to me that an implementation of Read2 would support such a use case.

Finally, that there are two implementations of ripgrep's search is a failing of mine, not of the Read trait. The implementations have been unified in my dev branch as part of factoring more of ripgrep's internals out into libraries, and I didn't need Read2 to do it. Moreover, in my dev branch, the library supports an important new feature: the ability to limit or control the amount of heap allocation being done. If I used Read2, then I don't see how that could be implemented using the interface you've provided. It can be done with the Read trait however because the Read trait makes far fewer assumptions than Read2.

This is once again false. I explained why above. This is why I suggested that we stop communicating, because it isn't productive. The regex engine itself requires UTF-8, so regardless of what an implementation of Read2 yields---whether it's u16 or otherwise---some explicit transcoding step is required.

6 Likes

Note that you borrow self mutably and then (presumably) return a slice to a buffer inside self. This will not work nicely, as you'll have to drop the &[Self::Data] before calling peek again, otherwise the borrow checker will rightfully yell at you. Borrow regions could probably help here, but your proposal has another problem: how do you think it will work with e.g. buffered file IO? Also, it will not work if the underlying buffer is not owned, without GATs.

No one forbids you from prototyping such a zero-copy Read trait in a separate crate. Unfortunately you will not be able to write impl<T: ZeroCopyRead> Read for T { .. } (I really hope we will be able to do that in the future), but it's not a huge problem. If your design is good enough and finds enough traction, then the next step could be an RFC proposing the addition of this trait to std. But as you can see, you'll probably encounter a lot of problems and blockers. I agree with @BurntSushi that it's not productive to discuss such hand-wavy proposals, with arguably weak motivation.

2 Likes

This seems like a higher-level API than Read. Non-destructive peeking means the source has a buffer (either naturally or manufactured), whereas Read is just a stream. If you want to add a buffer on top and allow peeking, you can do that yourself (or use BufReader, or put the BufRead trait in the API requirements). BufRead has a fill_buf/consume duo that can be used to do buffering and peeking, and then advancement.
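
For instance, a rough sketch of BOM sniffing written against nothing but std's BufRead (fill_buf plus consume); a real version would loop until enough bytes are buffered, but the peek-then-consume shape is already there:

use std::io::{self, BufRead, BufReader, Read};

/// Detects a BOM and leaves the reader positioned just past it.
fn sniff_bom<R: Read>(inner: R) -> io::Result<(&'static str, BufReader<R>)> {
    let mut reader = BufReader::new(inner);
    // `fill_buf` peeks without consuming; simplification: assume one call
    // buffers at least the first few bytes.
    let (encoding, bom_len) = match reader.fill_buf()? {
        [0xEF, 0xBB, 0xBF, ..] => ("UTF-8", 3),
        [0xFE, 0xFF, ..] => ("UTF-16BE", 2),
        [0xFF, 0xFE, ..] => ("UTF-16LE", 2),
        _ => ("unknown", 0), // nothing consumed: the data passes through untouched
    };
    reader.consume(bom_len);
    Ok((encoding, reader))
}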

I’m not really seeing an issue with Read being the lowest level API. It allows others to build on top. Is the concern then that Rust stdlib doesn’t have a Pipeline-like API out of the box?

5 Likes

Slight tangent but the sdl2 case linked in that discussion is very compelling. The issue of abstraction via methods vs running into borrowck issues comes up on this forum often as well. I think it’s easier to work around it in private APIs or code but it makes it harder to design APIs that you can’t change willy-nilly after release. This added design axis is definitely a challenge IMO.

IIUC, the main issue with the Read trait highlighted by @peter_bertok is that it's not suited for zero-copy processing, e.g. when you use memory-mapped files or have all the data in memory already. It's indeed an issue; e.g. for the rosbag crate I had to write an (almost) zero-copy parser myself (though something like nom could probably have been useful), but I think such a solution should be prototyped outside of std (and maybe even stay outside of it), and that it's currently blocked on implementation of the mentioned proposals.

1 Like

Incidentally, the last time this came up on internals, we failed to come up with any use cases where a "partial borrows" feature in the core language would've actually helped. Specifically, it seemed like all the use cases could be worked around with some refactoring, and any core language change that did not require such refactoring would amount to turning a borrow check error into a silent API compatibility hazard, which seemed like a net loss.

So if anybody does know of a compelling use case where that argument doesn't hold, please necro that thread!

4 Likes

I've read through this thread [edit: most of it; see below] and the blogpost about the Pipelines interface, and I'm not quite sure I understand your current position on the stream-vs-pipes issue, @peter_bertok.

In light of this addendum:

  • Do you still think the Pipeline implementation in C# is an example of what you'd prefer to see in Rust?
  • Since Pipelines are implemented in C# with byte-streams, do you still think the Pipe concept is incompatible with a byte-stream API in principle?
    • You mentioned downthread that you were "wrong...[that Read2 is] inherently incompatible with what's already there," but you went on to say that you would expect Read to be implemented using Read2 (and gave the code for this implementation), not the other way around.
    • Since the C# version of your preferred API is implemented using bytestreams, but you think that's the wrong approach, can you provide an example of an IO pipeline implementation that doesn't rely on bytestreams, or flesh out your ideas for how such a thing could exist? It doesn't seem to me that you've fully addressed the point upthread (unfortunately I can't find the quote) that everything in memory really is byte-based at the lowest level, nor @BurntSushi's point that any implementation of the Pipeline API will necessarily require a large internal buffer that the client can't control.

In short, do you still think that Rust has taken a "wrong turn" that will have permanent negative repercussions that can't be solved without a major breaking change (comparable to e.g. the change in byte/string handling in Python 3), or are you just concerned that the standard library is too "thin" compared to more "batteries-included" languages?

[Edit: this is much the same as @vitalyd's question a few posts up, which was written after I began composing this post.]

This is a fairly minor point, but I'm a bit confused by your "placement" of various languages.

  • It seems like you're saying that Rust "skipped over" the first four issues, but hasn't adequately addressed the fifth. So would it be more accurate to say that Rust is "in between" issues 4 and 5, rather than "on" step 4?
  • I'm familiar with C++14 but don't know as much about C++17. What (if anything) is C++ doing to address this issue? (I agree that in the general case, C++ has a major "legacy garbage" problem, but I'm unaware of anything like the Pipeline API in the C++17 standard library.)

1 Like

If you have all data in memory and want zero copy, isn't that what a &[u8] is for? It sounds to me like Pipelines (I didn't read that much about it, so I might be wrong) manages internal buffers for you, does whatever I/O is needed to fill them, and then exposes slices into them while allowing you to indicate that you're done with a slice. It sounds very much like the ring buffers used in networking. I agree it's useful, but it's fundamentally a different API and use case from Read, or so it seems to me.

4 Likes