Rust beginner notes & questions

std::io::Read doesn’t force UTF-8. In fact, it does not imply any encoding at all: it’s just a stream of bytes. It could be a UTF-8 encoded text file from the local disk, EUC-KR encoded HTML from a gunzip stream, or even a JPEG picture of a kitten from the internet.

Read is a low-level abstraction in the I/O context. It only cares about bytes, because everything in memory is bytes! An arbitrarily typed generic iterator, i.e. std::iter::Iterator, can be constructed on top of it.

I think what gives you that impression is std::io::BufRead::read_line(), which assumes the input stream is UTF-8 encoded. This is just a simple shortcut for the common case, since most streams we handle line-by-line are UTF-8 encoded. But if that’s not your case, you can always bypass this high-level API and handle the bytes directly.
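For instance, a minimal sketch using only std (the helper name is made up, and the sample data is deliberately not valid UTF-8):

```rust
use std::io::BufRead;

/// Collect '\n'-terminated chunks as raw byte vectors, with no UTF-8 assumption.
fn read_lines_as_bytes<R: BufRead>(mut reader: R) -> std::io::Result<Vec<Vec<u8>>> {
    let mut lines = Vec::new();
    let mut line = Vec::new();
    // read_until works on bytes, so any encoding (or none at all) passes through.
    while reader.read_until(b'\n', &mut line)? > 0 {
        lines.push(line.clone());
        line.clear();
    }
    Ok(lines)
}

fn main() {
    // 0xFF can never appear in valid UTF-8, but read_until doesn't care.
    let data: &[u8] = b"hello\n\xFF\xFEraw bytes\n";
    for line in read_lines_as_bytes(std::io::Cursor::new(data)).unwrap() {
        println!("{:?}", line);
    }
}
```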

2 Likes

https://doc.rust-lang.org/std/io/trait.Read.html#method.read_to_string is what doesn’t really belong there, but I suspect it was added as a convenience. An implementation that doesn’t hold UTF-8 strings internally can return an error from that method, but the method ought not to be there in the first place.

1 Like

I honestly wish I could do exactly that, but I don’t use Rust enough to really contribute meaningfully. I’ve dabbled with it just long enough to determine that it won’t help me in any future projects.

Right now, for the kind of work I’m doing, the runtime overhead of C# for me is relatively unimportant compared to its productivity, which is the best of any language I’ve personally used. My next step up would be switching to F# on dotnet core, as that would both boost my productivity and performance significantly. The extra 20%-50% runtime performance from Rust just isn’t worth it compared to the drop in productivity.

For example, Rust Windows interop is… not pretty right now. There just isn’t the same kind of pre-packaged, ready-to-use wrappers around the Win32 APIs that C# has. Does it have the ability to call COM yet? DCOM+? Can you create a socket server with Active Directory Kerberos authentication? Can I validate a certificate against the machine trust store? Last time I checked, there were blocking issues for most of my use-cases, and to be honest I gave up after getting bogged down in all the niggling little issues related to UCS-16 string handling.

At the end of the day, 90% of desktops are still Windows, and well over 50% of all enterprise servers run it too. Rust is very Linux/POSIX centric. All the performance or safety in the world doesn’t help if I can’t get off the ground and make productive progress on a useful project…

2 Likes

Regarding Read and UCS-16: you can always write an extension trait which implements convenience UCS-16 methods while using raw byte I/O under the hood. Should UCS-16 methods, or methods accepting different encodings, be in std? Personally I don’t think so, but it’s a good idea for a crate. (Maybe it already exists?)
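A minimal sketch of what such an extension trait could look like (the trait and method names are hypothetical, UTF-16LE only):

```rust
use std::io::{self, Read};

/// Hypothetical extension trait: UTF-16 conveniences layered on raw byte I/O.
trait ReadUtf16Ext: Read {
    /// Read one little-endian UTF-16 code unit from the underlying byte stream.
    fn read_u16_le(&mut self) -> io::Result<u16> {
        let mut buf = [0u8; 2];
        self.read_exact(&mut buf)?;
        Ok(u16::from_le_bytes(buf))
    }
}

// Blanket impl: every Read automatically gains the convenience method.
impl<R: Read> ReadUtf16Ext for R {}

fn main() {
    let mut cursor = io::Cursor::new(vec![0x48, 0x00, 0x69, 0x00]); // "Hi" in UTF-16LE
    let units = [cursor.read_u16_le().unwrap(), cursor.read_u16_le().unwrap()];
    let text: String = std::char::decode_utf16(units.iter().copied())
        .map(|c| c.unwrap())
        .collect();
    println!("{}", text); // prints "Hi"
}
```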

Have you never written a parser?

What happens when you read a stream of bytes that’s actually UTF-16 encoded?

You get a stream of 16-bit codepoints. Not bytes.

Then if you wish to parse this further with a lexer, you’ll get a stream of tokens, typically 32-bit integers. Not bytes.

Not everything is a byte; that’s why we have strongly typed languages.

Not everything that streams large chunks of contiguous data around is a POSIX file handle and returns 32-bit integer I/O error codes.

In my mind, the ideal trait inheritance hierarchy ought to look something like the following:

// A stream is just a "fat" iterator. (Note: this is a sketch; the associated
// type default below isn't valid stable Rust.)
pub trait Read : Iterator where Self::Item: Copy {
    type Error; // = (); // with associated type defaults

    // Shamelessly copying the C# Pipelines concept here: borrow a view of
    // at least `required_items` items without consuming them.
    fn read( &mut self, required_items: usize ) -> Result<&[Self::Item], Self::Error>;

    // Ditto: mark `items_used` items as consumed.
    fn consume( &mut self, items_used: usize );

    // A stream *really is* an Iterator, allowing fn next() to have a default impl in terms of stream functions!
    // Now if "impl Trait" were used in Iterator's fns, Read could *specialise* things like fn peekable() and the like
    // with versions optimised for streams...
    fn next(&mut self) -> Option<Self::Item> {
        // Copy the item out first so the borrow from read() ends before consume().
        let item = match self.read( 1 ) {
            Ok(b) => b[0],
            Err(_) => return None,
        };
        self.consume( 1 );
        Some(item)
    }
}

pub trait AsyncRead : Read {
    // ... Futures-based async versions of fn read() go here ...
}

// Defaults to bytes, but doesn't force it!
pub trait IORead<Item=u8,Error=i32> : AsyncRead {

    fn close( &mut self );

    fn seek( &mut self, position: u64 );
    
    // ... other functions that are more specific to file descriptors / handles ...
}

Now imagine that you want to parse an XML file with an unknown encoding. Right now, this is… icky in most languages, because you have to read a chunk of the header, try various encodings to find the bit that says what encoding the file is in, then restart from the beginning using a wrapper that converts from bytes to characters. But you’ve already read a bunch of bytes, so now what? Not all streams are rewindable!

With something like the new C# Pipeline I/O API, the low-level parser would start off with a Read<Item=u8>, make the encoding decision, and then the high-level XML parser could use Read<Item=char>. The encoding switch at the beginning would be very neat because you just don’t call consume(); This would work fine even on forward-only streams such as a socket returning compressed data.
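The sniffing decision itself is small; here is a rough, best-effort sketch of the byte patterns involved (loosely following Appendix F of the XML spec — the function name and the returned labels are made up for illustration):

```rust
/// Best-effort guess of an XML stream's encoding family from its first bytes,
/// loosely following Appendix F of the XML spec (BOMs first, then "<?xm" patterns).
fn guess_xml_encoding(prefix: &[u8]) -> &'static str {
    match prefix {
        [0xEF, 0xBB, 0xBF, ..] => "UTF-8 (BOM)",
        [0xFF, 0xFE, ..] => "UTF-16LE (BOM)",
        [0xFE, 0xFF, ..] => "UTF-16BE (BOM)",
        [0x3C, 0x00, 0x3F, 0x00, ..] => "UTF-16LE (no BOM)", // "<?" in UTF-16LE
        [0x00, 0x3C, 0x00, 0x3F, ..] => "UTF-16BE (no BOM)", // "<?" in UTF-16BE
        [0x3C, 0x3F, 0x78, 0x6D, ..] => "UTF-8-compatible",  // "<?xm"
        _ => "unknown; assume UTF-8",
    }
}

fn main() {
    println!("{}", guess_xml_encoding(b"<?xml version=\"1.0\"?>"));
}
```

With a peek/consume API, this function could inspect the peeked buffer and consume only the BOM bytes (if any) before handing the rest on.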

Similarly, if the String type was instead a trait that &[char] mostly implemented, zero-copy parsers would be fairly straightforward with this overall approach…

Behind the scenes, advanced implementations could keep pools of buffers and use scatter/gather I/O for crazy performance. The developer wouldn’t even have to know…

This is what the new C# I/O API is trying to do, but it’s not using the power of template programming to the same level that Rust could. Compare the C# Iterator<T> interface to the Rust Iterator trait. It’s night & day!

3 Likes

In tokio land, you’d implement this with a Decoder layered on top of a raw byte stream (at the lowest level, this is always the type of the stream). The decoder would turn the bytes into whatever higher level type you want, and consumers would work off streams that are decoded underneath. This is all type-safe and uses generics extensively, so gets the optimization/codegen benefits of that. You can then also take a decoded stream/sink and split it into a read and write halves, if you want to operate over the duplex separately. Perhaps you can look at tokio/futures and see if you like it better.

2 Likes

How do you plan to represent UTF-8 in such an approach?

Your posts seem to be written under the assumption of fixed-size encodings. I understand that you come from the Windows world, but Rust has made a conscious decision to use UTF-8 as the main string encoding, and supporting all other kinds of encodings in std would just lead to bloat. And if I’ve understood your proposal correctly, it would result in needless complexity in a lot of code.

Do we need a Windows-oriented ecosystem of crates? Yes, of course. Rust provides excellent tools for developing them. But I don’t think it’s reasonable to expect drastic changes to Rust’s core which would make Windows developers a bit happier but create a ton of problems for everyone else.

2 Likes

The problem with an Iterator-only approach is that it doesn’t scale down to low-level code. Rust is a systems programming language. A common scenario in this kind of I/O is to memcpy incoming bytes from an OS-managed buffer into your own, then parse that byte array to produce meaningful types. How can we model this operation with Iterator? Copying memory byte-by-byte is over 10 times slower than memcpy. Exposing a slice of the internal buffer has lifetime issues, since that buffer needs to be reused. A Vec implies a heap allocation for every read(), which is not acceptable.

ps. Rust’s char type is a 4-byte integer, to represent the full range of Unicode scalar values.

ps2. I did implement a parser a while ago. Check it out here :smiley: https://github.com/HyeonuPark/Nal

2 Likes

ripgrep supports searching either UTF-8 or UTF-16 seamlessly, via BOM sniffing. My Windows users appreciate this. The search implementation itself only cares about getting something that implements io::Read. UTF-16 handling works by implementing a shim for io::Read that transcodes UTF-16 to UTF-8. I did this in less than a day’s work and it was well worth it.
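This is not ripgrep’s actual code, but the shape of such a shim can be sketched in a few lines. To keep it short, this toy version buffers the whole stream up front; the real implementation transcodes incrementally in fixed-size buffers (the function name is made up, UTF-16LE only):

```rust
use std::io::{self, Read};

/// Toy stand-in for a transcoding shim: sniff a UTF-16LE BOM and, if present,
/// transcode the whole stream to UTF-8 up front, then serve it via io::Read.
fn utf16le_shim<R: Read>(mut source: R) -> io::Result<impl Read> {
    let mut bytes = Vec::new();
    source.read_to_end(&mut bytes)?;
    let out = if bytes.starts_with(&[0xFF, 0xFE]) {
        // Decode u16 code units (little-endian), then re-encode as UTF-8.
        let units: Vec<u16> = bytes[2..]
            .chunks_exact(2)
            .map(|p| u16::from_le_bytes([p[0], p[1]]))
            .collect();
        String::from_utf16_lossy(&units).into_bytes()
    } else {
        bytes // no BOM: pass the bytes through untouched
    };
    Ok(io::Cursor::new(out))
}

fn main() {
    // "Watson" encoded as UTF-16LE with a BOM (U+FEFF).
    let utf16: Vec<u8> = std::iter::once(0xFEFFu16)
        .chain("Watson".encode_utf16())
        .flat_map(|u| u.to_le_bytes())
        .collect();
    let mut text = String::new();
    utf16le_shim(io::Cursor::new(utf16))
        .unwrap()
        .read_to_string(&mut text)
        .unwrap();
    println!("{}", text); // prints "Watson"
}
```

Downstream search code never knows the difference: it just sees an io::Read producing UTF-8.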

19 Likes

Ironically, I had the opposite experience when I did something similar a couple of years ago. What I looked up was the implementation for Box, expecting to see something quite simple and similar to C++ auto_ptr or unique_ptr. Perhaps if I looked again now that I have more experience, I would feel differently but at the time I felt that a significant amount of magic was being performed for what in C++ was quite straightforward.

1 Like

I suspect much of that magic is linked to support for Box<Trait>.

1 Like

Small off-topic: why was it decided to join Box<SizedType>, Box<[T]> and Box<Trait> under the single Box roof instead of using three separate types for each use-case, e.g. something like Dyn<Trait> for trait objects?

This sounds like the core of your concerns, and to be honest, I think it is definitely an area Rust could do better at. Maybe next year we need to have a Windows domain working group to put some serious effort behind it. I think the challenge is that it seems like Rust / open source developers gravitate more towards Mac / Linux than your average desktop user so the numbers are skewed.

Who knows, maybe one day Microsoft themselves might help out.

2 Likes

First of all, I have to say that ripgrep is impressive work!! I’ve used it just recently because it smokes everything else if you need to trawl through gigabytes of data for a keyword.

The whole argument I’ve been trying to clumsily make above is that your hard work on things like BOM detection and encoding switching should have been built into Rust, not be part of the ripgrep codebase. At the end of the day, what you’ve written is a single-purpose tool, but large chunks of its codebase look general-purpose to me. That is the “code smell” I’m concerned about. It indicates to me that the Rust library has too many gaps, and people have to reinvent wheels all over the place. Incompatible wheels.

If anything, your effort confirms my argument. E.g.:

If Read were a trait with a type parameter, this would not be an issue, because you could only ever read a whole number of u16 UCS codepoints out of something like Read<u16>!

You had to write about 300 lines of fairly complex code which I don’t believe is zero-copy. It looks like it makes 2-3 copies when processing UCS-16, and probably at least 1 or 2 even with UTF-8, but I’m not sure. The Read trait is inherently copy-based, so I don’t think there’s any way to avoid at least one copy.

In my view, an ideal API should support the most complex, worst-case scenario with the best possible performance. If it can do that, then everything simpler just “falls into place”, and developers like you would not have to reinvent wheels such as BOM detection and encoding switching.

As a worst-case example, imagine that someone wants to decode something hideous, such as:

  • An XML stream that may be in a variety of encodings. The standards-compliant way of doing this can involve reading dozens of bytes into the stream: https://www.w3.org/TR/xml/#sec-guessing-no-ext-info
  • The source is a forward-only stream (e.g. an encrypted or compressed one).
  • The source is being fed in by a user-mode network library, such as from a high-performance RDMA network driver (common with Infiniband or 40 Gbps Ethernet). To enable zero-copy, you can’t provide a buffer during the read() call. Instead, a large pool of buffers must be registered for use by the network stack up-front and then consumed by your code and returned to the pool.
  • The XML contains huge chunks of Base64 encoded binary blobs that are potentially too big to fit into memory. You’d have to stream these out into a destination stream during decoding.
  • The rest of the XML contains millions of small strings (element names) and integer values (element contents) that you do not want to heap allocate during decoding. It’s sufficient to simply compare the names against constant str values and decode the integers directly to i32 values. (e.g.: if xml.node_name == "foo" { ... } ).
  • You want to do all of this without reinventing the wheel at every step. E.g.: the base64 decoding for XML ought to be the same as base64 decoding used everywhere else.

The new C# Pipelines API is targeted at this kind of scenario. I looked at tokio as @vitalyd suggested, but it’s still doing permanently limiting things, such as advancing the stream on read_buf() and assuming that the underlying streams are made up of bytes. Interestingly, they’ve gone half-way with the BufMut trait, but that’s still very byte-centric and will likely not work well with things like text streams.

So for example, imagine you’re flying along, decoding the base64 data in nice 1MB buffer chunks or whatever, and you discover that 732KB into the buffer you’ve just been given is the end of the binary data. The remaining 292KB is XML. Now what? Stuff the unconsumed data back into the previous stream level?

This is why the C# Pipelines API doesn’t consume buffers automatically, because then the base64 decoder can simply mark 732KB as consumed, mark itself as finished, and then the outer XML decoder can continue with the remaining 292KB. This is both smoother for the developer, and faster at runtime. You’ve already had to muck about with (thankfully small) buffers in ripgrep to do BOM detection. This can get much worse in more complex scenarios. Think 5-7 layers of decoder nesting, not just 1-2.

These tiny API design decisions can have huge ramifications down the track. Hence my disappointment with things like Read::read_to_string(). It shows that very minor short-term convenience won out over design that can last into the future.

Before people chime in and complain that I’m just inventing unrealistic scenarios, imagine trying to extend ripgrep to support searching through text in OOXML documents such as Word DOCX or Excel XLSX documents. These are potentially very large (>1GB), compressed via Zip, and can be encoded with either UTF-8 or UTF-16. Internally, the XML files can be split into “parts”, which are like Zip files split into multiple archives. A compliant decoder has to be able to: append streams, decode forward-only, do XML encoding detection, and stitch together XML text fragments into a single “character stream” to do matching on.

Now imagine writing a high-performance “virtual appliance” that does regular-expression based “data loss prevention” scanning of documents passing through it at 40 Gbps. In principle, this is not all that different to the ripgrep use-case, and the code ought to look similar.

1 Like

It really doesn’t. The transcoding is itself handled by a separate crate, and the shim itself isn’t specific to ripgrep and could be lifted into a separate crate. Any enterprising individual could accomplish that. ripgrep used to be much more monolithic, and I’ve been steadily moving pieces out into separate crates. The UTF-16 shim is one such candidate for moving into a separate crate, but nobody has put in the work to do it.

That’s false. UTF-16 is a variable width encoding (not all Unicode codepoints are representable via a single UTF-16 code unit), and I still need to transcode it to UTF-8 in order to search it. The regex engine could natively support UTF-16, but that has nothing to do with the definition of the Read trait and is a huge complication for very little gain. It’s much simpler to just transcode.

Which, again, could be shared with some effort. This is the premise of the Rust ecosystem: a small std library with a very low barrier to using crates in the ecosystem.

No. The shim is doing buffered reading. Specifically, if the shim is wrapped around a fs::File, then:

  1. UTF-16 encoded bytes are copied to an internal buffer directly from a read syscall (kernel to user).
  2. Transcoding is performed from the bytes in the internal buffer to the caller’s buffer directly.

A perusal of the code makes it look like an additional copy is happening, but in practice, this copy is just rolling a small number of bytes from the end of the buffer to the beginning of the buffer that either couldn’t fit in the caller’s buffer or represent an incomplete UTF-16 sequence.
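That “rolling” step is essentially just a copy_within on the internal buffer. A small illustration (the buffer contents here are made up):

```rust
fn main() {
    // Suppose the first 6 bytes were fully transcoded, and 2 leftover bytes
    // (an incomplete UTF-16 sequence, say) must survive into the next fill.
    let mut buf = [1u8, 2, 3, 4, 5, 6, 0xD8, 0x3D];
    let leftover = 2;
    let start = buf.len() - leftover;
    buf.copy_within(start.., 0); // roll the tail to the front of the buffer
    // The next read syscall would now refill buf[leftover..].
    println!("{:?}", &buf[..leftover]); // prints [216, 61]
}
```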

No. The Read trait is just an OS independent interface that loosely describes how to read data. For example, when reading from a File, the buffer provided to the read method is going to be written to directly by the OS. That’s as little possible copying as you can do. To do better, you need to go into kernel land or use memory maps.

You’re conflating concepts here. The additional copying is only necessary because I’m doing transcoding and because I wanted buffered reading. The extra copy from the transcoding could be avoided if the regex engine supported searching UTF-16 encoded bytes directly, but it doesn’t. And again, this has nothing at all to do with the Read trait and everything to do with implementation details of how the regex engine was built.

(The extra copy here is also a red herring. The transcoding itself is the bottleneck.)

But ripgrep already does this, because Read implementations are composable:

$ cat sherlock
For the Doctor Watsons of this world, as opposed to the Sherlock
Holmeses, success in the province of detective work must always
be, to a very large extent, the result of luck. Sherlock Holmes
can extract a clew from a wisp of straw or a flake of cigar ash;
but Doctor Watson has to have it taken out for him and dusted,
and exhibited clearly, with a label attached.
$ iconv -f UTF-8 -t UTF-16 sherlock > sherlock-utf16

$ rg Watson sherlock
1:For the Doctor Watsons of this world, as opposed to the Sherlock
5:but Doctor Watson has to have it taken out for him and dusted,

$ rg Watson sherlock-utf16
1:For the Doctor Watsons of this world, as opposed to the Sherlock
5:but Doctor Watson has to have it taken out for him and dusted,

$ gzip sherlock-utf16
$ rg -z Watson sherlock-utf16.gz
1:For the Doctor Watsons of this world, as opposed to the Sherlock
5:but Doctor Watson has to have it taken out for him and dusted,

How do you think this works? There’s a shim for doing gzip decompression, just like for UTF-16 transcoding. These shims don’t know about each other but compose perfectly fine. This is the first time I’ve even bothered to try searching gzip compressed UTF-16, and it “just worked.”
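The composition works because each shim both consumes and implements io::Read. Two toy shims (not ripgrep’s actual code; the struct names and the XOR “cipher” are invented for illustration) show the pattern:

```rust
use std::io::{self, Read};

/// A toy shim: XOR-"decrypts" every byte it reads from the inner reader.
struct XorShim<R: Read> { inner: R, key: u8 }

impl<R: Read> Read for XorShim<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.inner.read(buf)?;
        for b in &mut buf[..n] { *b ^= self.key; }
        Ok(n)
    }
}

/// Another shim that knows nothing about XorShim: uppercases ASCII bytes.
struct UpperShim<R: Read> { inner: R }

impl<R: Read> Read for UpperShim<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.inner.read(buf)?;
        for b in &mut buf[..n] { *b = b.to_ascii_uppercase(); }
        Ok(n)
    }
}

fn main() {
    let secret: Vec<u8> = b"watson".iter().map(|b| b ^ 0x2A).collect();
    // Layer the shims: XOR-decode first, then uppercase. Neither knows the other.
    let mut stack = UpperShim {
        inner: XorShim { inner: io::Cursor::new(secret), key: 0x2A },
    };
    let mut out = String::new();
    stack.read_to_string(&mut out).unwrap();
    println!("{}", out); // prints "WATSON"
}
```

Swap Cursor for a File or a TcpStream and nothing else changes; that is the composability being described above.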

Yes, ripgrep contains these shims, but that’s just because nobody has productionized them. This doesn’t mean Rust’s standard library has to do it, or even that the Read trait needs to change for this to happen. Somebody just needs to put in the work, and that’s true regardless of whether it lives in std or in a crate.

I don’t see any reason why the presence of read_to_string prevents the use cases you’re talking about.

There are certainly a lot of moving parts here, but I don’t see any reason why the Rust ecosystem isn’t well suited to solve a problem like this. The interesting bits are building compliant decoders and supporting routines that can search character streams (which in the general case is always going to be slow). The Read trait isn’t going to prevent you from doing that.

13 Likes

How would you implement this for a Read over f32? What if you’re trying to treat incoming measurement data as a stream of numbers, e.g. for DSP-style programming? You can always use unimplemented!() or panic!(), but that’s really icky, because then libraries all over the place would have to include code that can crash the process.

Or you could reinvent the general concept of streams for your special case.

Either way, eww…

There are certainly a lot of moving parts here…

There doesn’t have to be!

Your 300 lines of BOM-peeker code never needed to exist in the first place. If Read were designed more like System.IO.Pipelines and separated the “get a buffer” and “consume input items” concepts, then the BOM detection code would be hilariously trivial.

It would look vaguely like the following:

// Sketch only: Read2 and the converter types are hypothetical, and the borrow
// of `source` through the match is hand-waved.
fn bom_detection<R: Read2<Data = u8>>( mut source: R ) -> Box<dyn Read2<Data = u8>> {
    // The reader provides the buffer, and we specify the *minimum* required elements (bytes).
    match source.read( 3 ) {
        // A variable number of bytes can be consumed!
        Ok([0xEF, 0xBB, 0xBF, ..]) => { source.consume(3); Box::new( source ) }
        Ok([0xFE, 0xFF, ..])       => { source.consume(2); Box::new( UCS16BigEndianConverter::new( source ) ) }
        Ok([0xFF, 0xFE, ..])       => { source.consume(2); Box::new( UCS16LittleEndianConverter::new( source ) ) }
        // Pass-through works trivially because we're not forced to consume any of the bytes!
        _ => Box::new( source ),
    }
}

Note the similarity with PEG-based parsers such as the nom crate, which use a similar “peek-and-consume-if-matched” pattern.

Similarly, zero-copy I/O doesn’t have to be complicated, but the only way to get it is for read() to provide the buffer to the consumer, instead of the consumer passing in a buffer to be filled. These are fundamentally opposite concepts, and the latter can never support the zero-copy scenario in the general case.

Your ripgrep utility gets to cheat with the special case of memory-mapping files, but this just doesn’t work for network streams.

Think about it: where is the network stack going to put the data in between read() calls if read() is providing it the buffers… one at a time? The only way to do this is to give the network stack a bunch of buffers that it can fill itself. The user-mode code can then consume some of the buffers, process them, and return them to the pool while the rest of the buffers are being filled behind the scenes by RDMA.

It’s a push-vs-pull API difference that can never be reconciled. You have to do one or the other. This should have been foreseen, but wasn’t. The Windows Network Direct API has been around since 2010, and IIRC Linux actually beat them to it by several years because of the pervasive use of this type of programming in HPC clusters. Both revolve around providing a pool of buffers up front.

I’m not saying that the Rust team needed to implement HPC RDMA I/O from day #1, but a tiny bit of foresight is all it takes. The difference is literally 1 vs 2 fns in the Read trait.

So all I’m saying is that the design of “all streams can be thought of as copying into a client-provided byte buffer, which is basically UTF-8 enough of the time for this convenience method to be present” is just wrong. No amount of wishful thinking will ever make it the general case. Meanwhile, the general case smoothly supports that scenario, plus integer token streams produced by lexers, f32 streams passed to DSPs, RDMA at 100Gbps, and so on, and so forth…

I think one key point of the discussion here is that Rust provides a thin runtime which is close to the lowest-common-denominator OS API (in this case POSIX’s byte-oriented IO), whereas C# provides a thick runtime which is close to programmer use cases (in this case structured IO). As much as I dislike this dichotomy, this is the textbook difference between a low-level programming language (annoying but predictable) and a high-level programming language (comfy but uncontrollable).

Now, of course, we could dream of an ideal world in which our programmer comfort would not be disturbed by crufty API design from the 70s. But if we cannot get that, the next best choice is to cater to both application devs and system devs using different tools. I would say that this is why having both Rust and C# is a good thing.

10 Likes

None of my examples with ripgrep used memory maps, so I don’t know why you’re bringing that up. My shim for transcoding doesn’t assume the presence of a caller provided buffer, but it could and the code would be simpler but make more assumptions.

This conversation is going in circles and there is too much certainty in your comments for my taste. Our lack of shared experience is preventing us from communicating productively, and in particular, it’s pretty hard for me to grok everything you’re saying. I personally don’t have any experience with C#'s pipeline concept, so I can’t really keep up. I suspect the reverse is true as well.

Usually the thing that helps move discussions like this forward is code, but building a prototype of your ideas in Rust is probably a lot of work. So I don’t know how to continue. Sorry.

9 Likes

With composed readers: ByteReader (the lowest level, provided by std) -> BufferedReader (provided by std) -> FloatReader (provided by a crate, or write it yourself). Same as anywhere else.
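A FloatReader in that spirit might be sketched as follows (the names are hypothetical, and fixed little-endian f32 framing is assumed):

```rust
use std::io::{self, Read};

/// Hypothetical FloatReader: yields f32 samples from any byte-oriented Read.
struct FloatReader<R: Read> { inner: R }

impl<R: Read> FloatReader<R> {
    /// Read the next sample, or None at end of stream.
    /// (A trailing partial sample of fewer than 4 bytes is treated as EOF.)
    fn next_sample(&mut self) -> io::Result<Option<f32>> {
        let mut buf = [0u8; 4];
        match self.inner.read_exact(&mut buf) {
            Ok(()) => Ok(Some(f32::from_le_bytes(buf))),
            Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => Ok(None),
            Err(e) => Err(e),
        }
    }
}

fn main() {
    let bytes: Vec<u8> = [1.5f32, -2.0].iter().flat_map(|f| f.to_le_bytes()).collect();
    let mut samples = FloatReader { inner: io::Cursor::new(bytes) };
    while let Some(s) = samples.next_sample().unwrap() {
        println!("{}", s);
    }
}
```

Because it wraps any Read, the same code works over a Cursor, a BufReader<File>, or a socket.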

My proposal is a lower “denominator” than the current Read trait. In fact, now that I think about it, I was wrong in my earlier statement that it can’t be retrofitted into Rust because it’s inherently incompatible with what’s already there.

The exact opposite is true: It is a strict superset of std::io::Read, allowing it to implement the Read trait for the special case of u8. Meanwhile, the Read trait cannot implement the more elegant zero-copy trait, because:

  • It cannot read without consuming bytes.
  • It cannot read non-copy types even if generalised to a template trait with a default u8 parameter.
  • It breaks the performance contract of zero copy.

Let’s call my proposal Read2:

trait Read2 {
    type Data;  // = u8; // with associated type defaults.
    type Error; // = (); // with associated type defaults.

    /// Returns a view of at least `items` elements; `items` can be 0 for best-effort.
    fn peek(&mut self, items: usize ) -> Result<&[Self::Data], Self::Error>;

    /// Can consume any number of items, acting much like `skip()`.
    fn consume(&mut self, items: usize ) -> Result<(), Self::Error>;
}

// Ta-da: backwards compatibility! (Sketch: a blanket impl like this would
// collide with std's coherence rules in practice; it's here for illustration.)
impl<T: Read2<Data = u8, Error = std::io::Error>> std::io::Read for T {
    fn read(&mut self, buf: &mut [u8]) -> Result<usize, std::io::Error> {
        let read_items: usize;
        // Even with NLL this scope is required to end the borrow. Ugh!
        {
            let temp = self.peek( 0 )?;
            // Never overrun the caller's buffer.
            read_items = temp.len().min( buf.len() );
            // THIS is the unavoidable copy inherent in all implementors
            // of std::io::Read.
            buf[..read_items].copy_from_slice( &temp[..read_items] );
        }
        self.consume( read_items )?;
        Ok( read_items )
    }

    fn read_exact(&mut self, buf: &mut [u8]) -> Result<(), std::io::Error> {
        // Calling buf.len() while buf is mutably borrowed makes the borrow checker cry,
        // so a temp copy of the len is required. Once again... ugh!
        let request_items: usize = buf.len();
        // peek() may return more than requested, so slice to exactly the needed length.
        buf.copy_from_slice( &self.peek( request_items )?[..request_items] );
        self.consume( request_items )?;
        Ok(())
    }
}
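Worth noting: for the byte-only case, std’s BufRead trait already exposes a peek/consume pair in this shape — fill_buf() hands back a slice of the internal buffer, and consume() advances past only what you actually used. For example:

```rust
use std::io::{BufRead, Cursor, Read};

fn main() {
    let mut reader = Cursor::new(b"\xEF\xBB\xBFhello".to_vec());
    // Peek without consuming: fill_buf returns a slice of the internal buffer.
    let has_bom = reader.fill_buf().unwrap().starts_with(&[0xEF, 0xBB, 0xBF]);
    if has_bom {
        reader.consume(3); // consume exactly what we matched, no more
    }
    let mut rest = String::new();
    reader.read_to_string(&mut rest).unwrap();
    println!("{}", rest); // prints "hello"
}
```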

Now sit down for a second and picture how elegant it would be to implement memory mapped files using Read2 compared to Read. For example, in ripgrep, @BurntSushi had to write two complete implementations of the “searcher” struct, because std::io::Read would have been inefficient when using mmap:

The elegance of the System.IO.Pipelines model of “you get a reference to a buffer with at least ‘x’ items to peek” instead of “fill a buffer and now it’s your problem” means it’s likely that the entire search_buffer.rs file from ripgrep could be deleted (another 400 lines, alongside the BOM/UCS16 code that could be simplified using the peek API model). On top of that, I suspect that this chunk of rather complex code would also simplify massively, because you no longer have to worry about rolling over partially consumed buffers yourself:

I’m even more impressed now at @BurntSushi’s work, but I shouldn’t have to be. He’s reinventing wheels and necessarily duplicating code that ought to be reusing the same abstract trait across all implementations…