Rust beginner notes & questions

It's definitely not all of it, but I think it's very close. Particularly if you generalize "memory management" to soundness. A lot of the difficulty comes from being a low-level language trying to marry in high-level features while staying sound at compile time. That's a very hefty (and praiseworthy!) goal.

4 Likes

I think calling C# a really well-designed language is a bit of an overstatement. As far as I'm concerned (and I like both Java and C# for what they are), it is just Java with somewhat better support for value types. They frankly got it wrong with exceptions (as you mentioned). They got it wrong in how they handle "null" (as you mentioned). They got it wrong with respect to volatile (as you mentioned). It really is only a marginal improvement, at best, over Java (and even that is debatable). I really can't see much real advantage of C# over Java for most cases. Java tends to push as much as possible to libraries/the JDK, whereas C# tends to incorporate new language features more often, but I really don't see that one is necessarily much better or worse than the other. I prefer Java exception handling over C# exception handling, but I like the Rust way of error/alternative handling even better. I think Rust is getting a LOT of things right, but there is definitely room for improvement, and the comments by the OP can help inform the discussion (even if they do, at first reading, come off a little snide or combative).

5 Likes

IMO, C# is a well-designed language. Is it perfect? No, as mentioned. But I don't know any perfect language. I've not followed it too closely in the last few years, but I recall in the beginning there was nice consistency and "flow" to features added in version N and how they enabled something else in version N+1. There's a lot right about C# if you don't mind a GC/JIT/managed runtime.

To call C# "just Java with a little better support for value types" is ... disingenuous at best :slight_smile:. I really don't want to sidetrack this thread into Java vs C# (or Rust vs C#, for that matter), so I'll stop here. But I've used C# and Java extensively, and their comparison ends right on the surface for me.

5 Likes

std::io::Read doesn't force UTF-8. In fact, it does not imply any encoding - it's just a stream of bytes. It can be a UTF-8 encoded text file from the local disk, EUC-KR encoded HTML from a gunzip stream, or even a JPEG-encoded picture of a kitten from the internet.

Read is used as a low-level abstraction in an I/O context. It only cares about bytes, because everything in memory is bytes! An arbitrarily typed generic iterator - which is what std::iter::Iterator is for - can be constructed on top of it.

I think what gives you that impression is std::io::BufRead::read_line(), which assumes the input stream is UTF-8 encoded. This is just a shortcut for the common case, since most streams we handle line-by-line are UTF-8 encoded. But if that's not your case, you can always bypass such high-level APIs and handle the bytes directly.
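For example, here is a minimal sketch (the helper function is made up, but read_until is the real BufRead method) of reading "lines" as raw bytes, without assuming or validating any encoding:

use std::io::{self, BufRead};

// Read up to and including the next b'\n', as raw bytes.
// Unlike read_line(), read_until() never validates UTF-8.
fn read_raw_line<R: BufRead>(src: &mut R) -> io::Result<Option<Vec<u8>>> {
    let mut line = Vec::new();
    let n = src.read_until(b'\n', &mut line)?;
    if n == 0 {
        Ok(None) // EOF
    } else {
        Ok(Some(line)) // may contain arbitrary bytes, including the trailing b'\n'
    }
}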

2 Likes

Read in std::io - Rust is what doesn't really belong there, but I suspect it was added as a convenience. An implementation that doesn't have UTF-8 strings internally can return an error for that method, but that method ought not to be there in the first place.

1 Like

I honestly wish I could do exactly that, but I don't use Rust enough to really contribute meaningfully. I've dabbled with it just long enough to determine that it won't help me in any future projects.

Right now, for the kind of work I'm doing, the runtime overhead of C# for me is relatively unimportant compared to its productivity, which is the best of any language I've personally used. My next step up would be switching to F# on dotnet core, as that would both boost my productivity and performance significantly. The extra 20%-50% runtime performance from Rust just isn't worth it compared to the drop in productivity.

For example, Rust Windows interop is... not pretty right now. There just isn't the same kind of pre-packaged, ready-to-use wrappers around the Win32 APIs that C# has. Does it have the ability to call COM yet? DCOM+? Can you create a socket server with Active Directory Kerberos authentication? Can I validate a certificate against the machine trust store? Last time I checked, there were blocking issues for most of my use-cases, and to be honest I gave up after getting bogged down in all the niggling little issues related to UCS-16 string handling.

At the end of the day, 90% of desktops are still Windows, and well over 50% of all enterprise servers run it too. Rust is very Linux/POSIX centric. All the performance or safety in the world doesn't help if I can't get off the ground and make productive progress on a useful project...

2 Likes

Regarding Read and UCS-16: you can always write an extension trait which implements convenience UCS-16 methods while using raw byte I/O under the hood. Should UCS-16 methods, or methods which accept different encodings, be in std? Personally I don't think so, but it's a good idea for a crate. (Maybe it already exists?)
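As a rough illustration of that extension-trait idea (the trait and method names here are made up, and the sketch naively assumes little-endian UTF-16 with no BOM handling):

use std::io::{self, Read};

// Hypothetical extension trait: UTF-16 convenience on top of raw byte I/O.
trait ReadUtf16Ext: Read {
    // Read the whole stream as little-endian UTF-16 and return a Rust String.
    fn read_to_string_utf16le(&mut self) -> io::Result<String> {
        let mut bytes = Vec::new();
        self.read_to_end(&mut bytes)?;
        let units: Vec<u16> = bytes
            .chunks_exact(2)
            .map(|b| u16::from_le_bytes([b[0], b[1]]))
            .collect();
        String::from_utf16(&units)
            .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
    }
}

// Blanket impl: every Read gets the convenience method for free.
impl<R: Read> ReadUtf16Ext for R {}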

Have you never written a parser?

What happens when you read a stream of bytes that's actually UTF-16 encoded?

You get a stream of 16-bit codepoints. Not bytes.

Then if you wish to parse this further with a lexer, you'll get a stream of tokens, typically 32-bit integers. Not bytes.

Not everything is a byte; that's why we have strongly typed languages.

Not everything that streams large chunks of contiguous data around is a POSIX file handle and returns 32-bit integer I/O error codes.

In my mind, the ideal trait inheritance hierarchy ought to look something like the following:

// A stream is just a "fat" iterator.
pub trait Read: Iterator {
    // (Associated-type defaults are still unstable, but this is the intent.)
    type Error = ();

    // Shamelessly copying the C# Pipeline concept here: ask for at least
    // `required_items` and get back a borrowed view of the stream's own buffer.
    // (Rust has no default arguments, so the original `= 0` default is dropped.)
    fn read(&mut self, required_items: usize) -> Result<&[Self::Item], Self::Error>;

    // Ditto: tell the stream how many of the provided items were actually used.
    fn consume(&mut self, items_used: usize);

    // A stream *really is* an Iterator, allowing fn next() to have a default impl in terms of stream functions!
    // Now if "impl Trait" were used in Iterator's fns, Read could *specialise* things like fn peekable() and the like
    // with versions optimised for streams...
    fn next(&mut self) -> Option<Self::Item>
    where
        Self::Item: Copy, // for the simplicity of this sketch
    {
        // Peek one item, copy it out, then mark it consumed.
        let item = *self.read(1).ok()?.first()?;
        self.consume(1);
        Some(item)
    }
}

pub trait AsyncRead : Read { 
    // ... Futures-based async versions of fn read() goes here ...
}

// Defaults to bytes, but doesn't force it!
pub trait IORead<Item = u8, Error = i32>: AsyncRead {

    fn close( &mut self );

    fn seek( &mut self, position: u64 );
    
    // ... other functions that are more specific to file descriptors / handles ...
}

Now imagine that you want to parse an XML file with an unknown encoding. Right now, this is... icky in most languages, because you have to read a chunk of the header, try various encodings to find the bit that says what encoding the file is in, then restart from the beginning using a wrapper that converts from bytes to characters. But you've already read a bunch of bytes, so now what? Not all streams are rewindable!

With something like the new C# Pipeline I/O API, the low-level parser would start off with a Read<Item=u8>, make the encoding decision, and then the high-level XML parser could use Read<Item=char>. The encoding switch at the beginning would be very neat because you just don't call consume(); This would work fine even on forward-only streams such as a socket returning compressed data.

Similarly, if the String type was instead a trait that &[char] mostly implemented, zero-copy parsers would be fairly straightforward with this overall approach...

Behind the scenes, advanced implementations could keep pools of buffers and use scatter/gather I/O for crazy performance. The developer wouldn't even have to know...

This is what the new C# I/O API is trying to do, but it's not using the power of template programming to the same level that Rust could. Compare the C# IEnumerator<T> interface to the Rust Iterator trait. It's night & day!

3 Likes

In tokio land, you'd implement this with a Decoder layered on top of a raw byte stream (at the lowest level, this is always the type of the stream). The decoder would turn the bytes into whatever higher level type you want, and consumers would work off streams that are decoded underneath. This is all type-safe and uses generics extensively, so gets the optimization/codegen benefits of that. You can then also take a decoded stream/sink and split it into a read and write halves, if you want to operate over the duplex separately. Perhaps you can look at tokio/futures and see if you like it better.
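To make that concrete, here is a minimal sketch of a custom Decoder (using the Decoder trait as it appears in today's tokio-util crate - the exact module has moved between tokio versions - and LineDecoder is just a made-up example that frames newline-delimited Strings):

use bytes::BytesMut;
use tokio_util::codec::Decoder;

// Hypothetical decoder: turns a raw byte stream into newline-delimited Strings.
struct LineDecoder;

impl Decoder for LineDecoder {
    type Item = String;
    type Error = std::io::Error;

    fn decode(&mut self, src: &mut BytesMut) -> Result<Option<String>, std::io::Error> {
        // Look for a newline in the bytes buffered so far.
        if let Some(pos) = src.iter().position(|&b| b == b'\n') {
            // split_to removes the line (plus its newline) from the front of the buffer.
            let line = src.split_to(pos + 1);
            Ok(Some(String::from_utf8_lossy(&line[..pos]).into_owned()))
        } else {
            // Not enough data yet; the framed stream will call decode again with more bytes.
            Ok(None)
        }
    }
}

// Usage sketch: FramedRead::new(some_async_reader, LineDecoder) then yields a Stream of Strings.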

2 Likes

How do you plan to represent UTF-8 in such an approach?

Your posts seem to be written under the assumption of fixed-size encodings. I understand that you come from the Windows world, but Rust made a conscious decision to use UTF-8 as the main string encoding, and supporting all other kinds of encodings in std would just lead to bloat. And if I understood your proposal correctly, it would result in needless complexity in a lot of code.

Do we need a Windows-oriented ecosystem of crates? Yes, of course. Rust provides excellent tools for developing them. But I don't think it's reasonable to expect drastic changes to Rust's core that would make Windows developers a bit happier but create a ton of problems for everyone else.

2 Likes

The problem with an Iterator-only approach is that it doesn't scale down to low-level code. Rust is a systems programming language. A common scenario for this kind of I/O is to memcpy incoming bytes from an OS-managed buffer into my own, then parse that byte array to produce meaningful types. How can we model this operation with Iterator? Copying memory byte-by-byte is more than ten times slower than memcpy. Exposing a slice of the internal buffer has lifetime issues, since the buffer needs to be reused. Returning a Vec implies a heap allocation for every read(), which is not acceptable.
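(For reference, std's existing BufRead already models the "fill a reusable internal buffer, parse a borrowed slice, then consume" pattern described above via fill_buf()/consume(); a minimal sketch that counts newlines without copying the data out of the reader's buffer:)

use std::io::{self, BufRead};

fn count_newlines<R: BufRead>(mut src: R) -> io::Result<usize> {
    let mut total = 0;
    loop {
        let consumed = {
            // Borrow the reader's internal buffer directly; no copy into our own Vec.
            let buf = src.fill_buf()?;
            if buf.is_empty() {
                break; // EOF
            }
            total += buf.iter().filter(|&&b| b == b'\n').count();
            buf.len()
        };
        // Tell the reader how much of its buffer we actually used.
        src.consume(consumed);
    }
    Ok(total)
}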

P.S. Rust's char type is a 4-byte integer, so it can represent the full range of Unicode scalar values.

P.P.S. I did implement a parser a while ago. Check it out here :smiley: https://github.com/HyeonuPark/Nal

2 Likes

ripgrep supports searching either UTF-8 or UTF-16 seamlessly, via BOM sniffing. My Windows users appreciate this. The search implementation itself only cares about getting something that implements io::Read. UTF-16 handling works by implementing a shim for io::Read that transcodes UTF-16 to UTF-8. I did this in less than a day's worth of work and it was well worth it.
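The general shape of such a shim looks roughly like the sketch below (this is not ripgrep's actual code - a real shim delegates transcoding to a dedicated crate and handles BOMs, chunk boundaries, and incomplete sequences - it only shows the "wrap a Read, hand the caller UTF-8" idea):

use std::io::{self, Read};

// Simplified sketch: wraps a reader that yields UTF-16LE bytes and exposes it
// as UTF-8 bytes through the ordinary io::Read interface.
struct Utf16LeToUtf8<R> {
    inner: R,
    pending: Vec<u8>, // transcoded UTF-8 not yet handed to the caller
}

impl<R: Read> Read for Utf16LeToUtf8<R> {
    fn read(&mut self, out: &mut [u8]) -> io::Result<usize> {
        if self.pending.is_empty() {
            // Pull a chunk of raw UTF-16LE bytes from the inner reader.
            let mut raw = [0u8; 4096];
            let n = self.inner.read(&mut raw)?;
            if n == 0 {
                return Ok(0); // EOF
            }
            // Naive transcoding: assumes `n` is even and that surrogate pairs never
            // straddle a chunk boundary; a production shim must handle both.
            let units: Vec<u16> = raw[..n]
                .chunks_exact(2)
                .map(|b| u16::from_le_bytes([b[0], b[1]]))
                .collect();
            let text: String = std::char::decode_utf16(units)
                .map(|r| r.unwrap_or(std::char::REPLACEMENT_CHARACTER))
                .collect();
            self.pending = text.into_bytes();
        }
        // Copy as much transcoded UTF-8 as fits into the caller's buffer.
        let len = out.len().min(self.pending.len());
        out[..len].copy_from_slice(&self.pending[..len]);
        self.pending.drain(..len);
        Ok(len)
    }
}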

19 Likes

Ironically, I had the opposite experience when I did something similar a couple of years ago. What I looked up was the implementation for Box, expecting to see something quite simple and similar to C++ auto_ptr or unique_ptr. Perhaps if I looked again now that I have more experience, I would feel differently but at the time I felt that a significant amount of magic was being performed for what in C++ was quite straightforward.

1 Like

I suspect much of that magic is linked to support for Box<Trait>.

1 Like

Small off-topic: why was it decided to join Box<SizedType>, Box<[T]> and Box<Trait> under the single Box roof, instead of using three separate types for each use-case, e.g. something like Dyn<Trait> for trait objects?

This sounds like the core of your concerns, and to be honest, I think it is definitely an area Rust could do better at. Maybe next year we need to have a Windows domain working group to put some serious effort behind it. I think the challenge is that Rust / open source developers seem to gravitate more towards Mac / Linux than your average desktop user does, so the numbers are skewed.

Who knows, maybe one day Microsoft themselves might help out.

2 Likes

First of all, I have to say that ripgrep is impressive work!! I've used it just recently because it smokes everything else if you need to trawl through gigabytes of data for a keyword.

The whole argument I've been trying to clumsily make above is that your hard work for things like BOM detection and encoding switching should have been built into Rust and not be part of the ripgrep codebase. At the end of the day, what you've written is a single-purpose tool, but large chunks of its codebase look general-purpose to me. That is the "code smell" I'm concerned about. It indicates to me that the Rust library has too many gaps, and people have to reinvent wheels all over the place. Incompatible wheels.

If anything, your effort confirms my argument. E.g.:

https://github.com/BurntSushi/ripgrep/blob/b38b101c77003fb94aaaa8084fcb93b6862586eb/src/decoder.rs#L122-L126

If Read was a trait with a type parameter, this would not be an issue, because you could only ever read a whole number of u16 UCS codepoints out of something like Read<u16>!

You had to write about 300 lines of fairly complex code which I don't believe is zero-copy. It looks like it's making 2-3 copies when processing UCS-16, and probably at least 1 or 2 even with UTF-8, but I'm not sure. The Read trait is inherently copy-based, so I don't think there's any way to avoid at least 1 copy.

In my imagination, an ideal API should support the most complex, worst-case scenario with the best possible performance. If it can do that, then everything simpler should just "fall into place", and developers like you would not have to reinvent wheels such as BOM detection and encoding switching.

As a worst-case example, imagine that someone wants to decode something hideous, such as:

  • An XML stream that may be in a variety of encodings. The standards-compliant way of doing this can involve reading dozens of bytes into the stream: Extensible Markup Language (XML) 1.0 (Fifth Edition)
  • The source is a forward-only stream (e.g.: an encrypted or compressed stream).
  • The source is being fed in by a user-mode network library, such as from a high-performance RDMA network driver (common with Infiniband or 40 Gbps Ethernet). To enable zero-copy, you can't provide a buffer during the read() call. Instead, a large pool of buffers must be registered for use by the network stack up-front and then consumed by your code and returned to the pool.
  • The XML contains huge chunks of Base64 encoded binary blobs that are potentially too big to fit into memory. You'd have to stream these out into a destination stream during decoding.
  • The rest of the XML contains millions of small strings (element names) and integer values (element contents) that you do not want to heap allocate during decoding. It's sufficient to simply compare the names against constant str values and decode the integers directly to i32 values. (e.g.: if xml.node_name == "foo" { ... } ).
  • You want to do all of this without reinventing the wheel at every step. E.g.: the base64 decoding for XML ought to be the same as base64 decoding used everywhere else.

The new C# Pipelines API is targeted at this kind of scenario. I looked at tokio as @vitalyd suggested, but it's still doing permanently limiting things, such as advancing the stream on read_buf() and assuming that the underlying streams are made up of bytes. Interestingly, they've gone half-way with the BufMut trait, but that's still very byte-centric and will likely not work well with things like text streams.

So for example, imagine you're flying along, decoding the base64 data in nice 1MB buffer chunks or whatever, and you discover that 732KB into the buffer you've just been given is the end of the binary data. The remaining 292KB is XML. Now what? Stuff the unconsumed data back into the previous stream level?

This is why the C# Pipelines API doesn't consume buffers automatically, because then the base64 decoder can simply mark 732KB as consumed, mark itself as finished, and then the outer XML decoder can continue with the remaining 292KB. This is both smoother for the developer, and faster at runtime. You've already had to muck about with (thankfully small) buffers in ripgrep to do BOM detection. This can get much worse in more complex scenarios. Think 5-7 layers of decoder nesting, not just 1-2.

These tiny API design decisions can have huge ramifications down the track. Hence my disappointment with things like Read::read_to_string(). It shows that very minor short-term convenience won out over design that can last into the future.

Before people chime in and complain that I'm just inventing unrealistic scenarios, imagine trying to extend ripgrep to support searching through text in OOXML documents such as Word DOCX or Excel XLSX documents. These are potentially very large (>1GB), compressed via Zip, and can be encoded with either UTF-8 or UTF-16. Internally, the XML files can be split into "parts", which are like Zip files split into multiple archives. A compliant decoder has to be able to: append streams, decode forward-only, do XML encoding detection, and stitch together XML text fragments into a single "character stream" to do matching on.

Now imagine writing a high-performance "virtual appliance" that does regular-expression based "data loss prevention" scanning of documents passing through it at 40 Gbps. In principle, this is not all that different to the ripgrep use-case, and the code ought to look similar.

1 Like

It really doesn't. The transcoding is itself handled by a separate crate, and the shim itself isn't specific to ripgrep and could be lifted into a separate crate. Any enterprising individual could accomplish that. ripgrep used to be much more monolithic, and I've been steadily moving pieces out into separate crates. The UTF-16 shim is one such candidate for moving into a separate crate, but nobody has put in the work to do it.

That's false. UTF-16 is a variable width encoding (not all Unicode codepoints are representable via a single UTF-16 code unit), and I still need to transcode it to UTF-8 in order to search it. The regex engine could natively support UTF-16, but that has nothing to do with the definition of the Read trait and is a huge complication for very little gain. It's much simpler to just transcode.
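(For concreteness, a tiny demonstration of the variable-width point - a single codepoint outside the Basic Multilingual Plane takes two u16 code units:)

fn main() {
    let crab = '🦀'; // U+1F980, outside the Basic Multilingual Plane
    let mut buf = [0u16; 2];
    assert_eq!(crab.encode_utf16(&mut buf).len(), 2); // two UTF-16 code units
    assert_eq!(crab.len_utf8(), 4); // and four bytes in UTF-8
}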

Which, again, could be shared with some effort. This is the premise of the Rust ecosystem: a small std library with a very low barrier to using crates in the ecosystem.

No. The shim is doing buffered reading. Specifically, if the shim is wrapped around a fs::File, then:

  1. UTF-16 encoded bytes are copied to an internal buffer directly from a read syscall (kernel to user).
  2. Transcoding is performed from the bytes in the internal buffer to the caller's buffer directly.

A perusal of the code makes it look like an additional copy is happening, but in practice, this copy is just rolling a small number of bytes from the end of the buffer to the beginning of the buffer that either couldn't fit in the caller's buffer or represent an incomplete UTF-16 sequence.
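(A hypothetical helper showing just the "rolling" step described here - move the unconsumed tail to the front of the buffer so the next read can append after it:)

// Move the leftover bytes starting at `start` to the front of the buffer,
// returning how many bytes were kept.
fn roll_leftover(buf: &mut Vec<u8>, start: usize) -> usize {
    let leftover = buf.len() - start;
    buf.copy_within(start.., 0);
    buf.truncate(leftover);
    leftover
}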

No. The Read trait is just an OS independent interface that loosely describes how to read data. For example, when reading from a File, the buffer provided to the read method is going to be written to directly by the OS. That's as little possible copying as you can do. To do better, you need to go into kernel land or use memory maps.

You're conflating concepts here. The additional copying is only necessary because I'm doing transcoding and because I wanted buffered reading. The extra copy from the transcoding could be avoided if the regex engine supported searching UTF-16 encoded bytes directly, but it doesn't. And again, this has nothing at all to do with the Read trait and everything to do with implementation details of how the regex engine was built.

(The extra copy here is also a red herring. The transcoding itself is the bottleneck.)

But ripgrep already does this, because Read implementations are composable:

$ cat sherlock
For the Doctor Watsons of this world, as opposed to the Sherlock
Holmeses, success in the province of detective work must always
be, to a very large extent, the result of luck. Sherlock Holmes
can extract a clew from a wisp of straw or a flake of cigar ash;
but Doctor Watson has to have it taken out for him and dusted,
and exhibited clearly, with a label attached.
$ iconv -f UTF-8 -t UTF-16 sherlock > sherlock-utf16

$ rg Watson sherlock
1:For the Doctor Watsons of this world, as opposed to the Sherlock
5:but Doctor Watson has to have it taken out for him and dusted,

$ rg Watson sherlock-utf16
1:For the Doctor Watsons of this world, as opposed to the Sherlock
5:but Doctor Watson has to have it taken out for him and dusted,

$ gzip sherlock-utf16
$ rg -z Watson sherlock-utf16.gz
1:For the Doctor Watsons of this world, as opposed to the Sherlock
5:but Doctor Watson has to have it taken out for him and dusted,

How do you think this works? There's a shim for doing gzip decompression, just like for UTF-16 transcoding. These shims don't know about each other but compose perfectly fine. This is the first time I've even bothered to try searching gzip compressed UTF-16, and it "just worked."

Yes, ripgrep contains these shims, but that's just because nobody has productionized them. This doesn't mean Rust's standard library has to do it, or even that the Read trait needs to change for this to happen. Somebody just needs to put in the work, and that's true regardless of whether it lives in std or in a crate.

I don't see any reason why the presence of read_to_string prevents the use cases you're talking about.

There are certainly a lot of moving parts here, but I don't see any reason why the Rust ecosystem isn't well suited to solve a problem like this. The interesting bits are building compliant decoders and supporting routines that can search character streams (which in the general case is always going to be slow). The Read trait isn't going to prevent you from doing that.

13 Likes

How would you implement this for a Read over f32? What if you're trying to treat incoming measurement data as a stream of numbers, e.g.: for DSP-style programming? You can always use unimplemented!() or panic!(), but that's really icky because then libraries all over the place will have to include code that can crash the process.

Or, you could always reinvent the general concept of streams for you special case.

Either way, eww...

There are certainly a lot of moving parts here...

There doesn't have to be!

Your 300 lines of BOM-peeker code never needed to exist in the first place. If Read were designed more like System.IO.Pipelines and separated the "get a buffer" and "consume input items" concepts, then the BOM detection code would be hilariously trivial.

It would look vaguely like the following:

// Sketch only: Read2 is the hypothetical "peek, then consume" trait proposed above.
// The arms produce different concrete types, so this sketch boxes the result.
fn bom_detection(mut source: impl Read2<Item = u8>) -> Box<dyn Read2<Item = u8>> {
    // The reader provides the buffer, and we specify the *minimum* required elements (bytes).
    if let Ok(buf) = source.read(3) {
        match buf {
            // A variable number of bytes can be consumed!
            [0xEF, 0xBB, 0xBF, ..] => { source.consume(3); Box::new(source) }
            [0xFE, 0xFF, ..] => { source.consume(2); Box::new(UCS16BigEndianConverter::new(source)) }
            [0xFF, 0xFE, ..] => { source.consume(2); Box::new(UCS16LittleEndianConverter::new(source)) }
            // Pass-through works trivially because we're not forced to consume any of the bytes!
            _ => Box::new(source),
        }
    } else {
        Box::new(source)
    }
}

Note the similarity with PEG-based parsers such as the nom crate, which use a similar "peek-and-consume-if-matched" pattern.

Similarly, zero-copy I/O doesn't have to be complicated, but the only way to get it is for read() to provide the buffer to the consumer instead of the consumer passing in a buffer to be filled. These are fundamentally opposite concepts, and the latter can never support the zero-copy scenario in the general case.

Your ripgrep utility gets to cheat with the special case of memory-mapping files, but this just doesn't work for network streams.

Think about it: where is the network stack going to put the data in between read() calls if read() is providing it the buffers... one at a time? The only way to do this is to give the network stack a bunch of buffers that it can fill itself. The user-mode code can then consume some of the buffers, process them, and return them to the pool while the rest of the buffers are being filled behind the scenes by RDMA.
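(To make the shape of that concrete, here is a purely illustrative buffer-pool skeleton - none of these types exist in std, tokio, or any real RDMA binding; the point is only that buffers are registered up front, loaned out to be filled, and returned after processing:)

use std::collections::VecDeque;

// Purely illustrative buffer pool.
struct BufferPool {
    free: VecDeque<Vec<u8>>,
}

impl BufferPool {
    fn new(count: usize, size: usize) -> Self {
        Self { free: (0..count).map(|_| vec![0u8; size]).collect() }
    }

    // The network stack would take buffers from the pool and fill them behind the scenes...
    fn take(&mut self) -> Option<Vec<u8>> {
        self.free.pop_front()
    }

    // ...and the consumer returns each buffer to the pool once it has processed the contents.
    fn put_back(&mut self, buf: Vec<u8>) {
        self.free.push_back(buf);
    }
}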

It's a push-vs-pull API difference that can never be reconciled. You have to do one or the other. This should have been foreseen, but wasn't. The Windows Network Direct API has been around since 2010, and IIRC Linux actually beat them to it by several years because of the pervasive use of this type of programming in HPC clusters. Both revolve around providing a pool of buffers up front.

I'm not saying that the Rust team needed to implement HPC RDMA I/O from day #1, but a tiny bit of foresight is all it takes. The difference is literally one fn versus two in the Read trait.

So all I'm saying is that the design of "all streams can be thought of as copying into a client-provided byte buffer, which is basically UTF-8 enough of the time for this convenience method to be present" is just wrong. No amount of wishful thinking will ever make this the general case. Meanwhile, the general case smoothly supports that scenario, as well as integer token streams produced by lexers, f32 streams passed to DSPs, RDMA at 100 Gbps, and so on, and so forth...

I think one key point of the discussion here is that Rust provides a thin runtime which is close to the lowest-common-denominator OS API (in this case POSIX's byte-oriented IO), whereas C# provides a thick runtime which is close to programmer use cases (in this case structured IO). As much as I dislike this dichotomy, this is the textbook difference between a low-level programming language (annoying but predictable) and a high-level programming language (comfy but uncontrollable).

Now, of course, we could dream of an ideal world in which our programmer comfort would not be disturbed by crufty API design from the 70s. But if we cannot get that, the next best choice is to cater to both application devs and system devs using different tools. I would say that this is why having both Rust and C# is a good thing.

10 Likes