Rust beginner notes & questions

I’m curious if you’ve been able to get over your std::io::Read gripe and look at Rust some more, beyond your initial post in this thread.

I also think it’s incredibly unrealistic to expect a language and/or its stdlib to be flawless, regardless of how many have come before. Doubly so if you actually intend for people to use it rather than sit in someone’s imagination.

2 Likes

Perhaps you could start an RFC and we could all iterate on it to create a "proper" Stream API for Rust? I think you've pointed out a lot of really good ideas and points that should be addressed. I pretty much agree with your analysis, though I would not have had the foresight to categorize the issues so succinctly.

1 Like

Rust has many warts. Its feature-set does appear less cohesive and consistent when compared to some languages (C# comes to mind). There are countless other problems mainly due to its youth. Then there's the borrow checker. So feeling negative emotions is almost a rite of passage for a Rust beginner. I don't want to belittle your feelings, but in my experience, once you get past this initial despair and start coding for production instead of writing toy programs, you come to appreciate what a life saver this language is and how brilliant it is. This is because Rust's strengths, which far outweigh its faults, unfortunately become apparent only once you've done real-world work in it. And that'll perhaps forever be Rust's curse.

8 Likes

C# is a very high bar, it's one of the best designed languages out there (still, I think Rust is better for my usages).

3 Likes

I love Rust - let me preface with that.

Unfortunately, I think the tears at the seams (i.e. seeming incohesion/inconsistency) are visible at both the early/beginner stage and also at a later stage, although they're for different reasons. I am very hopeful that, over time, they'll be ironed out to a point where they're barely noticeable.

That said, no language is perfect. Rust is doing something novel, certainly so for any language that can be called mainstream. I think it's understandable that it'll have some growing pains, both at the lang and stdlib levels. There's just no way around it. I think the community's (and the Rust core team's) priorities align very well with mine (i.e. robustness, expressiveness, correctness to borderline pedantry, and performance).

Rust will definitely not be for everyone, just like no other language is universally praised or liked. The people that will like it are the ones holding the same core values as Rust and willing to put in the time to learn it, with all its quirks and idiosyncrasies.

4 Likes

Absolutely. I would say the same about C#. That said, one thing that can be mentioned in Rust's defense is that C# has GC, which makes a lot of decisions easier. We only have to look at the state of the art prior to Rust when it comes to non-GC languages to appreciate that Rust is a huge improvement. That said (recursively), not all of Rust's difficulty or unsightly parts stem from the memory-management challenge it has set for itself.

4 Likes

Yup. The rough edges don't really go away with experience. However, you kinda learn to live with them since you get so much in return.

I also happen to really like C#, and used to use it quite extensively. But, it's also not perfect :slight_smile:. The bifurcation of reference vs value types is there, and there are some footguns with using value types. Some of you might remember how lambdas used to desugar in for loops, capturing the value only for the last iteration in some cases (that was fixed at some point). The .NET standard lib used to be extremely allocation happy and it was a challenge to write performant code. Although methods were sealed by default, classes weren't which ought to have been the better default. C# made the same mistake as Java of having a volatile field keyword, which is completely backwards in today's thinking - it's not the field that's volatile, but accesses to it (and each of those may have their own memory ordering requirements to boot). There was no good memory model for the language (not sure if there is one now, on par with say Java's memory model). null is still present. Exceptions are unchecked, which is fine if you hate incorrect usage of Java style checked exceptions, but makes it incredibly difficult to write robust code. And so on.

7 Likes

It's definitely not all, but it's very close to it I think. Particularly if you generalize "memory management" to soundness. A lot of the difficulty is being a low level language trying to marry high level features while being sound at compile time. That's a very hefty (and praiseworthy!) goal.

4 Likes

I think to call C# a really well-designed language is a bit of an overstatement. As far as I'm concerned (and I like both Java and C# for what they are) it is just Java with a little better support for value types. They frankly got it wrong with exceptions (as you mentioned). They got it wrong in how they handle "null" (as you mentioned). They got it wrong wrt volatile (as you mentioned). It really is only a marginal improvement, at best, over Java (and even that is debatable). I really can't see much real advantage of C# over Java for most cases. Java tends to push as much as possible to libraries/JDK whereas C# tends to incorporate new language features more often, but I really don't see that one is necessarily that much better or worse than the other. I prefer Java exception handling over C# exception handling, but I like the Rust way of error/alternative handling even better. I think Rust is getting A LOT of things right, but there is definitely room for improvement, and the comments by the OP can help inform the discussion (even if they do, at first reading, come off a little snide or combative).

5 Likes

IMO, C# is a well-designed language. Is it perfect? No, as mentioned. But I don't know any perfect language. I've not followed it too closely in the last few years, but I recall in the beginning there was nice consistency and "flow" to features added in version N and how they enabled something else in version N+1. There's a lot right about C# if you don't mind a GC/JIT/managed runtime.

To call C# "just Java with a little better support for value types" is ... disingenuous at best :slight_smile:. I really don't want to sidetrack this thread into Java vs C# (or Rust vs C#, for that matter), so I'll stop here. But I've used C# and Java extensively, and their comparison ends right on the surface for me.

5 Likes

std::io::Read doesn't force UTF-8. In fact, it does not imply any encoding: it's just a stream of bytes. It can be a UTF-8 encoded text file from local disk, EUC-KR encoded HTML from a gunzip stream, or even a JPEG-encoded picture of a kitten from the internet.

Read is a low-level abstraction for I/O. It only cares about bytes, because everything in memory is bytes! An iterator over arbitrary typed items, which should be a std::iter::Iterator, can be constructed on top of it.

I think what gives you this impression is std::io::BufRead::read_line(), which assumes that the input stream is UTF-8 encoded. This is just a simple shortcut for the common case, as most streams we handle line by line are UTF-8 encoded. But if that's not your case, you can always bypass such high-level APIs and handle the bytes directly.
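
As a rough illustration of "handling bytes directly", the sketch below splits a stream into lines with `BufRead::read_until`, which never inspects the encoding. The helper name `read_raw_lines` is made up for this example, and it assumes an encoding in which the byte `0x0A` always means a newline (true for UTF-8 and EUC-KR, not for UTF-16).

```rust
use std::io::{self, BufRead};

/// Hypothetical helper: splits a byte stream into lines without assuming
/// any text encoding. Each line is returned as raw bytes, with the
/// trailing `\n` removed.
fn read_raw_lines<R: BufRead>(mut reader: R) -> io::Result<Vec<Vec<u8>>> {
    let mut lines = Vec::new();
    loop {
        let mut line = Vec::new();
        // `read_until` works on bytes, so invalid UTF-8 is not an error here.
        let n = reader.read_until(b'\n', &mut line)?;
        if n == 0 {
            break; // end of stream
        }
        if line.last() == Some(&b'\n') {
            line.pop();
        }
        lines.push(line);
    }
    Ok(lines)
}
```

By contrast, `read_line` on the same input returns an `InvalidData` error as soon as it meets a line that is not valid UTF-8.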

2 Likes

Read in std::io - Rust is what doesn't really belong there, but I suspect was added as a convenience. An implementation that doesn't have UTF8 strings internally can return an error for that method, but that method ought to not be there in the first place.

1 Like

I honestly wish I could do exactly that, but I don't use Rust enough to really contribute meaningfully. I've dabbled with it just long enough to determine that it won't help me in any future projects.

Right now, for the kind of work I'm doing, the runtime overhead of C# for me is relatively unimportant compared to its productivity, which is the best of any language I've personally used. My next step up would be switching to F# on dotnet core, as that would both boost my productivity and performance significantly. The extra 20%-50% runtime performance from Rust just isn't worth it compared to the drop in productivity.

For example, Rust Windows interop is... not pretty right now. There just isn't the same kind of pre-packaged, ready-to-use wrappers around the Win32 APIs that C# has. Does it have the ability to call COM yet? DCOM+? Can you create a socket server with Active Directory Kerberos authentication? Can I validate a certificate against the machine trust store? Last time I checked, there were blocking issues for most of my use-cases, and to be honest I gave up after getting bogged down in all the niggling little issues related to UTF-16 string handling.

At the end of the day, 90% of desktops are still Windows, and well over 50% of all enterprise servers run it too. Rust is very Linux/POSIX centric. All the performance or safety in the world doesn't help if I can't get off the ground and make productive progress on a useful project...

2 Likes

Regarding Read and UTF-16: you can always write an extension trait which implements convenience UTF-16 methods while using raw byte I/O under the hood. Should UTF-16 methods, or methods which accept different encodings, be in std? Personally I don't think so, but it's a good idea for a crate. (Maybe it already exists?)
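
A minimal sketch of what such an extension trait could look like, using only std. The names `ReadUtf16Ext` and `read_utf16le_to_string` are hypothetical, and the sketch assumes little-endian input with no byte-order mark handling.

```rust
use std::io::{self, Read};

/// Hypothetical extension trait: adds a UTF-16 convenience method to
/// every type that implements `Read`.
trait ReadUtf16Ext: Read {
    /// Reads the whole stream as little-endian UTF-16 and returns a `String`.
    fn read_utf16le_to_string(&mut self) -> io::Result<String> {
        // Raw byte I/O under the hood.
        let mut bytes = Vec::new();
        self.read_to_end(&mut bytes)?;
        if bytes.len() % 2 != 0 {
            return Err(io::Error::new(
                io::ErrorKind::InvalidData,
                "odd number of bytes in UTF-16 stream",
            ));
        }
        // Reassemble 16-bit code units, then let std validate and transcode.
        let units: Vec<u16> = bytes
            .chunks_exact(2)
            .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
            .collect();
        String::from_utf16(&units)
            .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
    }
}

// Blanket impl: the method becomes available on files, sockets, slices, etc.
impl<R: Read + ?Sized> ReadUtf16Ext for R {}
```

With the trait in scope, any reader gains the method, e.g. `file.read_utf16le_to_string()?`. Reading everything into memory first keeps the sketch short; a real crate would more likely decode incrementally.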

Have you never written a parser?

What happens when you read a stream of bytes that's actually UTF-16 encoded?

You get a stream of 16-bit code units. Not bytes.

Then if you wish to parse this further with a lexer, you'll get a stream of tokens, typically 32-bit integers. Not bytes.

Not everything is a byte, that's why we have strongly typed languages.

Not everything that streams large chunks of contiguous data around is a POSIX file handle and returns 32-bit integer I/O error codes.

In my mind, the ideal trait inheritance hierarchy ought to look something like the following:

// A stream is just a "fat" iterator.
pub trait Read : Iterator {
    type Error=();

    // Shamelessly copying the C# Pipeline concept here
    fn read( &mut self, required_items: usize = 0 ) -> Result<&[Self::Item],Self::Error>;
    
    // Ditto.
    fn consume( &mut self, items_used: usize );
    
    // A stream *really is* an Iterator, allowing fn next() to have a default impl in terms of stream functions!
    // Now if "impl trait" was used in Iterator's fns, Read could *specialise* things like fn peekable() and the like
    // with versions optimised for streams...
    fn next(&mut self) -> Option<Self::Item> {
        if let Ok(b) = self.read( 1 ) {
            self.consume( 1 );
            return Some(b[0]);
        }
        else {
            return None;
        }
    }
}

pub trait AsyncRead : Read { 
    // ... Futures-based async versions of fn read() go here ...
}

// Defaults to bytes, but doesn't force it!
pub trait IORead<Item=u8,Error=i32> : AsyncRead {

    fn close( &mut self );

    fn seek( &mut self, position: u64 );
    
    // ... other functions that are more specific to file descriptors / handles ...
}
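
The traits above are a wish list and do not compile as written (default arguments and associated type defaults are not available in stable Rust). Purely as an illustration, the read/consume core of the idea can be approximated on stable Rust as below. The names `ItemRead`, `next_item` and `SliceSource` are invented for this sketch, and the `Iterator` supertrait is left out to keep it small.

```rust
/// Hypothetical trait, loosely following the read/consume shape above.
trait ItemRead {
    type Item;
    type Error;

    /// Returns at least `required_items` buffered items without consuming them.
    fn read(&mut self, required_items: usize) -> Result<&[Self::Item], Self::Error>;

    /// Marks `items_used` items as consumed.
    fn consume(&mut self, items_used: usize);

    /// Iterator-style convenience built on top of `read` and `consume`.
    fn next_item(&mut self) -> Option<Self::Item>
    where
        Self::Item: Copy,
    {
        // Copy the item out first so the borrow ends before `consume`.
        let item = *self.read(1).ok()?.first()?;
        self.consume(1);
        Some(item)
    }
}

/// Toy source backed by an in-memory slice of any item type.
struct SliceSource<'a, T> {
    data: &'a [T],
}

impl<'a, T> ItemRead for SliceSource<'a, T> {
    type Item = T;
    type Error = ();

    fn read(&mut self, required_items: usize) -> Result<&[T], ()> {
        if self.data.len() < required_items {
            Err(()) // not enough items left
        } else {
            Ok(self.data)
        }
    }

    fn consume(&mut self, items_used: usize) {
        let n = items_used.min(self.data.len());
        self.data = &self.data[n..];
    }
}
```

Because `read` does not advance the stream, calling it twice without `consume` returns the same items, which is the property the encoding-switch argument below relies on.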

Now imagine that you want to parse an XML file with an unknown encoding. Right now, this is... icky in most languages, because you have to read a chunk of the header, try various encodings to find the bit that says what encoding the file is in, then restart from the beginning using a wrapper that converts from bytes to characters. But you've already read a bunch of bytes, so now what? Not all streams are rewindable!

With something like the new C# Pipeline I/O API, the low-level parser would start off with a Read<Item=u8>, make the encoding decision, and then the high-level XML parser could use Read<Item=char>. The encoding switch at the beginning would be very neat because you just don't call consume(). This would work fine even on forward-only streams such as a socket returning compressed data.
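
For bytes specifically, std's `BufRead` already has this peek-then-consume shape in `fill_buf` and `consume`. The sketch below sniffs a byte-order mark without consuming anything. The `Encoding` enum and `sniff_bom` function are made up for this example, and it assumes the first `fill_buf` call yields at least two bytes, which std does not guarantee in general.

```rust
use std::io::{self, BufRead};

/// Hypothetical result type for the sniffing step.
#[derive(Debug, PartialEq)]
enum Encoding {
    Utf8,
    Utf16Le,
    Utf16Be,
}

/// Peeks at the buffered bytes and guesses the encoding from a byte-order
/// mark. Nothing is consumed, so whoever reads next still sees the stream
/// from the very first byte, even on a forward-only source.
fn sniff_bom<R: BufRead>(reader: &mut R) -> io::Result<Encoding> {
    let buf = reader.fill_buf()?;
    Ok(if buf.starts_with(&[0xFF, 0xFE]) {
        Encoding::Utf16Le
    } else if buf.starts_with(&[0xFE, 0xFF]) {
        Encoding::Utf16Be
    } else {
        // Assumption for this sketch: no UTF-16 BOM means UTF-8.
        Encoding::Utf8
    })
}
```

After the decision is made, the caller can skip the mark with `reader.consume(2)` or leave it for the decoder to handle.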

Similarly, if the String type was instead a trait that &[char] mostly implemented, zero-copy parsers would be fairly straightforward with this overall approach...

Behind the scenes, advanced implementations could keep pools of buffers and use scatter/gather I/O for crazy performance. The developer wouldn't even have to know...

This is what the new C# I/O API is trying to do, but it's not using the power of template programming to the same level that Rust could. Compare the C# IEnumerator<T> interface to the Rust Iterator trait. It's night & day!

3 Likes

In tokio land, you'd implement this with a Decoder layered on top of a raw byte stream (at the lowest level, this is always the type of the stream). The decoder would turn the bytes into whatever higher level type you want, and consumers would work off streams that are decoded underneath. This is all type-safe and uses generics extensively, so it gets the optimization/codegen benefits of that. You can then also take a decoded stream/sink and split it into read and write halves, if you want to operate on the two directions separately. Perhaps you can look at tokio/futures and see if you like it better.

2 Likes

How do you plan to represent UTF-8 in such an approach?

Your posts seem to be written under the assumption of fixed-size encodings. I understand that you come from the Windows world, but Rust has made a conscious decision to use UTF-8 as the main string encoding, and supporting all other kinds of encodings in std would just lead to bloat. And if I understood your proposal correctly, it would result in needless complexity in a lot of code.

Do we need a Windows-oriented ecosystem of crates? Yes, of course. Rust provides excellent tools for developing them. But I don't think it's reasonable to expect drastic changes to the Rust core which would make Windows developers a bit happier but create a ton of problems for others.

2 Likes

The problem with an Iterator-only approach is that it doesn't scale well to the low level. Rust is a systems programming language. A common scenario for such I/O is to memcpy incoming bytes from an OS-managed buffer into my own, and then parse that byte array to produce meaningful types. How can we model this operation with Iterator? Copying memory byte by byte is over 10 times slower than memcpy. Exposing a slice of the internal buffer has lifetime issues, as this buffer should be reused. Vec implies a heap allocation for every read(), which is not acceptable.
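
To make the two styles concrete, here is a small sketch using only std. Both functions (their names are invented here) compute the same byte sum: the first pulls data in bulk into one reused stack buffer through `Read::read`, the second goes one item at a time through the iterator returned by `Read::bytes`. It makes no claim about exact speed differences, which depend on the reader and on buffering.

```rust
use std::io::{self, Read};

/// Bulk style: the reader copies whole chunks into a buffer we own and reuse.
fn checksum_bulk<R: Read>(mut reader: R) -> io::Result<u64> {
    let mut buf = [0u8; 8192]; // reused for every read, no heap allocation
    let mut sum = 0u64;
    loop {
        let n = match reader.read(&mut buf) {
            Ok(0) => break, // end of stream
            Ok(n) => n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        sum += buf[..n].iter().map(|&b| u64::from(b)).sum::<u64>();
    }
    Ok(sum)
}

/// Iterator style: every byte is a separate `io::Result<u8>` item.
fn checksum_bytewise<R: Read>(reader: R) -> io::Result<u64> {
    let mut sum = 0u64;
    for byte in reader.bytes() {
        sum += u64::from(byte?);
    }
    Ok(sum)
}
```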

ps. Rust's char type is a 4-byte integer, so that it can represent the full range of Unicode scalar values.
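
A quick way to see the difference between the in-memory size of `char` and its encoded width (the helper name `widths` is just for this example):

```rust
use std::mem::size_of;

/// Returns (size of `char` in memory, bytes needed in UTF-8,
/// 16-bit code units needed in UTF-16) for one character.
fn widths(c: char) -> (usize, usize, usize) {
    (size_of::<char>(), c.len_utf8(), c.len_utf16())
}
```

For instance, `widths('a')` is `(4, 1, 1)` while `widths('😀')` is `(4, 4, 2)`: neither UTF-8 nor UTF-16 is a fixed-width encoding.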

ps2. I did implement a parser a while ago. Check it out here :smiley: https://github.com/HyeonuPark/Nal

2 Likes

ripgrep supports searching either UTF-8 or UTF-16 seamlessly, via BOM sniffing. My Windows users appreciate this. The search implementation itself only cares about getting something that implements io::Read. UTF-16 handling works by implementing a shim for io::Read that transcodes UTF-16 to UTF-8. I did this in less than a day's worth of work and it was well worth it.
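
This is not ripgrep's actual code, just a std-only sketch of the general shape such a shim can take. The type name `Utf16LeToUtf8` is invented here, and the sketch assumes little-endian input, omits BOM sniffing, replaces unpaired surrogates with U+FFFD, and silently drops a dangling odd byte at end of input.

```rust
use std::io::{self, Read};

/// Hypothetical shim: reads UTF-16LE bytes from `inner`, hands out UTF-8 bytes.
struct Utf16LeToUtf8<R> {
    inner: R,
    raw: Vec<u8>, // input not decoded yet (odd byte or split surrogate pair)
    out: Vec<u8>, // decoded UTF-8 waiting to be handed out
    pos: usize,   // how much of `out` has been handed out
    eof: bool,
}

impl<R: Read> Utf16LeToUtf8<R> {
    fn new(inner: R) -> Self {
        Utf16LeToUtf8 { inner, raw: Vec::new(), out: Vec::new(), pos: 0, eof: false }
    }

    /// Pulls one chunk from `inner` and transcodes every complete code unit.
    fn refill(&mut self) -> io::Result<()> {
        let mut chunk = [0u8; 4096];
        let n = self.inner.read(&mut chunk)?;
        self.eof = n == 0;
        self.raw.extend_from_slice(&chunk[..n]);

        let mut units: Vec<u16> = self
            .raw
            .chunks_exact(2)
            .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
            .collect();
        // A trailing high surrogate may be completed by the next chunk,
        // so hold it back unless the input has ended.
        if !self.eof && matches!(units.last().copied(), Some(0xD800..=0xDBFF)) {
            units.pop();
        }
        let used = units.len() * 2;

        self.out.clear();
        self.pos = 0;
        for decoded in char::decode_utf16(units) {
            // Unpaired surrogates become U+FFFD instead of failing the read.
            let c = decoded.unwrap_or(char::REPLACEMENT_CHARACTER);
            self.out.extend_from_slice(c.encode_utf8(&mut [0u8; 4]).as_bytes());
        }
        self.raw.drain(..used);
        Ok(())
    }
}

impl<R: Read> Read for Utf16LeToUtf8<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        // Keep pulling input until there is decoded output or the input ends.
        while self.pos == self.out.len() && !self.eof {
            self.refill()?;
        }
        let n = buf.len().min(self.out.len() - self.pos);
        buf[..n].copy_from_slice(&self.out[self.pos..self.pos + n]);
        self.pos += n;
        Ok(n)
    }
}
```

Code that only knows about `io::Read`, such as `read_to_string` or a `BufReader`, can then consume `Utf16LeToUtf8::new(file)` as if the file had been UTF-8 all along.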

19 Likes