Rust beginner notes & questions

Have you never written a parser?

What happens when you read a stream of bytes that's actually UTF-16 encoded?

You get a stream of 16-bit code units. Not bytes.

Then if you wish to parse this further with a lexer, you'll get a stream of tokens, typically 32-bit integers. Not bytes.
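To make the first step concrete, here is a minimal sketch of going from raw bytes to 16-bit code units to characters. The helper names `utf16le_units` and `units_to_string` are made up for illustration, and little-endian input is assumed; only `u16::from_le_bytes` and `char::decode_utf16` come from the standard library.

```rust
/// Pairs up little-endian bytes into 16-bit UTF-16 code units.
/// A trailing odd byte, if any, is ignored by `chunks_exact`.
fn utf16le_units(bytes: &[u8]) -> Vec<u16> {
    bytes
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
        .collect()
}

/// Decodes code units into chars, replacing unpaired surrogates with U+FFFD.
fn units_to_string(units: &[u16]) -> String {
    char::decode_utf16(units.iter().copied())
        .map(|r| r.unwrap_or(char::REPLACEMENT_CHARACTER))
        .collect()
}
```

Note that the item type changes at each layer (`u8`, then `u16`, then `char`), and that two code units can collapse into a single `char` when they form a surrogate pair.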

Not everything is a byte; that's why we have strongly typed languages.

Not everything that streams large chunks of contiguous data around is a POSIX file handle and returns 32-bit integer I/O error codes.

In my mind, the ideal trait inheritance hierarchy ought to look something like the following:

// A stream is just a "fat" iterator.
pub trait Read : Iterator {
    type Error=();

    // Shamelessly copying the C# Pipeline concept here
    fn read( &mut self, required_items: usize = 0 ) -> Result<&[Self::Item],Self::Error>;
    
    // Ditto.
    fn consume( &mut self, items_used: usize );
    
    // A stream *really is* an Iterator, allowing fn next() to have a default impl in terms of stream functions!
    // Now if "impl trait" was used in Iterator's fns, Read could *specialise* things like fn peekable() and the like
    // with versions optimised for streams...
    fn next(&mut self) -> Option<Self::Item> {
        if let Ok(b) = self.read( 1 ) {
            self.consume( 1 );
            return Some(b[0]);
        }
        else {
            return None;
        }
    }
}

pub trait AsyncRead : Read { 
    // ... Futures-based async versions of fn read() go here ...
}

// Defaults to bytes, but doesn't force it!
pub trait IORead<Item=u8,Error=i32> : AsyncRead {

    fn close( &mut self );

    fn seek( &mut self, position: u64 );
    
    // ... other functions that are more specific to file descriptors / handles ...
}
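The traits above lean on things stable Rust doesn't have (default argument values, associated type defaults), so here is a rough sketch of just the read/consume core that should compile today. The names `BufStream`, `SliceStream`, `EndOfStream` and `next_item` are hypothetical, picked only for this example, and the `Item: Copy` bound is an extra assumption so that items can be handed out by value. For comparison, `std::io::BufRead` already offers the same split via `fill_buf` and `consume`, but only for bytes.

```rust
/// Hypothetical stand-in for the `Read` trait sketched above,
/// generic over the item type instead of hard-wired to `u8`.
pub trait BufStream {
    type Item: Copy; // assumption: items are cheap to copy out of the buffer
    type Error;

    /// Returns the buffered items without consuming them, or an error
    /// if fewer than `required_items` are available.
    fn read(&mut self, required_items: usize) -> Result<&[Self::Item], Self::Error>;

    /// Marks `items_used` items as consumed.
    fn consume(&mut self, items_used: usize);

    /// Iterator-style `next`, written purely in terms of `read` and `consume`.
    fn next_item(&mut self) -> Option<Self::Item> {
        let item = *self.read(1).ok()?.first()?;
        self.consume(1);
        Some(item)
    }
}

/// Error returned when the stream cannot supply the requested item count.
#[derive(Debug, PartialEq)]
pub struct EndOfStream;

/// Simplest possible implementation: a stream over an in-memory slice.
pub struct SliceStream<'a, T> {
    data: &'a [T],
    pos: usize,
}

impl<'a, T> SliceStream<'a, T> {
    pub fn new(data: &'a [T]) -> Self {
        SliceStream { data, pos: 0 }
    }
}

impl<'a, T: Copy> BufStream for SliceStream<'a, T> {
    type Item = T;
    type Error = EndOfStream;

    fn read(&mut self, required_items: usize) -> Result<&[T], EndOfStream> {
        let rest = &self.data[self.pos..];
        if rest.len() < required_items {
            Err(EndOfStream)
        } else {
            Ok(rest)
        }
    }

    fn consume(&mut self, items_used: usize) {
        // Clamp so that over-consuming cannot move past the end of the data.
        self.pos = (self.pos + items_used).min(self.data.len());
    }
}
```

A `SliceStream<u32>` of lexer tokens and a `SliceStream<u8>` of raw bytes go through exactly the same interface, which is the point of not fixing the item type.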

Now imagine that you want to parse an XML file with an unknown encoding. Right now, this is... icky in most languages, because you have to read a chunk of the header, try various encodings to find the bit that says what encoding the file is in, then restart from the beginning using a wrapper that converts from bytes to characters. But you've already read a bunch of bytes, so now what? Not all streams are rewindable!

With something like the new C# Pipeline I/O API, the low-level parser would start off with a Read<Item=u8>, make the encoding decision, and then the high-level XML parser could use Read<Item=char>. The encoding switch at the beginning would be very neat because you just don't call consume(). This would work fine even on forward-only streams such as a socket returning compressed data.
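Here is a small sketch of that "peek, decide, don't consume" step. It is heavily simplified: it only looks for a byte order mark rather than parsing the XML declaration, it treats "no BOM" as UTF-8 (an assumption made for the example), and `ByteStream`, `Encoding`, `sniff` and `decode_rest` are hypothetical names, not part of any real API.

```rust
/// Hypothetical forward-only byte source with a read/consume split.
struct ByteStream<'a> {
    data: &'a [u8],
    pos: usize,
}

impl<'a> ByteStream<'a> {
    fn new(data: &'a [u8]) -> Self {
        ByteStream { data, pos: 0 }
    }

    /// Peeks at the unconsumed bytes without advancing.
    fn read(&mut self) -> &[u8] {
        &self.data[self.pos..]
    }

    /// Advances past `n` bytes, clamped to the end of the data.
    fn consume(&mut self, n: usize) {
        self.pos = (self.pos + n).min(self.data.len());
    }
}

#[derive(Debug, PartialEq, Clone, Copy)]
enum Encoding {
    Utf8,
    Utf16Le,
    Utf16Be,
}

/// Inspects the buffered bytes and consumes only the BOM, if there is one.
/// Everything else stays in the stream for the decoder.
fn sniff(s: &mut ByteStream) -> Encoding {
    let (enc, bom_len) = match s.read() {
        [0xEF, 0xBB, 0xBF, ..] => (Encoding::Utf8, 3),
        [0xFF, 0xFE, ..] => (Encoding::Utf16Le, 2),
        [0xFE, 0xFF, ..] => (Encoding::Utf16Be, 2),
        _ => (Encoding::Utf8, 0), // assumption: no BOM means UTF-8
    };
    s.consume(bom_len);
    enc
}

/// Decodes whatever is left in the stream into characters.
fn decode_rest(s: &mut ByteStream, enc: Encoding) -> String {
    let bytes = s.read();
    let text: String = match enc {
        Encoding::Utf8 => String::from_utf8_lossy(bytes).into_owned(),
        Encoding::Utf16Le | Encoding::Utf16Be => {
            let units = bytes.chunks_exact(2).map(|p| match enc {
                Encoding::Utf16Be => u16::from_be_bytes([p[0], p[1]]),
                _ => u16::from_le_bytes([p[0], p[1]]),
            });
            char::decode_utf16(units)
                .map(|r| r.unwrap_or(char::REPLACEMENT_CHARACTER))
                .collect()
        }
    };
    let n = bytes.len();
    s.consume(n);
    text
}
```

When no BOM is found, `sniff` consumes nothing, so the decoder still sees the stream from its first byte and no rewind is needed.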

Similarly, if the String type was instead a trait that &[char] mostly implemented, zero-copy parsers would be fairly straightforward with this overall approach...

Behind the scenes, advanced implementations could keep pools of buffers and use scatter/gather I/O for crazy performance. The developer wouldn't even have to know...

This is what the new C# I/O API is trying to do, but it's not using the power of generic programming to the same level that Rust could. Compare the C# IEnumerator<T> interface to the Rust Iterator trait. It's night & day!
