Idiomatic way to read chars

What's the idiomatic way to read chars (not graphemes) from a str when the code is a parser and the operation that gets the next char is called from many places? Should I create an iterator and call .next() directly? Is there some way to make a Reader that works on chars?

(Parsing here means parsing something similar to JSON, where '{', '[', ':' are recognized and take the parsing through a recursive state machine. No look-ahead required.)

Keeping an instance of the Chars iterator is the right solution. It's an efficient and direct way of getting chars.

It may be a bit awkward to use it from many places, because its lifetime is bound to its source string (you'll run into the self-referential struct problem if you try to bundle it with the String it borrows from in the same struct). You'll likely need to pass it as an argument.
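
For illustration, here is a minimal sketch of passing the iterator as an argument. The names `read_until` and `parse_pair` and the "key:value;" format are made up for this example and don't come from the question.

```rust
use std::str::Chars;

/// Hypothetical helper: consumes chars up to and including `delim` and
/// returns the text before it, or None if the input ends first.
fn read_until(chars: &mut Chars<'_>, delim: char) -> Option<String> {
    let mut out = String::new();
    while let Some(ch) = chars.next() {
        if ch == delim {
            return Some(out);
        }
        out.push(ch);
    }
    None
}

/// Hypothetical parser for "key:value;". The caller owns the iterator
/// and lends it to each helper in turn.
fn parse_pair(input: &str) -> Option<(String, String)> {
    let mut chars = input.chars();
    let key = read_until(&mut chars, ':')?;
    let value = read_until(&mut chars, ';')?;
    Some((key, value))
}
```

Because every helper borrows the same iterator mutably, each one picks up where the previous one stopped. `Chars::as_str()` returns the unconsumed remainder if you need it.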

Alternatively, if your important syntax only uses ASCII characters, it's safe to parse the string as bytes, since UTF-8 guarantees that no byte of a multi-byte (non-ASCII) character falls in the ASCII range. That will be the fastest solution, and pretty simple given that you can slice and index a byte slice directly.
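
A small sketch of the byte-based approach, assuming ':' is the only delimiter of interest; `split_at_colon` is a made-up helper name.

```rust
/// Hypothetical helper: splits `input` at the first ':' by scanning
/// bytes instead of chars.
fn split_at_colon(input: &str) -> Option<(&str, &str)> {
    let pos = input.as_bytes().iter().position(|&b| b == b':')?;
    // An ASCII byte is always a complete char in UTF-8, so `pos` and
    // `pos + 1` are char boundaries and slicing the str cannot panic.
    Some((&input[..pos], &input[pos + 1..]))
}
```

Non-ASCII text on either side of the delimiter passes through untouched, since none of its bytes can equal `b':'`.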

You can write an iterator that consumes bytes from an R: Read, interprets them as UTF-8, and yields chars:

use thiserror; // 1.0.43
use std::io::{Read, Bytes};

pub enum Utf8Reader<R>{
    Active(Bytes<R>),
    Fused
}

#[derive(thiserror::Error, Debug)]
pub enum Utf8ReadError {
    #[error("{0:?}")]
    IoErr(#[from] std::io::Error),
    #[error("Invalid UTF8 sequence")]
    DecodeErr(#[from] std::str::Utf8Error)
}

impl<R:Read> Utf8Reader<R> {
    pub fn new(source:R)->Self { Utf8Reader::Active ( source.bytes() ) }
}

impl<R:Read> Iterator for Utf8Reader<R> {
    type Item = Result<char, Utf8ReadError>;
    fn next(&mut self)->Option<Self::Item> {
        use Utf8Reader::*;
        
        let Active(source) = self else { return None; };
        
        let mut bytes = [0u8;4];

        for n in 0..4 {
            bytes[n] = match source.next() {
                None => {
                    // EOF
                    *self = Fused;
                    return match (n, std::str::from_utf8(&bytes[..n])) {
                        // EOF at char boundary
                        (0, _) => None,
                        
                        // Incomplete char at EOF
                        (_, Err(e)) => Some(Err(e.into())),
                        
                        // Returned in previous loop iteration
                        _ => unreachable!()
                    }
                },
                Some(Err(e)) => {
                    // I/O Error in reader
                    *self = Fused;
                    return Some(Err(e.into()));
                },
                
                // Byte available
                Some(Ok(b)) => b
            };
            
            match std::str::from_utf8(&bytes[..=n]) {
                // Complete char has been read
                Ok(s) => { return Some(Ok(s.chars().next().unwrap())); }
                
                // Invalid UTF-8 sequence in input
                Err(e) if e.error_len().is_some() => {
                    *self = Fused;
                    return Some(Err(e.into()));
                }
                _ => ()
            };
        }
        
        // 4 bytes is the maximum length of a UTF-8 sequence
        unreachable!() 
    }
}

Edit: It's also not that hard to manually decode the UTF-8, which is more error-prone but might be marginally more efficient as it doesn't require repeated decode attempts for multi-byte characters.
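
As a rough sketch of that idea (not a drop-in replacement for the iterator above): the leading byte tells you how long the sequence should be, so you can read that many bytes and validate once. `utf8_len` and `decode_first` are made-up names, and this version still leans on `std::str::from_utf8` for the final check rather than decoding the bits by hand.

```rust
/// Expected length of a UTF-8 sequence given its first byte, or None
/// if that byte cannot start a sequence.
fn utf8_len(first: u8) -> Option<usize> {
    match first {
        0x00..=0x7F => Some(1),
        0xC2..=0xDF => Some(2),
        0xE0..=0xEF => Some(3),
        0xF0..=0xF4 => Some(4),
        // Continuation bytes (0x80..=0xBF) and bytes that never occur
        // in valid UTF-8 (0xC0, 0xC1, 0xF5..=0xFF).
        _ => None,
    }
}

/// Decodes the first char of `bytes` with a single validation call.
/// Returns the char and the number of bytes it occupied.
fn decode_first(bytes: &[u8]) -> Option<(char, usize)> {
    let len = utf8_len(*bytes.first()?)?;
    let seq = bytes.get(..len)?; // None if the input is too short
    // from_utf8 still rejects overlong forms, surrogates, and bad
    // continuation bytes that the length table alone can't catch.
    let s = std::str::from_utf8(seq).ok()?;
    Some((s.chars().next()?, len))
}
```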


Edit 2: It feels a little bit odd that std provides something like this for decoding UTF-16, but not UTF-8.
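
For comparison, this is roughly what the UTF-16 side looks like with `char::decode_utf16`. The wrapper `utf16_to_string_lossy` is a made-up name for this example.

```rust
/// Hypothetical wrapper: decodes UTF-16 code units one char at a time,
/// substituting U+FFFD for unpaired surrogates.
fn utf16_to_string_lossy(units: &[u16]) -> String {
    char::decode_utf16(units.iter().copied())
        .map(|r| r.unwrap_or(char::REPLACEMENT_CHARACTER))
        .collect()
}
```

`decode_utf16` accepts any iterator of `u16` and yields `Result<char, DecodeUtf16Error>`, which is the shape the UTF-8 iterator above builds by hand.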

Yes. This seems to work. Starting the parse:

pub fn from_str(notation_str: &str) -> Result<LLSDValue, Error> {
    let mut cursor = notation_str.chars().peekable();
    parse_value(&mut cursor)
}

This gives me a peekable cursor, which simplifies parsing.

The parser, simplified. This will parse "[i123, i456]" into a parse tree (integers carry an 'i' prefix). There are more types than shown here. This is a format much like JSON, but it pre-dates JSON.

use std::iter::Peekable;
use std::str::Chars;
use anyhow::{anyhow, Error};

/// Parse one value - real, integer, map, etc. Recursive.
fn parse_value(cursor: &mut Peekable<Chars>) -> Result<LLSDValue, Error> {
    /// Parse "iNNN"
    fn parse_integer(cursor: &mut Peekable<Chars>) -> Result<LLSDValue, Error> {
        let mut s = String::with_capacity(20);  // pre-allocate; can still grow
        //  Accumulate numeric chars.
        while let Some(ch) = cursor.peek() {
            match ch {
                '0'|'1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9'|'+'|'-' => s.push(cursor.next().unwrap()),
                 _ => break
            }
        }
        //  Digits accumulated, use standard conversion
        Ok(LLSDValue::Integer(s.parse::<i32>()?))
    }
    
    /// Parse "[ value, value ... ]"
    /// At this point, the '[' has been consumed.
    /// At successful return, the ending ']' has been consumed.
    fn parse_array(cursor: &mut Peekable<Chars>) -> Result<LLSDValue, Error> {
        let mut array_items = Vec::new();
        //  Accumulate array elements.
        loop {
            //  Check for end of items
            consume_whitespace(cursor);
            if let Some(ch) = cursor.peek() {
                match ch {
                    ']' => { let _ = cursor.next(); break } // end of array, may be empty.
                    _ => {}
                }
            }
            array_items.push(parse_value(cursor)?);          // parse next value
            //  Check for comma indicating more items.
            consume_whitespace(cursor);
            if let Some(ch) = cursor.peek() {
                match ch {
                    ',' => { let _ = cursor.next(); }   // continue with next field
                    _ => {}
                }
            }
            
        }
        Ok(LLSDValue::Array(array_items))               // return array
    }
    
    /// Consume whitespace. Next char will be non-whitespace.
    fn consume_whitespace(cursor: &mut Peekable<Chars>) {
        while let Some(ch) = cursor.peek() {
            match ch {
                ' ' | '\n' => { let _ = cursor.next(); },                 // ignore leading white space
                _ => break
            }
        }       
    }

    //  Dispatch on the first non-whitespace character.
    consume_whitespace(cursor);                         // ignore leading white space
    if let Some(ch) = cursor.next() {
        match ch {
            '[' => { parse_array(cursor) }              // array
            'i' => { parse_integer(cursor) }            // integer
            _ => { Err(anyhow!("Unexpected character: {:?}", ch)) } // error
        }
    } else {
        Err(anyhow!("Premature end of string in parse"))  // error
    }
}

Your code can be made a lot more idiomatic, and it contained a number of actual bugs (primarily in parsing arrays) that I have fixed: Playground

  • When parsing arrays, your current code accepts items that are not separated by a comma. After parsing a value, you should check whether it is followed by a comma or a ], but not any other character.
  • When a bunch of functions take a &mut T for the same type T, it indicates that you should probably be using a custom type with methods.
  • You should use char_indices() instead of chars() in order to be able to report the location of any errors.
  • Using char_indices() also allows you to avoid collecting the characters of numeric values into a String, parsing a slice of the original string instead.
  • Instead of using peek() followed by .next().unwrap(), use pattern matching, if let, and/or Peekable::next_if().
  • Rather than matching on all individual characters for integers, use char::is_ascii_digit(). Similarly, use char::is_ascii_whitespace() for detecting and skipping whitespace.
  • LLSDValue should implement FromStr instead of (or at least in addition to) a free-standing from_str() function.
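
To make those points concrete, here is a condensed sketch along the same lines. It is not the Playground code: `Value` is a stand-in for LLSDValue with only two variants, `Parser` is a made-up type, and errors are plain Strings to keep the example dependency-free.

```rust
use std::iter::Peekable;
use std::str::{CharIndices, FromStr};

/// Hypothetical stand-in for the thread's LLSDValue.
#[derive(Debug, PartialEq)]
pub enum Value {
    Integer(i32),
    Array(Vec<Value>),
}

/// Bundles the source text with a position-tracking iterator over it.
struct Parser<'a> {
    src: &'a str,
    chars: Peekable<CharIndices<'a>>,
}

impl<'a> Parser<'a> {
    fn new(src: &'a str) -> Self {
        Parser { src, chars: src.char_indices().peekable() }
    }

    /// Byte offset of the next char, or the input length at end of input.
    fn pos(&mut self) -> usize {
        self.chars.peek().map_or(self.src.len(), |&(i, _)| i)
    }

    fn skip_whitespace(&mut self) {
        while self.chars.next_if(|&(_, c)| c.is_ascii_whitespace()).is_some() {}
    }

    fn parse_value(&mut self) -> Result<Value, String> {
        self.skip_whitespace();
        match self.chars.next() {
            Some((_, '[')) => self.parse_array(),
            Some((_, 'i')) => self.parse_integer(),
            Some((i, c)) => Err(format!("unexpected {c:?} at byte {i}")),
            None => Err("unexpected end of input".to_string()),
        }
    }

    /// Parses the digits after 'i' by slicing the source, with no
    /// temporary String. A sign is accepted only in leading position.
    fn parse_integer(&mut self) -> Result<Value, String> {
        let start = self.pos();
        let _ = self.chars.next_if(|&(_, c)| c == '+' || c == '-');
        while self.chars.next_if(|&(_, c)| c.is_ascii_digit()).is_some() {}
        let end = self.pos();
        self.src[start..end]
            .parse::<i32>()
            .map(Value::Integer)
            .map_err(|e| format!("bad integer at byte {start}: {e}"))
    }

    /// Parses items after '['. Each item must be followed by ',' or ']'.
    fn parse_array(&mut self) -> Result<Value, String> {
        let mut items = Vec::new();
        self.skip_whitespace();
        if self.chars.next_if(|&(_, c)| c == ']').is_some() {
            return Ok(Value::Array(items)); // empty array
        }
        loop {
            items.push(self.parse_value()?);
            self.skip_whitespace();
            match self.chars.next() {
                Some((_, ',')) => continue,
                Some((_, ']')) => return Ok(Value::Array(items)),
                Some((i, c)) => {
                    return Err(format!("expected ',' or ']' but found {c:?} at byte {i}"))
                }
                None => return Err("unterminated array".to_string()),
            }
        }
    }
}

impl FromStr for Value {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        let mut parser = Parser::new(s);
        let value = parser.parse_value()?;
        parser.skip_whitespace();
        match parser.chars.next() {
            None => Ok(value),
            Some((i, c)) => Err(format!("trailing {c:?} at byte {i}")),
        }
    }
}
```

With FromStr implemented, callers can write `"[i1, i2]".parse::<Value>()`. Note that holding `&'a str` next to an iterator borrowing from it is fine here; the self-referential problem only appears when the struct owns the String.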

What I ended up doing:

Peekable, basically. Works.
