TakeWhile iterator over chars to string slice


#1

Hi,

I’m writing a lexer and I want it to work with a minimum of allocations.
The TokenStream wraps a Peekable Chars iterator and itself implements the Iterator trait, i.e. it consumes a char iterator while exposing a token iterator.

The comments are inline; basically I’m looking for a way to get a string slice from a TakeWhile iterator, as I’ve seen that it’s possible to do so for the Chars one.

I think I might be missing some Rust feature that would allow me to do that, i.e. a way to use a method of the wrapped iterator.

Thanks

struct TokenStream<'a> {
    it: std::iter::Peekable<std::str::Chars<'a>>,
}

impl<'a> Iterator for TokenStream<'a> {
    type Item = Token<'a>;

    fn next(&mut self) -> Option<Token<'a>> {
        match self.it.peek() {
            Some(&ch) => match ch {
                '0' ... '9' => {
                    // Error: no as_str method on the TakeWhile iterator.
                    // The Chars iterator has this method, and it returns a string slice.
                    // I'm looking for how to do this with the iterators.
                    // Do I have to implement as_str for the TakeWhile iterator?
                    // Or is there a way to access the as_str method of the wrapped Chars iterator?
                    Some(Token::Number(self.it.take_while(|a| a.is_numeric()).as_str()))
                },
                '+' => {
                    self.it.next().unwrap();
                    Some(Token::Operator(Symbol::Plus))
                },
                _ => Some(Token::End)
            },
            None => None
        }
    }
}

The compiler error:

error: no method named `as_str`     found for type `std::iter::TakeWhile<std::iter::Peekable<std::str::Chars<'a>>, [closure@src/main.rs:67:30: 67:48]>` in the current scope
  --> src/main.rs:67:50
   |
67 | 		    				self.it.take_while(|a| a.is_numeric()).as_str()))
   | 		    				                                       ^^^^^^

error: aborting due to previous error

#2

Here’s how I’d implement this with minimal changes to your code:

struct TokenStream<'a> {
    it: std::str::Chars<'a>,
}

#[derive(Debug)]
enum Token<'a> {
    Number(&'a str),
    Plus,
    End,
}


impl<'a> Iterator for TokenStream<'a> {
    type Item = Token<'a>;

    fn next(&mut self) -> Option<Token<'a>> {
        match self.it.clone().next() {
            Some(ch) => {
                match ch {
                    '0'...'9' => {
                        let str = self.it.as_str();
                        while self.it.clone().next().map_or(false, |ch| ch.is_numeric()) {
                            self.it.next();
                        }
                        Some(Token::Number(&str[..str.len() - self.it.as_str().len()]))
                    }
                    '+' => {
                        self.it.next();
                        Some(Token::Plus)
                    }
                    _ => {
                        self.it.next();
                        Some(Token::End)
                    }
                }
            }
            None => None,
        }
    }
}

fn main() {
    let mut ts = TokenStream { it: "123+456+789+z".chars() };
    println!("{:?}", ts.collect::<Vec<_>>());
}

Output:

[Number("123"), Plus, Number("456"), Plus, Number("789"), Plus, End]

Instead of std::iter::Peekable<std::str::Chars>, I’m using std::str::Chars directly and cloning it when I need to “peek”. Cloning std::str::Chars is pretty cheap – it’s just 2 pointers. In some lexers I wrote a long time ago I found cloning like this to be faster than using Peekable, at least for the Chars and CharIndices iterators.
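To make the clone-as-peek trick stand on its own, here’s a minimal sketch (not from the post above, just an illustration): cloning the Chars iterator and advancing the clone looks ahead without consuming anything from the original.

```rust
fn main() {
    let mut it = "ab".chars();

    // Peek by cloning: only the clone advances, the original does not.
    let peeked = it.clone().next();
    assert_eq!(peeked, Some('a'));

    // The original iterator still yields 'a' first.
    assert_eq!(it.next(), Some('a'));
    assert_eq!(it.next(), Some('b'));
}
```

This is what the `self.it.clone().next()` calls in the lexer rely on: Chars is just a pair of pointers into the underlying str, so the clone is trivially cheap.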

You should probably pull the substring-extraction logic in the Number case out into a separate function, as you’ll most likely need it for other tokens as well.
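For instance, the extracted helper could look something like this (a sketch; the name `scan_while` is my own, not from the post):

```rust
// Consume chars from `it` while `pred` holds, and return the consumed
// prefix as a slice of the underlying string.
fn scan_while<'a>(it: &mut std::str::Chars<'a>, pred: impl Fn(char) -> bool) -> &'a str {
    let start = it.as_str();
    // Peek by cloning; only advance the real iterator while the predicate matches.
    while it.clone().next().map_or(false, &pred) {
        it.next();
    }
    // The matched prefix is the part of `start` that `it` has moved past.
    &start[..start.len() - it.as_str().len()]
}

fn main() {
    let mut it = "123+456".chars();
    let number = scan_while(&mut it, |c| c.is_numeric());
    println!("{}", number); // prints "123"
}
```

The Number arm then becomes `Some(Token::Number(scan_while(&mut self.it, |c| c.is_numeric())))`, and the same helper works for identifiers, whitespace runs, and so on by swapping the predicate.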


#3

Thank you,

That’s exactly what I was looking for. I had temporarily gone with storing indexes in the token instead of string slices, but this seems much better.