Ergonomics of a lexer and iterators

Hello!

I've been writing a toy lexer for a language. In my dirty prototype, I did something like this:

enum Token { Eof, SomethingElse, ... }

struct Lexer { ... }
impl Lexer {
    fn peek(&mut self) -> Fallible<Token> { ... }
    fn next(&mut self) -> Fallible<Token> { ... }
}

which lends itself to an API that is easy to use on the caller side thanks to the try operator:

match lexer.next()? {
  Token::Eof => break,
  Token::SomethingElse => ...,
}

But now that I'm cleaning up the code, one thing that caught my attention was a Clippy warning about my next() method, saying it should be implemented via the Iterator trait. That sounds like a good idea, because I could then rely on Peekable and get more natural behavior.

The "problem" is that in converting to an iterator, next becomes:

fn next(&mut self) -> Option<Fallible<Token>>;

And I feel that this has worse ergonomics on the caller side, because now I have to do things like:

match lexer.next() {
    None => break,
    Some(Ok(Token::SomethingElse)) => ...,
    Some(Err(e)) => return Err(e), // bubble the error up
    Some(Ok(_)) => ..., // return an "invalid token" error
}

The inability to use the try operator feels annoying. On the other hand, this design doesn't require a fictitious Eof token and it integrates better with the common interfaces for iteration.

So I'm curious: what do you think? Is there any good reason to not go for the iterator-based approach?

Thanks!

You can use Option::transpose to convert Option<Result<T, E>> to Result<Option<T>, E>, and then use the try operator.
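
A minimal sketch of the transpose approach. The Token variants, the LexError type, and the count_plus function are made up for illustration; any iterator yielding Result<Token, LexError> stands in for the real lexer.

```rust
#[derive(Debug, PartialEq)]
enum Token {
    Ident(String),
    Plus,
}

// Hypothetical error type, standing in for whatever Fallible wraps.
#[derive(Debug, PartialEq)]
struct LexError(String);

// Counts Plus tokens, stopping at the first lexer error.
fn count_plus(
    mut lexer: impl Iterator<Item = Result<Token, LexError>>,
) -> Result<usize, LexError> {
    let mut count = 0;
    loop {
        // transpose() turns Option<Result<Token, LexError>> into
        // Result<Option<Token>, LexError>, so `?` can bubble the error up.
        match lexer.next().transpose()? {
            None => break, // end of input, no Eof token needed
            Some(Token::Plus) => count += 1,
            Some(_) => {} // any other token is ignored in this sketch
        }
    }
    Ok(count)
}
```

With this shape the match has the same number of arms as the original Eof-based version, and the error arm disappears into the `?`.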

Consider using fallible_iterator.

I had tried that, but transpose doesn't play well with peek because of the inner reference, so it didn't completely fix my "problems".
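
To make the peek issue concrete: Peekable::peek returns Option<&Result<Token, E>>, so the error sits behind a shared reference and cannot be moved out for `?`. One possible workaround, sketched below, is a helper that consumes the item when it is an error. The peek_token name, the Token variants, and LexError are hypothetical, and consuming the error on peek is a design choice that may not suit every parser.

```rust
use std::iter::Peekable;

#[derive(Debug, PartialEq)]
enum Token {
    Ident(String),
    Plus,
}

// Hypothetical error type for illustration.
#[derive(Debug, PartialEq)]
struct LexError(String);

// Peeks at the next token, returning Result<Option<&Token>, LexError>
// so that callers can write `peek_token(&mut lexer)?`.
fn peek_token<I>(lexer: &mut Peekable<I>) -> Result<Option<&Token>, LexError>
where
    I: Iterator<Item = Result<Token, LexError>>,
{
    // An error cannot be moved out of the reference that peek() hands back,
    // so if the upcoming item is an error, consume it and return it by value.
    if matches!(lexer.peek(), Some(Err(_))) {
        if let Some(Err(e)) = lexer.next() {
            return Err(e);
        }
    }
    match lexer.peek() {
        Some(Ok(token)) => Ok(Some(token)),
        // Only None reaches this arm; errors were handled above.
        _ => Ok(None),
    }
}
```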

I had seen that, but I am trying to avoid as many dependencies as possible and trying to fit stuff into the "standard" interfaces. (I know, I know, my mention of Fallible doesn't align with what I just said :P)

In that case fallible-iterator is exactly the abstraction you want.

My current favorite interface for getting the next token is fn next_token(&mut self) -> Token.

  • I don’t find it useful to single out the EOF condition. In the parser, you generally handle EOF and any other wrong token the same way, so it’s cleaner to have one unified error path for both conditions by using a dedicated Eof token.
  • I think it’s better not to pollute the lexer with io::Errors. Working with &str is usually fine: source files are usually smaller than 10 MB, so keeping them in memory is not a problem. If I had a requirement to work with an arbitrary Read, I’d probably wrap it in a type that just returns EOF on error and communicates the actual error via a side channel.
  • Similarly, I think it’s a good idea not to report lexer errors (malformed tokens) as a Result. Instead, one can return an explicit Error token, emit the error via a diagnostic/side channel, or store error flags (like unclosed quotes) in the token itself. That way the lexer doesn’t need to stop after the first error.
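
A rough sketch of what such an interface could look like, assuming a toy grammar of numbers and `+`. The Token variants, the Lexer fields, and the choice to carry the offending character in the Error token are all illustrative assumptions, not something prescribed above.

```rust
use std::iter::Peekable;
use std::str::Chars;

#[derive(Debug, Clone, PartialEq)]
enum Token {
    Number(String),
    Plus,
    // Malformed input becomes a token instead of an Err, so lexing continues.
    Error(char),
    // Dedicated end-of-input token; returned again on every later call.
    Eof,
}

// Works on an in-memory &str, so no io::Error can occur while lexing.
struct Lexer<'a> {
    chars: Peekable<Chars<'a>>,
}

impl<'a> Lexer<'a> {
    fn new(source: &'a str) -> Self {
        Lexer { chars: source.chars().peekable() }
    }

    fn next_token(&mut self) -> Token {
        // Skip whitespace between tokens.
        while self.chars.next_if(|c| c.is_whitespace()).is_some() {}

        let first = match self.chars.next() {
            Some(c) => c,
            None => return Token::Eof,
        };
        match first {
            '+' => Token::Plus,
            '0'..='9' => {
                // Accumulate the remaining ASCII digits of the number.
                let mut text = String::from(first);
                while let Some(d) = self.chars.next_if(|c| c.is_ascii_digit()) {
                    text.push(d);
                }
                Token::Number(text)
            }
            other => Token::Error(other),
        }
    }
}
```

In a parser built on this, the "unexpected token" arm of a match would cover Eof and Error alike, which is the single error path described in the first bullet.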

Oh thanks, encoding the malformed tokens as a token itself sounds like a good idea. And splitting these conditions from I/O errors has already uncovered a couple of subtle issues in error handling :)