Creating a custom lexer for LALRPOP

My code: GitHub - madebyjeffrey/ParseTest, an experimental project to write a bit more Rust code and eventually create an assembler, using the LALRPOP library.

I am starting to design an assembler for a fantasy/virtual CPU I am in the process of creating.

After discovering that LALRPOP's built-in lexer can't be customized, and that I need line endings as tokens for the assembler, I had to figure out how to write a custom lexer. The documentation was pretty nice, and I got some help from one of the community Discord servers on a few issues, since I am coming back to Rust after some time away.

The specific review I would like is: is there a better way to do the lexer? But anything else is fair game. I don't expect any of it to be great code, but all of it has tests (some weirdness in cp437_input is there to help with testing).


Just as a side note, for information: the other classical way to build a lexer is to transform a set of regular expressions into a DFA (Deterministic Finite Automaton, i.e. a finite state machine) and, from there, either translate the state machine directly into source code or drive a table-based lexer from its transition table. The best source I've found on the subject is the Dragon Book (Compilers: Principles, Techniques, and Tools, 2nd edition, by Aho, Lam, Sethi, and Ullman), chapter 3, in particular section 3.9.5.

Typically, those regular expressions are parsed by a tool from a human-readable file describing the lexicon in terms of regular expressions (Id: [a-zA-Z][a-zA-Z]*). Lexer generators like lex / flex do that transformation for you and generate source code (lex is quite old and produces C; I've written one that produces Rust, but it's not public just yet).
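To make that concrete, here is roughly the shape of a table-driven recognizer for the Id rule above; this is an illustrative sketch, not the output of any particular generator:

// Transition table for `Id: [a-zA-Z][a-zA-Z]*`.
// Character classes: 0 = ASCII letter, 1 = anything else.
const DEAD: usize = 2;

const TRANSITION: [[usize; 2]; 3] = [
    [1, DEAD],    // state 0 (start): a letter enters the accepting state
    [1, DEAD],    // state 1 (accepting): keep consuming letters
    [DEAD, DEAD], // state 2: dead state
];

const ACCEPTING: [bool; 3] = [false, true, false];

fn class_of(c: char) -> usize {
    if c.is_ascii_alphabetic() { 0 } else { 1 }
}

// Length in bytes of the longest `Id` prefix of `input`, if any,
// e.g. match_id("abc def") == Some(3).
fn match_id(input: &str) -> Option<usize> {
    let mut state = 0;
    let mut last_accept = None;
    for (i, c) in input.char_indices() {
        state = TRANSITION[state][class_of(c)];
        if state == DEAD {
            break;
        }
        if ACCEPTING[state] {
            last_accept = Some(i + c.len_utf8());
        }
    }
    last_accept
}

A real generator merges every rule of the lexicon into one automaton and tags accepting states with the token they produce, but the driver loop stays about this small.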

The advantages are that you can modify the lexicon without having to change all your code, and you have the option of a table-based lexer. You also get a clear view of your lexicon, since the source is human-readable, which lowers the risk of errors, especially when the lexicon changes over time. The disadvantage is that it leaves less room for customization if you need to do something special that isn't supported by the lexer generator.

There's also some debate about whether a state machine generated as source code or a table-driven lexer is quicker. I think it depends on the size of the lexer, the extent of its alphabet, and the context (data types, respective CPU cache sizes, etc.). But it's clear that if you're hand-writing the code instead, you can optimize it at will. 🙂

However, I wouldn't bother with all that for assembly language, which is rather straightforward. I think the way you did it is simple and worry-free, but it's worth keeping in mind in case you have to extend the lexer repeatedly or if some tokens become cumbersome to handle with a match.

For the code itself,

  • I don't know LALRPOP, but doesn't it support errors? I see you're skipping any illegal character, which may not be a good thing. It's better to produce an error, so the user isn't confused when something goes wrong because of a typo.
  • Potential improvement: your code panics if the input is incorrect, for example on a malformed hex number. Again, if an error path exists, it's best to use it and report the badly formed number in a log. Chances are the parser can recover and parse more of the source file to report further errors, letting the user correct the whole file in one go rather than facing a succession of crashes.
  • Nit-picking: it would seem more logical to name is_letter is_letter_or_underscore, and it would be safer to call it from is_letter_or_digit_or_underscore rather than rewriting the same code. I'm not entirely sure the compiler will inline this automatically, so that's something to check, or to test in the Compiler Explorer, maybe.
  • You could also make all those functions methods of a trait implemented for char rather than C-like free functions (see the first sketch after this list). Actually, I'm not sure why you redefine some of them, like is_hex_digit and is_digit, rather than using the native ones, and why you don't follow the standard naming convention for the ones you do define.
  • For the tests, I see several unit tests, if not all of them, following the same pattern: you could merge them into one test by creating a let tests: Vec<(&str, Vec<Tok>)> = ..., a list of inputs and their expected tokens (see the second sketch after this list). It will give you a clearer view of the tests, and it will encourage you to add more of them, since you don't have to write a whole function for each input.
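To illustrate those last two points, a trait over char could look like this (a sketch, assuming identifiers stay in the ASCII range):

// The character-class helpers as a trait on `char` instead of free
// functions, with the wider predicate reusing the narrower one.
trait CharClass {
    fn is_letter_or_underscore(self) -> bool;
    fn is_letter_or_digit_or_underscore(self) -> bool;
}

impl CharClass for char {
    fn is_letter_or_underscore(self) -> bool {
        self.is_ascii_alphabetic() || self == '_'
    }

    fn is_letter_or_digit_or_underscore(self) -> bool {
        self.is_letter_or_underscore() || self.is_ascii_digit()
    }
}

And the merged test could be shaped like this; it assumes a lex(&str) -> impl Iterator<Item = Tok> entry point and makes up the Tok variants:

#[test]
fn lexer_cases() {
    // One table of (input, expected tokens) instead of one function per input.
    let cases: Vec<(&str, Vec<Tok>)> = vec![
        ("", vec![]),
        ("\n", vec![Tok::Newline]),
        ("abc", vec![Tok::Ident("abc".into())]),
    ];
    for (input, expected) in cases {
        let actual: Vec<Tok> = lex(input).collect();
        assert_eq!(actual, expected, "for input {input:?}");
    }
}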

I had a quick look (I could have seen that in your source the first time...) and saw that LALRPOP takes an iterator over

pub type Spanned<Tok, Loc, Error> = Result<(Loc, Tok, Loc), Error>;

and that there's a way to enable error recovery in the parser at the cost of a little more code in the lexer. I suppose you saw that, but I have no idea how much you know about lexers and parsers, so sorry if this sounds trivial.
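For reference, a minimal lexer with that shape could look like the following; the token set is a toy one, not yours:

use std::str::CharIndices;

pub type Spanned<Tok, Loc, Error> = Result<(Loc, Tok, Loc), Error>;

#[derive(Debug, PartialEq)]
pub enum Tok {
    Newline,
    Comma,
}

#[derive(Debug, PartialEq)]
pub enum LexicalError {
    InvalidChar(usize, char),
}

pub struct Lexer<'input> {
    chars: CharIndices<'input>,
}

impl<'input> Lexer<'input> {
    pub fn new(input: &'input str) -> Self {
        Lexer { chars: input.char_indices() }
    }
}

impl<'input> Iterator for Lexer<'input> {
    type Item = Spanned<Tok, usize, LexicalError>;

    fn next(&mut self) -> Option<Self::Item> {
        loop {
            return match self.chars.next()? {
                (_, ' ') | (_, '\t') => continue, // skip blanks
                (i, '\n') => Some(Ok((i, Tok::Newline, i + 1))),
                (i, ',') => Some(Ok((i, Tok::Comma, i + 1))),
                // Surface unknown characters instead of silently skipping them.
                (i, c) => Some(Err(LexicalError::InvalidChar(i, c))),
            };
        }
    }
}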

Some paths already return that error, so I'd recommend at least changing the remaining panics into error tokens, or at least errors; e.g., translating the result of from_str_radix() into either an Ok carrying a dedicated token for a lexical error, or an Err with the appropriate error type, instead of unwrapping it. Maybe a simple ? even works, if LALRPOP handles that error type.
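Concretely, for the hex case, that could look like this (the token and error names are hypothetical stand-ins for yours):

// Hypothetical types; the real ones live in the lexer module.
pub enum Tok { Number(u16) }
pub enum LexicalError { BadNumber(usize, String) }
pub type Spanned<Tok, Loc, Error> = Result<(Loc, Tok, Loc), Error>;

fn lex_hex(digits: &str, start: usize, end: usize) -> Spanned<Tok, usize, LexicalError> {
    match u16::from_str_radix(digits, 16) {
        Ok(value) => Ok((start, Tok::Number(value), end)),
        // Report the malformed literal instead of unwrapping and panicking.
        Err(_) => Err(LexicalError::BadNumber(start, digits.to_string())),
    }
}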

That would still stop the parser on the first error, but it would at least show a proper error message to the user instead of crashing with a confusing panic.

The next step would be to support error recovery in the parser, for what it's worth, so that it can try to parse the rest of the source and potentially detect further errors. There's a range of error recovery strategies; the better ones involve changes to the grammar, but the simple ones handled automatically by the parser are usually just fine (they consist of skipping tokens, or parts of a rule, until the parser can resynchronize).
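In LALRPOP, if I read the documentation correctly, that means letting some rules match the special ! symbol and collecting the recovered errors into a Vec declared as a grammar parameter; the calling side would then look roughly like this (ProgramParser and the type parameters are assumptions about your grammar):

use lalrpop_util::ErrorRecovery;

fn parse_source(source: &str) {
    // Filled by the `!` recovery rules in the grammar.
    let mut errors: Vec<ErrorRecovery<usize, Tok, LexicalError>> = Vec::new();
    let result = grammar::ProgramParser::new().parse(&mut errors, Lexer::new(source));

    for recovered in &errors {
        eprintln!("syntax error (recovered): {:?}", recovered.error);
    }
    match result {
        Ok(_program) => { /* assemble */ }
        Err(e) => eprintln!("syntax error (fatal): {e:?}"),
    }
}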

I'm not quite sure what you're after with "a better way to do this", but if you're just trying to get a high-quality lexer quickly, I can't recommend logos enough.
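For a taste of it, a logos token definition for an assembler-flavored lexicon can be as short as this (a sketch, assuming a recent logos version where the lexer iterator yields Results):

use logos::Logos;

#[derive(Logos, Debug, PartialEq)]
#[logos(skip r"[ \t]+")] // skip horizontal whitespace, but keep newlines
enum Token {
    #[token("\n")]
    Newline,
    #[regex("[a-zA-Z_][a-zA-Z0-9_]*")]
    Ident,
    #[regex("0x[0-9a-fA-F]+")]
    HexNumber,
}

fn main() {
    let mut lex = Token::lexer("mov 0xFF\n");
    assert_eq!(lex.next(), Some(Ok(Token::Ident)));
    assert_eq!(lex.next(), Some(Ok(Token::HexNumber)));
    assert_eq!(lex.next(), Some(Ok(Token::Newline)));
    assert_eq!(lex.next(), None);
}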

If this is more "how do I lex and parse in general", I'm personally more on the "just write the code, it's not that hard, just a bit messy" side: fundamentally, lexing is just a function with the (moral) signature:

fn lex(input: &str) -> (Token, /* remaining */ &str)

(you generally shouldn't "fail" a lex, in my opinion; instead, return an error token and advance the input) and parsing with the very similar:

fn parse(input: Lexer) -> Result<(T, /* remaining */ Lexer)>

There's a lot of detail, of course: parsers generally need the ability to "soft" fail so you can backtrack, but also to "hard" fail to improve error messages when you know no other path could work; you may want to emit parse events rather than build an explicit AST; and parsing binary expressions with precedence is probably worth handling explicitly with, e.g., a Pratt parser to reduce boilerplate. But overall it pretty quickly starts to read very naturally if you're careful about how you design your parser interface: you want either a mutable Input, or to return a pair so you can write let (foo, input) = parse_foo(input)?;
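To illustrate the pair-returning style on something assembler-flavored (all names here are made up):

#[derive(Debug)]
struct Mov {
    dst: String,
    src: String,
}

// Each parser consumes a prefix of `input` and hands back the rest, so
// steps chain as `let (x, input) = parse_x(input)?;`.
fn parse_ident(input: &str) -> Result<(String, &str), String> {
    let end = input
        .find(|c: char| !(c.is_ascii_alphanumeric() || c == '_'))
        .unwrap_or(input.len());
    if end == 0 {
        return Err(format!("expected identifier, found {input:?}"));
    }
    Ok((input[..end].to_string(), &input[end..]))
}

fn parse_mov(input: &str) -> Result<(Mov, &str), String> {
    let input = input.strip_prefix("mov ").ok_or("expected 'mov'")?;
    let (dst, input) = parse_ident(input)?;
    let input = input.strip_prefix(", ").ok_or("expected ','")?;
    let (src, input) = parse_ident(input)?;
    Ok((Mov { dst, src }, input))
}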

You will probably want to build some "generic parsers", like "comma-separated list of (other parser)". If you go all in on that, you get parser combinator libraries like nom, which can parse down to the byte, eliminating the need for a lexer... except that you often then need to recreate one anyway, due to expectations around keywords and operators: you don't want to parse "whilex" as "while x", for example.
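In that style, the "comma-separated list" helper is just a function taking the item parser as an argument; a sketch, reusing the hypothetical signatures from above:

// "Separated list of X" for any item parser with the (value, remaining) shape.
fn separated_list<'a, T>(
    mut input: &'a str,
    sep: &str,
    item: impl Fn(&'a str) -> Result<(T, &'a str), String>,
) -> Result<(Vec<T>, &'a str), String> {
    let mut items = Vec::new();
    let (first, rest) = item(input)?;
    items.push(first);
    input = rest;
    while let Some(rest) = input.strip_prefix(sep) {
        let (next, rest) = item(rest)?;
        items.push(next);
        input = rest;
    }
    Ok((items, input))
}

For example, separated_list("r1, r2, r3", ", ", parse_ident) yields the three names and an empty remainder.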


Thank you for mentioning logos. It looks pretty good. I was looking into creating a parser combinator library to learn how they are put together, but a recursive descent parser would also work.

I think I will look at logos for the lexer, as it seems like the right tool and does what I was originally thinking of.

I will see if I can integrate it with LALRPOP, or look at a recursive descent parser, since it is a relatively small language.

You could easily parse assembly language with a relatively small recursive descent parser, so it's not a bad idea.

If you're looking for a good book on recursive descent parsers, I can't recommend Writing a C Compiler by Nora Sandler enough. It goes a little beyond what you need, but not too far if you consider adding macros.

If you have to parse expressions to evaluate constants with such a parser, check out the precedence climbing method to resolve the ambiguities that come with binary operators (precedence and left-/right-associativity). It's simple and will save you headaches. It's covered in that book, but it's easy to find online too, like here.[1]
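For reference, the core of precedence climbing fits in a screenful; this sketch evaluates on the fly instead of building an AST, and the operator set is made up:

use std::iter::Peekable;

#[derive(Clone, Copy)]
enum Op { Add, Sub, Mul, Div }

enum Tok {
    Num(i64),
    Op(Op),
}

// All four operators are left-associative here.
fn prec(op: Op) -> u8 {
    match op {
        Op::Add | Op::Sub => 1,
        Op::Mul | Op::Div => 2,
    }
}

fn apply(op: Op, l: i64, r: i64) -> i64 {
    match op {
        Op::Add => l + r,
        Op::Sub => l - r,
        Op::Mul => l * r,
        Op::Div => l / r,
    }
}

// Parse `atom (op atom)*`, consuming only operators of precedence >= min_prec;
// recursing with prec + 1 makes each level left-associative.
fn climb<'a, I: Iterator<Item = &'a Tok>>(toks: &mut Peekable<I>, min_prec: u8) -> i64 {
    let mut lhs = match toks.next() {
        Some(Tok::Num(n)) => *n,
        _ => panic!("expected a number"),
    };
    while let Some(Tok::Op(op)) = toks.peek() {
        let op = *op;
        if prec(op) < min_prec {
            break;
        }
        toks.next(); // consume the operator
        let rhs = climb(toks, prec(op) + 1);
        lhs = apply(op, lhs, rhs);
    }
    lhs
}

fn main() {
    // 2 + 3 * 4 evaluates to 14, not 20.
    let toks = [
        Tok::Num(2), Tok::Op(Op::Add), Tok::Num(3), Tok::Op(Op::Mul), Tok::Num(4),
    ];
    assert_eq!(climb(&mut toks.iter().peekable(), 0), 14);
}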


  1. I see that article mentions the old Clarke method in its references. It's in fact a little different, and it applies wonderfully to non-recursive LL(1) parsers, but I think you can deduce the precedence climbing algorithm from there, too.
