Writing a lisp/sexpr parser from scratch using only regex

I know Rust provides various parser builder DSLs. I know Rust provides many crates for parsing sexp/lisp.

Question: Is there a straightforward way to implement a lisp/sexpr parser from scratch, using only std and the regex crate?

Nested (* *) comments cannot be expressed as a regex and need to be handled separately. Strings with escape chars might also need custom handling.

For everything else, can we tokenize using just regex?

Check out the Chomsky hierarchy. Sexprs form a context-free grammar due to their recursive nature, while regular expressions (regex) are a tool for representing/parsing regular grammars.

3 Likes

I agree with the CFG/regex statement, but it does not invalidate what I am trying to do.

I do not need to express the whole sexp grammar as a regex. I only need to express symbols, numbers, (, ), [, ], and ; as regexes. Besides (* *), which requires 2-lookahead, I think the rest of sexp is just 1-lookahead. So it should be something like:

match [first two chars] {
  _ -> read a sexp
  _ -> read a sexp
  _ -> read a comment
  _ -> read a sexp
  _ -> read a string
}


or something like that.
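
For concreteness, here is a self-contained sketch of that dispatch (the dispatch() function and the string descriptions are purely illustrative, not anyone's real lexer):

// Purely illustrative: look at the first one or two characters of the
// remaining input to decide what to read next. Only "(*" needs the
// second character; everything else is 1-lookahead.
fn dispatch(rest: &str) -> &'static str {
  let mut chars = rest.chars();
  match (chars.next(), chars.next()) {
    (Some('('), Some('*')) => "read a nested (* ... *) comment by hand",
    (Some('(') | Some('['), _) => "read a list",
    (Some(';'), _) => "read a line comment",
    (Some('"'), _) => "read a string (escapes handled by hand)",
    (Some(_), _) => "read an atom (symbol or number)",
    (None, _) => "end of input",
  }
}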

So you are writing a lexer and not a parser?

Lisps are simple enough that you don't even need regex for tokenizing.

I normally create some sort of Lexer struct that keeps track of the text and current index, then have a top-level method which will peek at the next character and run the corresponding lexer routine to parse the rest of the token (open/close parens, identifiers, etc.).

struct Lexer<'src> {
  text: &'src str,
  position: usize,
}

impl<'src> Lexer<'src> {
  pub fn next_token(&mut self) -> Result<Token<'src>, LexerError> {
    match self.next_char() {
      Some('(') => self.tokenize_open_paren(),
      Some(')') => self.tokenize_close_paren(),
      Some(letter) if letter.is_alphabetic() => self.tokenize_ident(),
      Some(digit) if digit.is_numeric() => self.tokenize_number(),
      None => Ok(Token::eof()),
      ...
    }
  } 
  
  /// A helper method that will consume text and advance the `Lexer` 
  /// until the `predicate` returns `false`. Returns `None` if the predicate 
  /// is never satisfied or the lexer is at EOF.
  fn take_while(&mut self, predicate: impl FnMut(char) -> bool) -> Option<&'src str> { ... }
}

struct Token<'src> {
  kind: TokenKind,
  text: &'src str,
}

enum TokenKind {
  OpenParen,
  CloseParen,
  Ident,
  EOF,
  ...
}

You'd probably end up with a big match statement even if you used regular expressions, so writing everything yourself isn't such a big deal. This code is all mostly mechanical and trivial to test, so I find the lexer pretty easy to maintain. The regex crate is also a pretty big dependency, so you'll improve compile times by avoiding it.

Once you have a stream of tokens, the easiest way to turn that into an Abstract Syntax Tree that can be executed/analysed is with a recursive descent parser. This isn't something regex can help you with because regular expressions can only deal with "regular" languages (i.e. syntax with no recursion).
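
For example (a minimal sketch assuming the Token/TokenKind types above; Sexpr, ParseError, peek_kind(), next_token(), and expect() are hypothetical names, just to show the shape of the recursion):

enum Sexpr {
  Atom(String),
  List(Vec<Sexpr>),
}

struct Parser<'src> {
  lexer: Lexer<'src>,
}

impl<'src> Parser<'src> {
  /// Parse one expression: either an atom or a parenthesised list.
  fn parse_expr(&mut self) -> Result<Sexpr, ParseError> {
    match self.peek_kind()? {
      TokenKind::OpenParen => self.parse_list(),
      TokenKind::Ident => {
        let token = self.next_token()?;
        Ok(Sexpr::Atom(token.text.to_string()))
      }
      other => Err(ParseError::Unexpected(other)),
    }
  }

  /// "(" expr* ")" : the recursion that regex can't express happens
  /// through the call back into parse_expr().
  fn parse_list(&mut self) -> Result<Sexpr, ParseError> {
    self.expect(TokenKind::OpenParen)?;
    let mut items = Vec::new();
    while self.peek_kind()? != TokenKind::CloseParen {
      items.push(self.parse_expr()?);
    }
    self.expect(TokenKind::CloseParen)?;
    Ok(Sexpr::List(items))
  }
}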

  1. I completely agree with your high-level approach.

  2. The place I intend to use regexes is in the self.tokenize_*() methods. I think these are the places where dropping in a regex makes life easier; a sketch of what that could look like follows the patterns below. For example, a comment is:

;[^\n]*\n

symbol is:

[a-zA-Z_][a-zA-Z_0-9]*

etc ...
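
To illustrate, plugging the symbol pattern into one of those methods could look roughly like this (a sketch that assumes the Lexer/Token types from the post above and the regex crate; the ExpectedSymbol error variant is made up, and in real code you would compile the Regex once rather than on every call):

use regex::Regex;

impl<'src> Lexer<'src> {
  // Sketch: match the symbol pattern, anchored (\A) at the current position.
  fn tokenize_symbol(&mut self) -> Result<Token<'src>, LexerError> {
    let symbol_re = Regex::new(r"\A[a-zA-Z_][a-zA-Z_0-9]*").unwrap();
    let rest: &'src str = &self.text[self.position..];

    match symbol_re.find(rest) {
      Some(m) => {
        self.position += m.end();
        Ok(Token { kind: TokenKind::Ident, text: m.as_str() })
      }
      None => Err(LexerError::ExpectedSymbol),
    }
  }
}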

Fair enough. I personally find the take_while() version to be more readable than a regex.

impl<'src> Lexer<'src> {
  pub fn next_token(&mut self) -> Result<Token<'src>, LexerError> {
    ...

    match self.next_char() {
      Some(';') => self.tokenize_comment(),
      ...
    }
  }

  pub fn tokenize_comment(&mut self) -> Result<Token<'src>, LexerError> {
    // Note: We've just peeked and seen a ";". Now we read until 
    // the end of the line or file.
    let text = self.take_while(|c| c != '\n').unwrap_or_default();

    Ok(Token::new(TokenKind::Comment, text))
  }
}
1 Like

This is kind of mean, but how do you parse numbers?

It may or may not start with a "-"

There may or may not be a "." in the middle of the number.

The number may or may not have an "e###" or "E###" at the end, and the "###" may or may not have a "-" in it.

take_while seems to be great as long as we have a regex of the form [list of chars]*, but it seems to have problems handling alternation (|) or optional (?) parts.
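
For reference, the kind of one-line pattern covering the cases above (optional sign, optional fraction, optional exponent whose digits may be negative) might look something like:

-?[0-9]+(\.[0-9]+)?([eE]-?[0-9]+)?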

I can tell you how I do it (in all my lexers, in all languages): I never use regexps for "serious" parsing (and many of the "unserious" parsers I had to change after some time; for example, I should really refactor one existing sexp parser ;).

Simplified logic:

fn read_number() {
    // Read the integer part first; if a decimal separator follows,
    // continue reading a float, otherwise produce an integer.
    let try_decimal_int = take_while(is_decimal_digit);
    if is_decimal_separator(peek_char()) {
        continue_reading_float(try_decimal_int)
    } else {
        DecimalInt(to_integer(try_decimal_int))
    }
}

And a serious warning: number literals may be in a localised format (the decimal or thousands separator may be a "." or a "," or a ...), and you do not want to have to maintain such localisations in regexps.

And regexps don't let you skip e.g. thousands separators directly, like in 1_000_000.

1 Like

If you added handling for negative numbers as well as exponents, wouldn't this be more than 10 lines of code? I think a one-line regex is actually easier to debug.

Small note: the regex-lite crate now exists. It's nearly a drop-in replacement, but it compiles much more quickly and contributes a lot less to binary size.

(Not that I disagree with avoiding regex here anyway.)

1 Like

The three negatives sound like you are NOT in favor of using regex here.

Using regex for tokenizing seems to go back as far as lex & yacc. See the sample code in Lex (software) - Wikipedia.

What is your argument against it?

If you prefer regex, just use regex and see for yourself if you like it. To each their own. That's a valid way of doing it.

Personally, I mostly agree that it's easier to write loops. With regex you also have to write a few lines of code to compile it, handle errors, get the value out of the regex, convert the value to the right type, etc. And it is more restrictive if you want to add anything else to the logic (counting characters while you're scanning, handling some exceptions, etc.).
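
As a concrete illustration of those extra steps (compile the pattern, handle a failed match, get the matched text out, convert the type), a hypothetical regex-based number helper might look roughly like this:

use regex::Regex;

// Sketch only: in real code the Regex would be compiled once, not per call.
fn lex_number(input: &str) -> Result<f64, String> {
  let re = Regex::new(r"\A-?[0-9]+(\.[0-9]+)?([eE]-?[0-9]+)?")
    .map_err(|e| e.to_string())?;
  let m = re.find(input).ok_or("expected a number")?;
  m.as_str().parse::<f64>().map_err(|e| e.to_string())
}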

In your floating-point-number example, if you're going to parse the number yourself (i.e. get the integer part, the fractional part, the exponent, etc) you're going to write 10+ lines of code regardless of how you do it. If you're just validating and then splitting the prefix and delegating the parsing to another function, then it might make sense to use regex for this. But then, it would make even more sense for that other function to split the prefix itself (unfortunately FromStr isn't designed that way, shame).

Regarding lex, that's different in the sense that the whole lexer is defined in a domain-specific language; there regexes are more valuable, since you can't write loops in a lex file.

2 Likes

If it's something simple like parsing a float like 123.45, you could take advantage of the fact that take_while() accepts a FnMut closure and create a seen_decimal_place flag.
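
Something along these lines, reusing the take_while() helper from the earlier sketch (tokenize_number() and TokenKind::Number are hypothetical here):

impl<'src> Lexer<'src> {
  // Sketch: accept digits plus at most one '.' by letting the closure
  // capture a mutable flag.
  fn tokenize_number(&mut self) -> Result<Token<'src>, LexerError> {
    let mut seen_decimal_place = false;
    let text = self
      .take_while(|c| {
        if c.is_ascii_digit() {
          true
        } else if c == '.' && !seen_decimal_place {
          seen_decimal_place = true;
          true
        } else {
          false
        }
      })
      .unwrap_or_default();

    Ok(Token { kind: TokenKind::Number, text })
  }
}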

Otherwise, you can make a simple state machine or break it up into multiple take_while() calls.

I'm normally okay with writing extra lines of code if it means I can avoid a dependency or have control over things like token spans and error messages (e.g. expected a digit or ".", but found "$" is a lot more user-friendly than syntax error at "$").

I prefer to handle leading negatives as part of the parser rather than as part of the floating point literal, but that's not really a lexer thing: I just like being able to point at them separately in error messages and to have unsigned literals.

1 Like