Check out the Chomsky hierarchy. S-expressions form a context-free grammar due to their recursive nature. A regular expression, or regex, is a tool to represent/parse a regular grammar.
I agree with the cfg/regex statement, but it does not nullify what I am trying to do.
I do not need to express a whole sexp as a regex. I only need to express symbols, numbers, (, ), [, ], and ; as regexes. Besides (* *), which requires 2-lookahead, I think the rest of sexp syntax is just 1-lookahead. So it should be something like
match [first two chars] {
    ('(', '*') -> read a comment
    ('(', _)   -> read a sexp
    ('[', _)   -> read a sexp
    ('"', _)   -> read a string
    _          -> read a sexp (symbol, number, ...)
}
or something like that.
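In Rust that dispatch could be sketched roughly as follows; the Action enum and its variant names are placeholders for whatever reader routines the lexer would actually call, not anything from the post above:

// Sketch only: `rest` is the unread portion of the input.
enum Action {
    ReadBlockComment, // the (* ... *) case, the only one that needs the second char
    ReadLineComment,  // ; up to the end of the line
    ReadSexp,
    ReadString,
    ReadAtom,         // symbol, number, closing delimiter, ...
    Eof,
}

fn dispatch(rest: &str) -> Action {
    let mut chars = rest.chars();
    match (chars.next(), chars.next()) {
        (Some('('), Some('*')) => Action::ReadBlockComment,
        (Some('('), _) | (Some('['), _) => Action::ReadSexp,
        (Some(';'), _) => Action::ReadLineComment,
        (Some('"'), _) => Action::ReadString,
        (Some(_), _) => Action::ReadAtom,
        (None, _) => Action::Eof,
    }
}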
Lisps are simple enough that you don't even need regex for tokenizing.
I normally create some sort of Lexer struct that keeps track of the text and current index, then have a top-level method which will peek at the next character and run the corresponding lexer routine to parse the rest of the token (open/close parens, identifiers, etc.).
struct Lexer<'src> {
    text: &'src str,
    position: usize,
}

impl<'src> Lexer<'src> {
    pub fn next_token(&mut self) -> Result<Token<'src>, LexerError> {
        match self.next_char() {
            Some('(') => self.tokenize_open_paren(),
            Some(')') => self.tokenize_close_paren(),
            Some(letter) if letter.is_alphabetic() => self.tokenize_ident(),
            Some(digit) if digit.is_numeric() => self.tokenize_number(),
            None => Ok(Token::eof()),
            ...
        }
    }

    /// A helper method that will consume text and advance the `Lexer`
    /// until the `predicate` returns `false`. Returns `None` if the predicate
    /// is never satisfied or the lexer is at EOF.
    fn take_while(&mut self, predicate: impl FnMut(char) -> bool) -> Option<&'src str> { ... }
}
struct Token<'src> {
    kind: TokenKind,
    text: &'src str,
}

enum TokenKind {
    OpenParen,
    CloseParen,
    Ident,
    EOF,
    ...
}
Even if you used regular expressions, you'd probably end up with a big match statement anyway, so writing everything yourself isn't such a big deal. This code is all mostly mechanical and trivial to test, so I find the lexer to be pretty easy to maintain. The regex crate is also a pretty big dependency, so you'll improve compile times by avoiding it.
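As a rough idea of the kind of test this enables (assuming the tokenize_* methods are filled in and that TokenKind derives PartialEq and Debug; none of that is shown above):

#[test]
fn lexes_an_open_paren() {
    let mut lexer = Lexer { text: "(add 1 2)", position: 0 };
    let token = lexer.next_token().unwrap();
    assert_eq!(token.kind, TokenKind::OpenParen);
    assert_eq!(token.text, "(");
}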
Once you have a stream of tokens, the easiest way to turn that into an Abstract Syntax Tree that can be executed/analysed is with a recursive descent parser. This isn't something regex can help you with because regular expressions can only deal with "regular" languages (i.e. syntax with no recursion).
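To make that concrete, a bare-bones recursive descent parser over the token stream above might look something like the sketch below. The Ast, Parser, and ParseError names are mine, not from the post, and it assumes TokenKind derives Copy and that the lexer emits a trailing EOF token.

// The key point: parse_expr and parse_list call each other, and that
// recursion is exactly the nesting a regular expression can't express.
enum Ast<'src> {
    Atom(&'src str),
    List(Vec<Ast<'src>>),
}

enum ParseError {
    UnexpectedToken(TokenKind),
    UnexpectedEof,
}

struct Parser<'src> {
    tokens: Vec<Token<'src>>,
    position: usize,
}

impl<'src> Parser<'src> {
    fn peek_kind(&self) -> Option<TokenKind> {
        self.tokens.get(self.position).map(|t| t.kind)
    }

    fn advance(&mut self) -> Option<&Token<'src>> {
        let token = self.tokens.get(self.position);
        self.position += 1;
        token
    }

    fn parse_expr(&mut self) -> Result<Ast<'src>, ParseError> {
        match self.peek_kind() {
            Some(TokenKind::OpenParen) => self.parse_list(),
            Some(TokenKind::CloseParen) => Err(ParseError::UnexpectedToken(TokenKind::CloseParen)),
            Some(TokenKind::EOF) | None => Err(ParseError::UnexpectedEof),
            // Anything else (identifier, number, string, ...) is an atom.
            Some(_) => Ok(Ast::Atom(self.advance().unwrap().text)),
        }
    }

    fn parse_list(&mut self) -> Result<Ast<'src>, ParseError> {
        self.advance(); // consume the "("
        let mut items = Vec::new();
        loop {
            match self.peek_kind() {
                Some(TokenKind::CloseParen) => {
                    self.advance(); // consume the ")"
                    return Ok(Ast::List(items));
                }
                Some(TokenKind::EOF) | None => return Err(ParseError::UnexpectedEof),
                Some(_) => items.push(self.parse_expr()?),
            }
        }
    }
}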
The places I intend to use regexes are the self.tokenize_*() methods. I think those are the spots where dropping in a regex makes life easier. For example, a comment is just a ; followed by everything up to the end of the line.
Fair enough. I personally find the take_while() version to be more readable than a regex.
impl<'src> Lexer<'src> {
    pub fn next_token(&mut self) -> Result<Token<'src>, LexerError> {
        ...
        match self.next_char() {
            Some(';') => self.tokenize_comment(),
            ...
        }
    }

    pub fn tokenize_comment(&mut self) -> Result<Token<'src>, LexerError> {
        // Note: We've just consumed the leading ";". Now we read until
        // the end of the line or file.
        let text = self.take_while(|c| c != '\n').unwrap_or_default();
        Ok(Token::new(TokenKind::Comment, text))
    }
}
I can tell you how I do it (in all my lexers, in all languages): I never use regexps for "serious" parsing (and many of the "unserious" parsers I've had to change after some time; for example, I really should refactor one existing sexp parser ;).
Simplified logic:
fn readNumber() {
    let try_decimal_int = take_while(is_decimal_digit);
    if is_decimal_separator(peek_char()) {
        continue_reading_float(try_decimal_int)
    } else {
        DecimalInt(to_integer(try_decimal_int))
    }
}
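A rough, self-contained Rust version of that logic might look like the following; read_number, Number, and the slicing helpers are my stand-ins, and a real lexer would hang this off the Lexer struct and reuse its take_while() instead:

enum Number {
    DecimalInt(i64),
    Float(f64),
}

// Returns the parsed number and the unread remainder of the input.
fn read_number(input: &str) -> Option<(Number, &str)> {
    // The "try_decimal_int" part: take the leading run of decimal digits.
    let digits_end = input
        .find(|c: char| !c.is_ascii_digit())
        .unwrap_or(input.len());
    if digits_end == 0 {
        return None; // not a number at all
    }
    let (int_part, rest) = input.split_at(digits_end);

    // Peek one character: a decimal separator means we continue reading a float.
    if rest.starts_with('.') {
        let frac = &rest[1..];
        let frac_end = frac
            .find(|c: char| !c.is_ascii_digit())
            .unwrap_or(frac.len());
        let literal_end = digits_end + 1 + frac_end;
        let value = input[..literal_end].parse::<f64>().ok()?;
        Some((Number::Float(value), &input[literal_end..]))
    } else {
        let value = int_part.parse::<i64>().ok()?;
        Some((Number::DecimalInt(value), rest))
    }
}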
And a serious warning: number literals may be in a localised format (the decimal or thousands separator may be a "." or a "," or a ...), and you do not want to have to maintain such localisations in regexps.
And regexps don't let you skip things like thousands separators directly, as in 1_000_000.
If you added handling for negative numbers as well as exponents, is this > 10 lines of code? I think a one-line regex is actually easier to debug.
If you prefer regex, just use regex and see for yourself if you like it. To each their own. That's a valid way of doing it.
Personally I mostly agree that it's easier to write loops. With regex you also have to write a few lines of code to compile it, handle errors, get the value out of the regex, convert the value to the right type, etc. And it is more restrictive in case you want to add anything else to the logic (count characters while you're scanning, have some exceptions, etc).
In your floating-point-number example, if you're going to parse the number yourself (i.e. get the integer part, the fractional part, the exponent, etc) you're going to write 10+ lines of code regardless of how you do it. If you're just validating and then splitting the prefix and delegating the parsing to another function, then it might make sense to use regex for this. But then, it would make even more sense for that other function to split the prefix itself (unfortunately FromStr isn't designed that way, shame).
Regarding lex, that's different in the sense that the whole lexer is defined in a domain-specific language; there regexes are more valuable, since you can't write loops in a lex file.
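For comparison, the regex route for a float token might look roughly like this, using the regex crate; the pattern and the function name are just an illustration of the compile / match / extract / convert steps described above:

use regex::Regex;

// Returns the parsed value and the unread remainder of the input.
fn read_float_with_regex(input: &str) -> Option<(f64, &str)> {
    // In real code you'd compile this once (e.g. in a std::sync::OnceLock)
    // rather than on every call, and handle the compile error properly.
    let re = Regex::new(r"^-?\d+(\.\d+)?([eE][+-]?\d+)?").expect("invalid regex");
    let m = re.find(input)?;
    let value = m.as_str().parse::<f64>().ok()?;
    Some((value, &input[m.end()..]))
}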
If it's something simple like parsing a float like 123.45, you could take advantage of the fact that take_while() accepts a FnMut closure and create a seen_decimal_place flag.
Otherwise, you can make a simple state machine or break it up into multiple take_while() calls.
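A minimal sketch of that flag trick, with a free-standing take_while() standing in for the Lexer method from earlier:

// Splits `input` into the longest prefix accepted by `predicate` and the rest.
fn take_while(input: &str, mut predicate: impl FnMut(char) -> bool) -> (&str, &str) {
    let end = input
        .char_indices()
        .find(|&(_, c)| !predicate(c))
        .map(|(i, _)| i)
        .unwrap_or(input.len());
    input.split_at(end)
}

fn take_float(input: &str) -> (&str, &str) {
    let mut seen_decimal_place = false;
    take_while(input, |c| {
        if c.is_ascii_digit() {
            true
        } else if c == '.' && !seen_decimal_place {
            // Accept exactly one decimal point; after that the flag makes a
            // second "." end the token instead of being swallowed.
            seen_decimal_place = true;
            true
        } else {
            false
        }
    })
}

So take_float("123.45)") returns ("123.45", ")"), and a stray second "." ends the token instead of being consumed.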
I'm normally okay with writing extra lines of code if it means I can avoid a dependency or have control over things like token spans and error messages (e.g. "expected a digit or '.', but found '$'" is a lot more user-friendly than "syntax error at '$'").
I prefer to handle leading negatives as part of the parser rather than as part of the floating-point literal, but that's not a lexer thing; I just like being able to point at them separately in error messages and have unsigned literals.