Evaluating an expression in a procedural macro call

I am trying my hand at writing a compiler in Rust for my compilers class. The first project is to write a lexer for our language. I came across Logos, which seems to be a popular lexer crate. In Logos, you specify the pattern to match with procedural macro attributes, for example:

#[regex("[a-zA-Z]+")]
Text,

inside of your Token enum.

Flex, the C(++) lexer-generator, allows you to define macros, like so:

TEXT [a-zA-Z]+
%%
{TEXT} {return 1;}

I'm looking to get this functionality with Logos to help make one of my regular expressions a bit easier to understand. The first thing I have tried is code that looks like this:

const COMMENT : &'static str = "^#.*$";

#[derive(Logos, Debug, PartialEq)]
enum Token {
    #[regex(format!("{COMMENT}|[[:space:]]"), logos::skip)]
    #[error]
    Error,
}

However, I get the error:

error: expected literal
   --> src/main.rs:140:13
    |
140 |     #[regex(format!("{COMMENT}|[[:space:]]"), logos::skip)]
    |             ^^^^^^

After reading about procedural macros, this error makes sense as I understand that they take just straight, unevaluated streams of tokens and manipulate them. With that, is there any way to get the compiler to evaluate the format! expression to a literal stream of tokens before passing it to the regex macro? Or, does Logos have a better way to do what I'm trying to do?

You can't evaluate expressions at macro expansion time; there's no way around that (as far as I know).
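
To make that concrete, here is a small std-only sketch (nothing here is Logos API; `tokens_seen!`, `macro_view`, and `evaluated` are names made up for illustration). It contrasts what a macro receives, namely the raw tokens of the `format!` call, with what the same expression produces once it actually runs:

```rust
const COMMENT: &str = "^#.*$";

// Hypothetical helper: turns whatever tokens it is handed into a string,
// without evaluating them. This mimics the "view" a macro has of its input.
macro_rules! tokens_seen {
    ($($t:tt)*) => {
        stringify!($($t)*)
    };
}

// What a macro sees: the unexpanded `format!` call as source tokens.
fn macro_view() -> &'static str {
    tokens_seen!(format!("{COMMENT}|[[:space:]]"))
}

// What the expression evaluates to at run time.
fn evaluated() -> String {
    format!("{COMMENT}|[[:space:]]")
}

fn main() {
    println!("macro sees:   {}", macro_view());
    println!("evaluates to: {}", evaluated());
}
```

The first line printed still mentions `COMMENT` by name, because no evaluation has happened at the point where the macro runs.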

I'm not sure why you are trying to do this. The variant of enum Token already has a name – I wouldn't consider it any "cleaner" if the regex were provided using an external named entity. The name of the variant should be sensible enough so as to communicate clearly what the purpose of the token is. If that is the case, I think you should just go ahead and put the regex in the attribute as a string literal, like the example in the documentation itself.

COMMENT wasn't the best example. A more complicated token that I need to recognize is a STRLITERAL. A string literal is a sequence of 0 or more "string characters" enclosed in double quotes, where a "string character" is either:

  • A backslash followed by n, t, ", or \, OR
  • Any single character other than the newline, double quote, or backslash.
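
To pin the definition down independently of any regex, the two rules can be sketched as a small hand-written check. This is only an illustration of the rules as stated (the function name `is_str_literal` is made up, and it is not how Logos matches anything):

```rust
/// Returns true if `s` is exactly one string literal under the rules above.
fn is_str_literal(s: &str) -> bool {
    let mut chars = s.chars();
    // Must open with a double quote.
    if chars.next() != Some('"') {
        return false;
    }
    loop {
        match chars.next() {
            // Closing quote: valid only if nothing follows it.
            Some('"') => return chars.next().is_none(),
            // Escape: a backslash must be followed by n, t, ", or \.
            Some('\\') => match chars.next() {
                Some('n') | Some('t') | Some('"') | Some('\\') => {}
                _ => return false,
            },
            // A raw newline is not a string character.
            Some('\n') => return false,
            // Any other single character is a string character.
            Some(_) => {}
            // Input ended before the closing quote.
            None => return false,
        }
    }
}

fn main() {
    for candidate in ["\"\"", "\"a+b\"", r#""a\qb""#, "\"unterminated"] {
        println!("{:?} -> {}", candidate, is_str_literal(candidate));
    }
}
```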

The string literal representing the regex that I have come up with to match this token is: "\"((\\\\[nt\"\\\\])|[^\\n\"\\\\])*\"". This is a bit unwieldy.

It would be nice if I could do something like this instead:

QUOTE = "\""
STRCHAR = r"(\\[nt{QUOTE}\\])|[^\n\\{QUOTE}]"
STRLITERAL = r"{QUOTE}({STRCHAR})*{QUOTE}"

I think the second way is more digestible.
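
For what it's worth, plain Rust can already compose string constants in roughly this shape, because the built-in `concat!` expands macro calls in its arguments before joining them. A minimal sketch, where `quote!`, `strchar!`, and `strliteral!` are hypothetical helper macros named after the pseudo-definitions above:

```rust
// Each helper expands to a string literal (or to a `concat!` of literals).
macro_rules! quote {
    () => {
        "\""
    };
}
macro_rules! strchar {
    () => {
        concat!(r"(\\[nt", quote!(), r"\\])|[^\n\\", quote!(), "]")
    };
}
macro_rules! strliteral {
    () => {
        concat!(quote!(), "(", strchar!(), ")*", quote!())
    };
}

// The fully assembled pattern, available as an ordinary constant.
const STRLITERAL: &str = strliteral!();

fn main() {
    println!("{}", STRLITERAL);
}
```

Judging by the "expected literal" error above, the Logos attribute wants a literal token rather than a macro call or a const, so this would not drop into #[regex(...)] as is. It would suit a regex engine that accepts a &str at run time.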

If what you're suggesting is to just make COMMENT and STRCHAR variants of Token, that would be a good idea, but the requirements of the assignment explicitly enumerate what tokens we need to have, and these are not in the list. The lexer is supposed to just ignore comments and only tokenize full string literals. Also, having STRCHAR as its own token would cause actual programming language tokens like + and - to take precedence, even when they're inside double quotes. Then it would become the parser's job to decide that "+" is a string literal, but I want this to happen during lexing. I suppose I may be able to use callbacks to get around this somehow.

I have just learned about the r#""# syntax for raw strings. I think that will make it simple enough to just use one regex for the full string literal.
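
As a sketch of the difference, here is the escaped literal from my earlier post next to a raw-string spelling of the same pattern (the constant names `ESCAPED` and `RAW` are just for this example). With r#"..."# neither the backslashes nor the double quotes need escaping at the Rust level, so what you read is what the regex engine gets:

```rust
// The pattern written as an ordinary string literal: every backslash and
// every double quote has to be escaped for Rust first.
const ESCAPED: &str = "\"((\\\\[nt\"\\\\])|[^\\n\"\\\\])*\"";

// The same pattern as a raw string: only regex-level escapes remain.
const RAW: &str = r#""((\\[nt"\\])|[^\n"\\])*""#;

fn main() {
    // Both spellings denote the identical sequence of characters.
    assert_eq!(ESCAPED, RAW);
    println!("{}", RAW);
}
```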

It’s a bit inflexible and the nesting becomes a bit insane, but it’s somewhat possible to do what you want using existing macros from crates I found. In particular, the with_builtin! macro from the with_builtin_macros crate by @Yandros makes this possible:

use with_builtin_macros::with_builtin;
use logos::Logos;

with_builtin! {
let $quote = concat!("\"") in {
with_builtin! {
let $strchar = concat!(r"(\\[nt", $quote, r"\\])|[^\n\\", $quote, "]") in {
with_builtin! {
let $strliteral = concat!($quote, "(", $strchar, ")*", $quote) in {

    #[derive(Logos, Debug, PartialEq)]
    enum Token {
        #[regex($strliteral)]
        StrLiteral,
        #[error]
        Error,
    }

}}}}}}

I could imagine a more flexible version of such a macro, that enables substitution of functions like concat even in the final body (so you wouldn’t have to give a name to every string that’s going to use $strchar), and allows declaring multiple lets at once without extra indentation.


Edit: The situation can be (syntactically) improved by writing a small “tt-muncher” macro around this, e.g.:

use logos::Logos;
use with_builtin_macros::with_builtin;

macro_rules! concatenations {
    ($(let $dollar:tt $name:ident = ($($t:tt)*)),* in $($body:tt)*) => {
        nestings! {
            [$($body)*][][$([with_builtin!][let $dollar$name = concat!($($t)*) in])*]
        }
    };
}
macro_rules! nestings {
    // reverse everything
    ([$($r:tt)*][$($s:tt)*][$t1:tt$($t:tt)*]) => {
        nestings!{[$($r)*][$t1$($s)*][$($t)*]}
    };
    // build up recursively from inside-out
    ([$($r:tt)*][[$($s1:tt)*] $($s:tt)*][]) => {
        nestings!{[$($s1)* {$($r)*}][$($s)*][]}
    };
    // done
    ([$($r:tt)*][][]) => {
        $($r)*
    }
}

concatenations! {
    let $quote = ("\""),
    let $strchar = (r"(\\[nt", $quote, r"\\])|[^\n\\", $quote, "]"),
    let $strliteral = ($quote, "(", $strchar, ")*", $quote) in

    #[derive(Logos, Debug, PartialEq)]
    enum Token {
        #[regex($strliteral)]
        StrLiteral,
        #[error]
        Error,
    }
}
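
In case the "reverse everything" step looks opaque: it is the usual accumulator trick, where each recursive call moves the first remaining item to the front of a second list. A stripped-down sketch of just that technique, using a made-up `reversed!` macro over plain expressions rather than the nested with_builtin! calls:

```rust
// Hypothetical macro: reverses a comma-separated list of expressions.
// First bracket = items still to process, second bracket = accumulator.
macro_rules! reversed {
    // Nothing left to process: emit the accumulator as an array.
    ([] [$($acc:expr),*]) => {
        [$($acc),*]
    };
    // Move the first remaining item to the front of the accumulator.
    ([$first:expr $(, $rest:expr)*] [$($acc:expr),*]) => {
        reversed!([$($rest),*] [$first $(, $acc)*])
    };
}

fn reversed_demo() -> [i32; 3] {
    reversed!([1, 2, 3] [])
}

fn main() {
    println!("{:?}", reversed_demo());
}
```

The nestings! macro above does the same kind of shuffling on token trees, then a second pass wraps the body in one layer per item.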

In this case, may I suggest using a PEG (parsing expression grammar) instead of separate lexer and parser stages? Pest is quite a convenient implementation.

I've contacted my professor and he is cool with this. Looks like a great tool!

Yeah, I'm planning on adding "batched" invocations since the nesting is cumbersome, and this "macro-rules batcher frontend" layer has already come up on more than one occasion. Soon™

I think I might fully embrace OCaml's syntax and allow "and" instead of "in" to chain these:

with_builtin! {
  let $quote = concat!("\"") and
  let $strchar = concat!(r"(\\[nt", $quote, r"\\])|[^\n\\", $quote, "]") and
  let $strliteral = concat!($quote, "(", $strchar, ")*", $quote) in {
    #[derive(Logos, Debug, PartialEq)]
    enum Token {
        #[regex($strliteral)]
        StrLiteral,
        #[error]
        Error,
    }
  }
}

or maybe just:

with_builtin! {
    #![let $quote = concat!(…)]
    #![let $strchar = concat!(…)]
    #![let $strliteral = concat!(…)]

    #[derive(Logos, Debug, PartialEq)]
    enum Token {
        #[regex($strliteral)]
        StrLiteral,
        #[error]
        Error,
    }
}

I don't know if this will work for the OP, because I haven't tried, but often you can get away with evaluating a const, and then referring to the const in your generated code.

I've generally only used this for side effects, like generating errors when the const is evaluated, so I haven't actually tried to see how it works in any kind of expression context. But I have done things like this, where foos is some vec of expressions:
#( #[allow(clippy::unused_unit)] const _: () = if #foos.is_none() { () } else { panic!("oh-no") }; )*
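
Written out by hand for a single case, the generated code would look something like the sketch below. The check on `COMMENT` and the `COMMENT_LEN` constant are only stand-ins I picked for illustration. Note that the const is evaluated after macro expansion, so this can validate a value at compile time but cannot turn it into a literal for an attribute:

```rust
const COMMENT: &str = "^#.*$";

// Evaluated during compilation: if the condition were true, the build
// would fail with the panic message instead of producing a binary.
const _: () = if COMMENT.is_empty() {
    panic!("COMMENT must not be empty")
};

// Consts can also feed other consts, all computed at compile time.
const COMMENT_LEN: usize = COMMENT.len();

fn main() {
    println!("COMMENT has {} bytes", COMMENT_LEN);
}
```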
