Logos: Lexer matches incorrect token

NOTE: sorry to crosspost issue #279 here but, since the repo has received very little attention from its owner in the past months / years, I hope to receive more help here :slight_smile:

Hello! For context, I am writing a TeX file parser, and I use Logos 0.12.1 to create the lexer.

Here is a reduced version of my Token implementation:

use logos::Logos;

#[derive(Logos, Debug)]
pub enum Token {
    #[token(r"\")]
    Backslash,
    #[token(r"\\")]
    DoubleBackslash,
    #[token(r"\begin")]
    EnvironmentBegin,
    #[token(r"\end")]
    EnvironmentEnd,
    //#[token(r"\begin{document}")] // <- the part that creates problems
    DocumentBegin,                  // <-
    #[regex(r"\\[a-zA-Z]+")]
    MacroName,
    #[error]
    Error,
}

When I run my test suite (reduced version below), it works fine. However, when I add the enum variant DocumentBegin, which should not be matched by anything in the test suite, the token_environment_begin test fails because \begin is now lexed as Backslash. This does not make sense to me, since Backslash should only match a single \, and nothing else.

If I comment out the Backslash token, then \begin matches MacroName, but not EnvironmentBegin.

After reading the documentation and testing different versions of Logos, I still cannot understand why this does not work as expected.

Can someone help me with this?

Test suite

#[cfg(test)]
mod tests {
    use super::*;
    use logos::Logos;

    macro_rules! assert_token_positions {
        ($source:expr, $token:pat, $($pos:expr),+ $(,)?) => {
            let source = $source;

            let positions: Vec<std::ops::Range<usize>> = vec![$($pos),*];
            let spanned_token: Vec<_> = Token::lexer(source)
                .spanned()
                .filter(|(token, _)| matches!(token, $token))
                .collect();


            let strs: Vec<_> = Token::lexer(source)
                .spanned()
                .map(|(token, span)| (token, source[span].to_string()))
                .collect();

            assert_eq!(
                spanned_token.len(), positions.len(),
                "The number of tokens found did not match the expected number of positions {strs:?}"
            );

            for (pos, (token, span)) in positions.into_iter().zip(spanned_token) {
                assert_eq!(
                    pos,
                    span,
                    "Token {token:#?} was found, but expected at {pos:?}"
                );
            }
        };
    }

    #[test]
    fn token_backslash() {
        assert_token_positions!(r"Should match \+, but not \\+", Token::Backslash, 13..14,);
    }
    #[test]
    fn token_double_backslash() {
        assert_token_positions!(
            r"Should match \\, but not \",
            Token::DoubleBackslash,
            13..15,
        );
    }
    #[test]
    fn token_environment_begin() {
        assert_token_positions!(r"\begin{equation}", Token::EnvironmentBegin, 0..6,);
    }
    #[test]
    fn token_environment_end() {
        assert_token_positions!(r"\end{equation}", Token::EnvironmentEnd, 0..4,);
    }
    #[test]
    fn token_macro_name() {
        assert_token_positions!(
            r"\sin\cos\text{some text}\alpha1234",
            Token::MacroName,
            0..4,
            4..8,
            8..13,
            24..30,
        );
    }
}

Test outputs

Without DocumentBegin variant

running 5 tests
test tests::token_environment_begin ... ok
test tests::token_backslash ... ok
test tests::token_environment_end ... ok
test tests::token_double_backslash ... ok
test tests::token_macro_name ... ok

test result: ok. 5 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

With DocumentBegin variant

running 5 tests
test tests::token_double_backslash ... ok
test tests::token_backslash ... ok
test tests::token_environment_end ... ok
test tests::token_macro_name ... ok
test tests::token_environment_begin ... FAILED

failures:

---- tests::token_environment_begin stdout ----
thread 'tests::token_environment_begin' panicked at 'assertion failed: `(left == right)`
  left: `0`,
 right: `1`: The number of tokens found did not match the expected number of positions [(Backslash, "\\begin"), (Error, "{"), (Error, "e"), (Error, "q"), (Error, "u"), (Error, "a"), (Error, "t"), (Error, "i"), (Error, "o"), (Error, "n"), (Error, "}")]', src/main.rs:80:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


failures:
    tests::token_environment_begin

test result: FAILED. 4 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

With DocumentBegin variant, without Backslash variant

NOTE: token_backslash test is now ignored (since we removed the variant)

running 5 tests
test tests::token_backslash ... ignored
test tests::token_environment_end ... ok
test tests::token_macro_name ... ok
test tests::token_environment_begin ... FAILED
test tests::token_double_backslash ... ok

failures:

---- tests::token_environment_begin stdout ----
thread 'tests::token_environment_begin' panicked at 'assertion failed: `(left == right)`
  left: `0`,
 right: `1`: The number of tokens found did not match the expected number of positions [(MacroName, "\\begin"), (Error, "{"), (Error, "e"), (Error, "q"), (Error, "u"), (Error, "a"), (Error, "t"), (Error, "i"), (Error, "o"), (Error, "n")]', src/main.rs:81:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


failures:
    tests::token_environment_begin

test result: FAILED. 3 passed; 1 failed; 1 ignored; 0 measured; 0 filtered out; finished in 0.00s

The generated impl likely tries to match in order of declaration of the variants. Since you put the shortest tokens first, and \ is a prefix of \\ and \begin, that means anything starting with \ will match the short Backslash token first, instead of waiting for the complete token.

To fix this, it's standard practice to attempt to match longest tokens first.

Which Logos claims it does.

You could play around with setting up different priorities for your tokens (e.g. #[token(r"\begin", priority = 12)] or something like that), but this doesn't look right TBH. In your test, if you add a space between \begin and {equation}, \begin is also being parsed as MacroName.

Maybe open an issue in the logos repo?
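
For reference, the longest-match rule discussed above can be sketched by hand with the standard library only. This is not how Logos is implemented and it does not use Logos at all; Kind, LITERALS, macro_name_len, longest_match and lex are hypothetical names made up for this illustration. It only shows the behaviour the tests in the question expect: the longest candidate wins, and on a tie a fixed string beats the \\[a-zA-Z]+ pattern.

```rust
use std::ops::Range;

/// Token kinds mirroring the enum in the question (hand-written, not generated by Logos).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Kind {
    Backslash,
    DoubleBackslash,
    EnvironmentBegin,
    EnvironmentEnd,
    DocumentBegin,
    MacroName,
    Error,
}

/// Fixed strings paired with the kind they produce (same literals as in the question).
const LITERALS: [(&str, Kind); 5] = [
    (r"\", Kind::Backslash),
    (r"\\", Kind::DoubleBackslash),
    (r"\begin", Kind::EnvironmentBegin),
    (r"\end", Kind::EnvironmentEnd),
    (r"\begin{document}", Kind::DocumentBegin),
];

/// Byte length of a `\[a-zA-Z]+` match at the start of `input`, if there is one.
fn macro_name_len(input: &str) -> Option<usize> {
    let rest = input.strip_prefix('\\')?;
    let letters = rest.bytes().take_while(|b| b.is_ascii_alphabetic()).count();
    if letters == 0 {
        None
    } else {
        Some(1 + letters)
    }
}

/// Kind and byte length of the longest match at the start of `input`.
fn longest_match(input: &str) -> Option<(Kind, usize)> {
    // Start with the pattern; a literal replaces it when it is at least as long.
    let mut best = macro_name_len(input).map(|len| (Kind::MacroName, len));
    for &(text, kind) in LITERALS.iter() {
        if input.starts_with(text) && best.map_or(true, |(_, len)| text.len() >= len) {
            best = Some((kind, text.len()));
        }
    }
    best
}

/// Splits `source` into (kind, byte span) pairs using the longest-match rule.
pub fn lex(source: &str) -> Vec<(Kind, Range<usize>)> {
    let mut tokens = Vec::new();
    let mut pos = 0;
    while pos < source.len() {
        let (kind, len) = match longest_match(&source[pos..]) {
            Some(found) => found,
            // No rule applies: emit a one-character Error token and move on.
            None => {
                let width = source[pos..].chars().next().map_or(1, char::len_utf8);
                (Kind::Error, width)
            }
        };
        tokens.push((kind, pos..pos + len));
        pos += len;
    }
    tokens
}
```

With this sketch, lex(r"\begin{equation}") starts with an EnvironmentBegin token at 0..6, while \beginning as a whole is a MacroName because the pattern match is longer than the \begin literal.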

I already tried to play with the priority, without any success.

But here, it is clear that Backslash should not match anything other than r"\".

I did but, as mentioned, the crate seems stalled, so I was hoping to get more help on this forum :slight_smile:

Yes, indeed, but the whole point is to differentiate between the two :slight_smile:

I can actually work without the DocumentBegin variant quite easily, and even without EnvironmentBegin or EnvironmentEnd. But having a specific variant for each of them makes the rest of the implementation much simpler.
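
A minimal sketch of that fallback, assuming the lexer only keeps the MacroName rule and the distinction is made afterwards on the matched slice. Command, classify and opens_document are hypothetical names for this illustration, not part of Logos; the slice would be whatever text the lexer reports for a MacroName token.

```rust
/// What a macro-name token turns out to be once its text is inspected.
#[derive(Debug, PartialEq, Eq)]
pub enum Command<'a> {
    EnvironmentBegin,
    EnvironmentEnd,
    /// Any other macro, keeping the original text (backslash included).
    Other(&'a str),
}

/// Classifies the text of a token matched by `\[a-zA-Z]+`.
pub fn classify(slice: &str) -> Command<'_> {
    match slice {
        r"\begin" => Command::EnvironmentBegin,
        r"\end" => Command::EnvironmentEnd,
        other => Command::Other(other),
    }
}

/// Returns true when `rest`, the input right after a `\begin` token,
/// opens the document environment.
pub fn opens_document(rest: &str) -> bool {
    rest.starts_with("{document}")
}
```

The cost is the one mentioned above: the parser has to call something like classify on every macro name instead of matching directly on a dedicated variant.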