It's impossible to create an LR(1) parser for Rust

I'm currently creating a parser for rust using Menhir and Coq, and I noticed that the shift/reduce conflict error when it comes to inner and outer attributes is impossible to solve. This is an example

#[allow(unused_variables)]
mod my_module {
    #![allow(dead_code)]
    fn hidden_function() {}
}

In order for a parser to be considered LR(1), it should be able to look at the 1 token after it and be able to decide whether to shift (move forward to the next token), or reduce (turn the tokens in it's scope in a node for the AST).

For example, if the compiler encounters the unsafe keyword, then it will know to shift, as there is more data it needs to figure out before making a node in the AST. However, if it encounters a semicolon, then it knows to reduce, as the semicolon indicates that there are no more tokens to shift to in order to create a node.

We can see that the top level module has a outer attribute, and because it isn't nested in another item, the parser does not expect an inner attribute yet. That means that it is obvious to the parser that a # must mean outer attribute. However, once we get into nested items, this becomes impossible. If the parser encounters a #, then it is not able to immediately tell whether it will be reading an outer attribute or an inner attribute. It must read !, or lack thereof, in order to figure out whether it will be reading an outer attribute or an inner attribute. This gives it an LR(2), as it must read 2 tokens ahead to know whether to shift or reduce.

This minor issue could be solved by making inner and outer attributes start with different tokens.

Here, the inner attribute starts with an exclamation point (!), which makes it LR(1), as there is no need to look to the next token for the parser to know what rule to apply.

#[allow(unused_variables)]
mod my_module {
    !#[allow(dead_code)]
    fn hidden_function() {}
}

Is my conclusion correct? What do you think? I don't believe there is a need for this to be changed, as I can easily write a formal proof to prove that this LR(2) does not create undefined behavior.

Not sure if I follow. As far as I can understand, free-standing # will always shift, since to reduce parser has to read the following brackets. If this # is followed by !, it will shift again, expecting to read the brackets and then reduce to the inner attribute. If it is followed by brackets directly, it will reduce right away, this time to the outer attribute. I guess there's some kind of flaw in this description, but what it is exactly?

And by the way:

This is incorrect - inner attributes at the top level are syntactically valid: they may refer to the whole crate, for example.

You're right, inner attributes can be applied at the top level. I think this Menhir conflict is better at explaining this. My code is here. There is no README yet, but you can run menhir --coq --explain --dump Parser.vy to see this.

** Conflict (shift/reduce) in state 45.
** Token involved: HASH
** This state is reached from program after reading:

outer_attrs MOD ident LBRACE inner_attrs

** The derivations that appear below have the following common factor:
** (The question mark symbol (?) represents the spot where the derivations begin to differ.)

program 
items EOF 
item items 
outer_attrs vis_item 
            unsafe_module 
            safe_module 
            (?)

** In state 45, looking ahead at HASH, shifting is permitted
** because of the following sub-derivation:

MOD ident LBRACE inner_attrs items RBRACE 
                 inner_attrs . HASH inner_attr 

** In state 45, looking ahead at HASH, reducing production
** outer_attrs ->
** is permitted because of the following sub-derivation:

MOD ident LBRACE inner_attrs items RBRACE 
                             item items 
                             outer_attrs vis_item 
                             outer_attrs HASH outer_attr // lookahead token appears
                             .

HASH refers to #. As you can see, the parser is not sure whether it will be reading an inner attribute or outer attribute, so it throws this shift/reduce conflict.

What happens if you were to do outer_attrs nested the other way around?

turning

| outer_attrs HASH outer_attr

into

| HASH outer_attr outer_attrs

Just a curious question: why do you need to distinguish between inner and outer attributes at the parsing level? That's typically something I'd process at the semantic level, managing the scope when entering/leaving a mod or a file.

Moreover, attributes can be attached to many items, so I'd just let them be an optional non-terminal in front of those items (EDIT: to avoid any confusion, when I write "items", I'm referring to the word's meaning, not to what is defined as "items" in your grammar).

If you tokenize #! as a single token then the issue disappears altogether. But I think even when they are separate tokens it's possible to parse these attributes using an LR(1) parser, maybe it's just the way you wrote the grammar that creates the conflict.

1 Like

Btw, keep in mind that because of raw string literals, Rust's grammar is not context-free: rust/src/grammar/raw-string-literal-ambiguity.md at d046ffddc4bd50e04ffc3ff9f766e2ac71f74d50 · rust-lang/rust · GitHub

3 Likes

It looks like a problem for the lexical analysis, not the syntax analysis.

I tried that before, and got the same thing

Interesting. I would love to see the error message then ^^

Correction, similar thing. However, I think @tczajka gave the obvious solution being to lex #! as a single token


** Conflict (shift/reduce) in state 46.
** Token involved: HASH
** This state is reached from program after reading:

outer_attrs MOD ident LBRACE

** The derivations that appear below have the following common factor:
** (The question mark symbol (?) represents the spot where the derivations begin to differ.)

program 
items EOF 
item items 
outer_attrs vis_item 
            unsafe_module 
            safe_module 
            (?)

** In state 46, looking ahead at HASH, reducing production
** inner_attrs ->
** is permitted because of the following sub-derivation:

MOD ident LBRACE inner_attrs items RBRACE // lookahead token appears because items can begin with HASH
                 . 

** In state 46, looking ahead at HASH, shifting is permitted
** because of the following sub-derivation:

MOD ident LBRACE inner_attrs items RBRACE 
                 . HASH inner_attr inner_attrs

Sorry, I misunderstood. I thought both attributes had the same syntax and that you wanted to separate them based on what they were attached to.

Why don't you try to use a single attr nonterminal with an optional EXCLAMATION in it? Again, that's something I'd process during the semantical analysis.

This does sound like an interesting proposition. The only reason why I distinguish between them is because the Rust reference does to.

I do think that this would work, but I think the best solution would be to make #! into one token

That's also an option.

In my experience, it's often required to transform the language grammar that's provided in references. The key is to decide what to do in the lexical, syntax, and semantical analysis phases. It's also a matter of balance: in the lexical part, you'll increase the lexer tables; in the syntax, you'll increase the parser table and it'll require you to process more rules in the semantical analysis; in the latter, you'll need to add a check to decide if it's an outer attribute and make sure it's above other declarations.

That's not quite what I suggested. I suggested flipping the nesting for outer_attrs, not for inner_attrs :wink:


As for making #! a single token, that's at least not accurate to what actual Rust syntax looks like where these tokens are pretty clearly separate.

3 Likes

Your suggestion somehow removed all conflicts!
This is what fixed it.

outer_attrs:
  | /* empty */                             { [] }
  | HASH outer_attr outer_attrs                 { $2 :: $3 }

outer_attr:
  | LBRACK a = attr RBRACK             { Cabs.OUTER_ATTRIBUTE a }

inner_attrs:
  | /* empty */                             { [] }
  | inner_attrs HASH inner_attr                  { $3 :: $1 }

inner_attr:
  | EXCLAMATION LBRACK a = attr RBRACK      { Cabs.INNER_ATTRIBUTE a }

The reason why this issue is solved is because one is left recursive while the other is right? Though, I'm not sure how that would solve it

There's also the shebang... :wink: (and in that one you can't split the #!)

Ok, that definitely changes things. Luckily, I was somehow able to fix all conflicts due to @steffahn 's suggestion for whatever reason. For now, having separate parser tables for inner and outer attributes would be the simplest, though it may be the cases your suggestion works better.

I must say, I didn't know about that one before today.

It's pretty annoying since it doesn't require anything special after the shebang. Just this:

As an exception, if the #! characters are followed (ignoring intervening comments or whitespace) by a [ token, nothing is removed. This prevents an inner attribute at the start of a source file being removed.

So the inner attribute is separable by spaces & comments, but not the shebang. In other words, you have something like this, where the spacing means that the symbols could be separated by spaces and comments:

#!  (not [)     -> discard the whole line
#  !  [         -> inner
#  [			-> outer

I think it could be possible to process that at the lexical level with states (or modes, depending on how the lexer calls it—I don't know if your tool has that feature), and by rejecting spaces and comments in the appropriate modes. You'd get specific terminals for # and #! attributes, and the shebang would ideally be discarded by the lexer.

Or you can do that in the grammar, but you still need separate terminals for # and #! because the shebang doesn't permit any space in between, and you won't be able to tell at the syntax level.