It's impossible to create an LR(1) parser for Rust

IDK, I just up read stuff about LR parsers to freshen up on the topic[1] and then intuited what the parser state actually looks like and why - the error message ended up very useful to be able to do this, containing a lot of the relevant details already (in slightly cryptic form) :slight_smile:

Basically, you want to be able to start building up the outer_attr or the inner_attr from parts of it already being on the parser stack (thus making # legal to be shifted), which means the single attr must be sandwiched between the previous known-inner attrs, and subsequent known-outer attrs in a way I'm which it can be reduced later.

That basically is what motivates the nesting or outer* outer inner inner* so that when you're at outer* yet_unknown ...unparsed and move on to complete it to outer* outer ...unparsed then reduce the outer* outer into a single outer* or complete it to outer* inner ...unparsed, moving on smoothly into "already being in the middle of the first item" which only works if you need none of these virtual empty inner* before it..

..thinking about it now once again, I wonder:

maybe it also works, instead of changing the nesting, to just turn the attr*-style structure into more of a (attr+)? structure? This could also avoid the need for this virtual initial emptly-list-of-inner-attr early reduction, AFAICT.

E.g. if

outer_attrs:
  | /* empty */
  | outer_attrs HASH outer_attr 

became something like

outer_attrs:
  | /* empty */                             { [ ] }
  | inner_attrs_non_empty                   { $1 }
outer_attrs_non_empty:
  | outer_attr                              { [ $1 ] }
  | outer_attrs_non_empty HASH outer_attr   { $3 :: $1 }

This still – again – would be a change inconsistently applied though, as the same thing on inner_attrs would break it again. (But at least the [reversed] oderings are consistent again :innocent:)


Still just intuiting, but for a more “consistent” alternative solution, treating the two the same, maybe it can work to include the optionality of the construct into the surrounding construct… e.g. only have never-empty lists of attributes

outer_attrs:
  | outer_attr
  | outer_attrs HASH outer_attr

outer_attr:
  | LBRACK attr RBRACK

inner_attrs:
  | inner_attr
  | inner_attrs HASH inner_attr

inner_attr:
  | EXCLAMATION LBRACK attr RBRACK

and then do stuff like

item:
  | vis_item
  | outer_attrs vis_item

expression:
  | expression_no_block
  | outer_attrs expression_no_block

and

safe_module:
  | MOD ident SEMI
  | MOD ident LBRACE items RBRACE
  | MOD ident LBRACE inner_attrs items RBRACE

unsafe_module:
  | safe_module
  | UNSAFE MOD ident SEMI
  | UNSAFE MOD ident LBRACE items RBRACE
  | UNSAFE MOD ident LBRACE inner_attrs items RBRACE

That’s just me wondering though… it probably just unnecessarily complicates stuff… well –– I guess it could be worth considering if you prefer the (reversed) order to be consistent between these attribute-lists; in which case you’d presumably only need it for outer_attrs –– for which I already determined that this sort of "inlining" aspect of it shouldn’ŧ actually be necessary, ah, oh well… that’s just back to the first suggestion for alternatives I have made in this reply, anyway :sweat_smile:


  1. and learn some more things; and become annoyed about the hard-to-approach nature of most material about the topic online ↩︎

I noticed that the shebang is only allowed on the first line, meaning I can use a special mode that will lex the first line, and if it encounters a \n, space, or what not, It'll switch to the normal mode

Something else to consider, which can be very annoying when parsing Rust, is the content of the attributes and macros. They can contain any chain of terminals (tokens), but brackets/parentheses/square brackets are always in pairs. And comments are removed, on top of that. So I'm wondering if that part can be considered context-free.

I sounds like the lexer must be tweaked to handle those cases.

Yeah, I think that the first suggestion is the most practical

There is no need to treat this as a single token, but it's certainly a possibility. You can easily have a tokenizer that processes whitespace (or even comments) inside tokens. After all, here is whitespace that gets removed inside a string token:

    let a = "abc\
        def";

Fortran even allows whitespace that gets ignored inside identifiers.

1 Like

There is still the issue of the shebang.

The exact rules rustc uses are:

1 Like

It's just a simple state machine, in fact, so explicit lexer states are probably not even necessary.

This mock-up would work with ANTLR (attributes must be empty):

grammar Rust;

fragment SPC_COMMENTS:		[ \t] | '/*' .*? '*/';
fragment SPC_LN_COMMENTS: 	[ \t\n\r] | '/*' .*? '*/';

SHEBANG:		'#!' SPC_COMMENTS* ~[ \t[] ~[\r\n]*;
INNER_ATTR:		'#' SPC_LN_COMMENTS* '!' SPC_LN_COMMENTS* '[';
OUTER_ATTR:		'#' SPC_LN_COMMENTS* '[';

LBRACKET:		'{';
RBRACKET:		'}';
RSBRACKET:		']';

MOD:			'mod';
FN:				'fn';

ID: 			[a-z][a-z0-9_]*;

SPACES:			SPC_LN_COMMENTS+ -> skip;

program:
	SHEBANG? item* EOF;

item:
	inner_attr* mod;

inner_attr:
	INNER_ATTR RSBRACKET;

mod: MOD ID LBRACKET mod_item* RBRACKET;

mod_item:
	outer_attr* fn;

outer_attr:
	OUTER_ATTR RSBRACKET;

fn:
	FN ID LBRACKET RBRACKET;

Example of text:

#! /bin/rust

#![]
mod a {
	#[]
	fn a { }
}

Notes:

  • The upper part is the lexer and the bottom part the parser, obviously (ANTLR allows to put both in the same file).
  • ~[ \t[] means any character not in the set, so anything but a space, a tab, or a '['.
  • fragment is just a lexer alias, not a rule; in some lexers, you'll have to copy its definition wherever it's used.
  • This doesn't check that the shebang is on the first line; there's an ANTLR idiom for that, but I preferred not to use it here. Most lexers and parsers allow to check the position of the text that comes with the tokens, so it's one possibility if that's important.
1 Like

This is a good reference for when I write it in Menhir

1 Like

Yes, sorry that I don't know that tool. If something's not clear, don't hesitate to ask. :slight_smile:

The challenging part will be the content of those attributes...