Generating code: reading a specification from a file, or using a proc-macro?

Hi!

I'm currently experimenting with scanner and parser generators in Rust, and I'm struggling to choose a way to specify the scanners/parsers:

  • Use a specification file, like lex, yacc and bison. This allows for great flexibility in the syntax and can be processed from a build script, but it makes validating and manipulating Rust code snippets difficult.
  • Use a mix of function-like and attribute procedural macros. This makes parsing and validating Rust code easier (using syn & quote), but in my experience it makes the algorithms run slower. It's also a lot more convenient, since everything is integrated directly into the source code, but the syntax could be awkward, and I'm afraid that such heavy macro use won't play nice with Rust Analyzer or IDEs.

Right now I'm using the first solution: reading a specification file with a custom syntax, barely validating Rust snippets, manipulating symbols as interned strings, and generating code with proc-macro2 and quote. But this solution doesn't satisfy me: it involves a lot of transformations (String to TokenStream and then back to String to write to a file), and it's not very practical (mixing a custom syntax with Rust snippets is quite hard to parse).
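For illustration, the flow looks roughly like this (a minimal sketch; the file names and the Token type are placeholders, and the spec parsing is elided):

```rust
// build.rs (sketch): parse the spec file, build a TokenStream with
// quote, and convert to a String only once, when writing to OUT_DIR.
use quote::quote;

fn main() {
    // ... read and parse the specification file here ...
    let generated = quote! {
        // the generated scanner would go here; `Token` is a placeholder
        pub fn next_token(input: &str) -> Option<Token> {
            None
        }
    };
    let out_dir = std::env::var("OUT_DIR").unwrap();
    let dest = std::path::Path::new(&out_dir).join("scanner.rs");
    std::fs::write(dest, generated.to_string()).unwrap();
}
```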

So I want to rewrite this, but I'm not sure which path to take: should I stick with an external file but improve the parsing (if so, does anyone have any suggestions?), or should I switch to proc-macros?

Does this mean you're making one or trying to use an existing one?
If you're trying to use one, you might be interested in looking into nom, which is a parser combinator library and, I daresay, quite a bit easier to use for most parsing applications. If you really want to use a parser generator, I have found pest to be pretty good.
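To show what the combinator style looks like, here's a minimal sketch using nom's tag and alt combinators:

```rust
use nom::{branch::alt, bytes::complete::tag, IResult};

// In nom, a parser is just a function; combinators like `alt`
// build bigger parsers out of smaller ones.
fn greeting(input: &str) -> IResult<&str, &str> {
    alt((tag("hello"), tag("hi")))(input)
}

fn main() {
    // On success, nom returns the remaining input and the matched value.
    assert_eq!(greeting("hello world"), Ok((" world", "hello")));
}
```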

I mean that I'm making one! I know about parser combinators, but I'm interested in (almost) completely automatic parser generation in Rust (and in the theory behind it); that's why I'm rolling my own.

Ah, cool! Well, pest uses a proc-macro. The macro takes a PEG file path as input and is applied to an empty struct. All the other relevant methods are generated by the macro.
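Roughly, using it looks like this (a sketch; grammar.pest and the ident rule are placeholder names):

```rust
use pest::Parser;
use pest_derive::Parser;

// The derive macro reads grammar.pest at compile time and generates
// a `Rule` enum plus the parsing methods for this empty struct.
#[derive(Parser)]
#[grammar = "grammar.pest"] // path is relative to src/
struct MyParser;

fn main() {
    // `Rule::ident` assumes the grammar file defines a rule named `ident`.
    match MyParser::parse(Rule::ident, "some_input") {
        Ok(pairs) => println!("{pairs:?}"),
        Err(e) => eprintln!("{e}"),
    }
}
```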

I didn't quite get this point. Aren't parser generators by definition "automatic", i.e., given a grammar spec, they generate some sort of parsing code in some language?
Also, while we're on the topic of parsers, what sort of parsing algorithm are you interested in (if you don't mind sharing)? Some are easier than others as far as Rust goes (probably true in general, but more so in Rust).

pest's idea is interesting and clever; I might take some inspiration from it.

That's right! That's why it's considered automatic: it's much easier than writing your parsers by hand. But I consider writing the grammar to be half the work, because once you have a clear grammar, writing a classic recursive-descent parser is dead simple.

For my scanner generator, I've implemented the DFA construction. As for my parser generator, I've implemented an LR(1) generator (I might implement LALR(1) and GLR in the future). If you want more detail, I'm implementing all of this as part of a self-contained "compiler toolchain" library I'm writing.
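To give an idea of the moving parts, this is roughly the shape of the data an LR(1) generator manipulates (the names are my own sketch, not from any particular library):

```rust
#[derive(Clone, PartialEq, Eq, Hash)]
enum Symbol {
    Terminal(String),
    NonTerminal(String),
}

/// An LR(1) item: a production, a dot position in its right-hand
/// side, and one terminal of lookahead. Sets of these items become
/// the states of the parsing automaton.
#[derive(Clone, PartialEq, Eq, Hash)]
struct Lr1Item {
    production: usize, // index into the grammar's production list
    dot: usize,        // number of RHS symbols already matched
    lookahead: Symbol, // terminal expected after reducing
}
```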

When I wrote something like this for my compilers course, I found a middle ground between these two that seemed to work pretty well: all of the grammar rules appeared as the argument of a macro, and I wrote a build script that produced the (declarative) macro definitions by scanning the whole codebase.

As I recall, my version only supported LL(1) grammars, and the build script defined macros for first! and follow!; everything else was written normally. I unfortunately don't remember any more details.

Edit: I found some of the documentation, which explains the setup a little more clearly. There were two macros that defined methods on a Parser struct: terminal!(ident), which registered a terminal rule for a lexer token, and nonterminal!(ident -> type { rules … }), which described a nonterminal rule:

/// Define a method to parse a nonterminal from the input.
/// Dispatches to the production!() macro for processing the individual
/// rules, so requires ll1_dispatch!() to be properly defined.
///
/// In addition to the normal macro expansion, the
/// LL1 table generator scans this file for invocations
/// of the terminal and nonterminal macros to discover
/// the grammar.
///
/// The body of the macro call contains
/// SymbolName -> Type { ... }
/// where the ... is one or more production rules.
///
/// Each rule takes the form
/// (sym1 sym2 ... symn) ==>> |pat1, pat2, ... patn| {...};
///
/// If the LL1 table decides that the production should apply,
/// the parse method for each of the symbols sym_i is called by
/// the macro, and each parse result is stored in the variables
/// mentioned in the corresponding pattern pat_i before the
/// action block is executed. The action must evaluate to the
/// same type as declared for this nonterminal and will be
/// provided to methods higher up the call stack.
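A made-up invocation following that description might look like this (the symbols and the action body are purely illustrative, not from the actual project):

```rust
nonterminal!(Expr -> i64 {
    // If the LL(1) table selects this production, Term and ExprTail
    // are parsed in order and their results bound to t and tail.
    (Term ExprTail) ==>> |t, tail| { t + tail };
});
```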

Not all (but surprisingly many) grammars can be handled with recursive descent. In fact, most parsers require the grammar to be transformed to suit their shortcomings. ANTLR, which uses an LL(*) algorithm and automatic left-recursion elimination, in my experience accommodates grammars almost in the way we usually write them. (I mean that we normally think expr -> expr + expr | expr - expr | expr * expr | expr / expr | (expr). However, this has left recursion, so it is not usable by any LL parser.)
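For completeness, the standard workaround for LL parsers is to eliminate the left recursion, which recursive descent can then express as a loop. A minimal sketch, with the grammar reduced to addition over single digits:

```rust
// Left-recursive:    expr -> expr '+' term | term
// After elimination: expr -> term expr_tail
//                    expr_tail -> '+' term expr_tail | ε

struct Tokens {
    chars: Vec<char>,
    pos: usize,
}

impl Tokens {
    fn eat(&mut self, c: char) -> bool {
        if self.chars.get(self.pos) == Some(&c) {
            self.pos += 1;
            return true;
        }
        false
    }
    // `term` reduced to a single digit for the sake of the sketch
    fn term(&mut self) -> i64 {
        let d = self.chars[self.pos].to_digit(10).unwrap() as i64;
        self.pos += 1;
        d
    }
}

// expr -> term expr_tail, with expr_tail written as iteration
fn expr(t: &mut Tokens) -> i64 {
    let mut acc = t.term();
    while t.eat('+') {
        acc += t.term();
    }
    acc
}

fn main() {
    let mut t = Tokens { chars: "1+2+3".chars().collect(), pos: 0 };
    assert_eq!(expr(&mut t), 6);
}
```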
