Parsel, a zero-code parser generator

Today I released Parsel, a "zero-code" parser generator.

What?

Parsel is a set of #[derive] macros and helper types for generating ready-to-use parsers very easily and quickly.

Why?

The Rust ecosystem already offers a wide variety of parsing crates, so why would you need yet another one? There are basically two main approaches to parser generators in Rust:

  1. Parser combinators. These are implementations of the classic design from functional programming languages. They allow programmers to write the parsing logic itself seamlessly. However, one would still need to define strongly typed AST nodes, and then map the low-level output of the combinators manually. In addition, the grammar is implicit in the procedural part of the code (i.e., the implementation of the parsers itself).
  2. Traditional and macro-based parser generators (such as Pest), which make the grammar explicit. However, they still suffer from the need to map parsed values to domain-specific AST node types after the fact. (This is especially apparent in the case of Pest, which explicitly encourages the use of unwrap() on the nodes of the generated raw parse tree.)

Parsel solves all of these problems by letting your custom AST node types be the ground truth for the grammar, from which it automatically derives implementations of the syn::Parse and quote::ToTokens traits.

How?

The way this works is as follows:

  • struct types will implement both traits by parsing and printing each field in sequence, in the order of declaration. This behavior corresponds to simple sequencing/concatenation in a grammar.
  • enum types will implement Parse by attempting to parse each variant in order, and returning the result of the first one that succeeds. ToToken will be implemented on enums in the obvious way, by forwarding to the fields of the currently active variant. This behavior corresponds to alternation of productions in a grammar.

What about parsing more complicated productions, such as arbitrary repetition, parentheses/grouping, and optional sub-productions? Parsel provides a rich set of generic helper types for each such common task. You can parameterize these helper types on the sub-productions of your grammar, and then directly embed them into higher-level AST node types in order to support deriving all kinds of complicated parsers.

Usage Examples

See the JSON parser example in the official docs.

Other Features, Highlights

  • Deriving FromStr and Display implementations by forwarding to Parse and ToTokens, respectively.
  • Built-in heuristics for improving error messages of (potentially speculative) parsing of alternation. Currently, when the parsing of all variants of an enum fails, Parsel returns the error message corresponding to the production that got furthest in the underlying token stream. It is likely that this production/enum is the one that was intended by the author of the input being parsed.
  • AST node helper types for parsing…:
  • AST node helper types (LeftAssoc and RightAssoc) for avoiding infinite left recursion and deep right recursion when trying to naïvely parse binary operators. These helpers use iterative parsing and transform the parsed structure into a proper left-leaning or right-leaning tree internally.
  • Macros for defining your own keywords in the grammar being parsed. The provided CustomIdent type can also be parameterized over the set of keywords, and it will reject such keywords when parsing, regardless of the behavior of proc-macro2's Ident.
  • Handling redundant trait bounds generated by recursive (or mutually recursive) productions, by means of a #[parsel(recursive)] attribute. Such bounds should be omitted because they currently cause rustc's trait solver to fall into infinite recursion, even though they are technically correct. Thus, the need for this attribute will hopefully be obviated once Chalk lands in stable Rust.

Limitations

  • Parsel builds upon syn and quote, so their assumptions need to be satisfied. For example, it is not possible to parse whitespace-sensitive grammars.
  • Performance is likely worse (by a constant factor) than that of an equivalent hand-written recursive descent parser.
15 Likes

It sounds like you’ve chosen a comprehensive¹ (PEG-style) grammar definition instead of a constructive one (BNF-style). This has the nice property of defining away grammar ambiguities, but at a cost: If you have two different parse trees that represent the same token stream, serializing one of them will parse to the other. Do you have any guards against programmers accidentally tokenizing these impossible-to-reproduce structs?

¹ i.e. based on comprehension — There’s probably a standard term for this family of grammar descriptions, but I don’t know what it is.

1 Like

No – in brief, it doesn't attempt to guard against people not knowing what they are doing :sweat_smile: A couple other mistakes one might make:

  • Placing a variant which represents a subset of another production before that other, longer production
  • Writing infinite recursion is still possible if you make a naïve left-recursive type
  • Write enums so complicated that speculative parsing becomes really slow (eg. even quadratic)

These all seem very hard to detect if I want the generated code to be locally compositional (ie. not require knowledge about subproductions or enclosing productions).

3 Likes

that can be avoided by using the packrat parser algorithm, which can parse any PEG in linear time (assuming no left-recursion):
http://pdos.csail.mit.edu/~baford/packrat/thesis

there's apparently also the pika algorithm, which can parse any PEG even including left-recursion, as well as having much better error-recovery -- it's also linear time:

1 Like

I'm aware of Packrat parsing, but as a first attempt, I definitely wanted to emit simple recursive descent code, and I'm still not sure off the top of my head how I would go about caching state across AST nodes (while still keeping the generating and the generated code clean). Maybe one day I'll do some research or figure it out.

i wrote a packrat parser generator in c++ several years ago, you could try it out and read the generated code for some ideas of how to do the AST memoization:

you'll need a c++11 or later compiler and cmake and make:
on Ubuntu (other OSes should work, but you'll have to figure out installing cmake and stuff yourself):

sudo apt install g++ build-essential cmake make

building:

cmake .
make

generating the example parser:

./peg_parser_generator test.peg

the generated output is in test.cpp and test.h, you can build/run it by running:

make -f test.mk
./build-test/test

the method i used is kinda like:

1 Like