Making a parser

I was wondering if anyone had any advice on making a parser? Unlike the other question, I don't want to use a library like nom or rust_lex; I would like to implement the full process myself. I've had a look at the source code for both pulldown-cmark and Sassers, but it's hard to determine how to do it just from source code. Any articles, whether more theoretical or Rust-specific, would be a big help.

3 Likes

Hi Aarone, perhaps you would find some inspiration here? I believe Nicholas even references a number of other existing Rust parser generators in his article. I know it's not exactly the kind of advice you were after, but I still hope the pointer to that article is useful.

Would you like to implement a parser generator, or a hand-written recursive descent parser?

In the latter case, you can take a look at the Rust parser itself. It is big, but quite readable. Here is an entry point.

As for the theoretical coverage of parsing, the Dragon Book is a good read.

I have written one and a half libraries with manual parsing in Rust and hopefully my observations here will be helpful :slight_smile:

There are basically two approaches to designing a parser's API, and the choice greatly affects the implementation. These approaches are usually called pull parsers and push parsers. The former consume an input stream (usually bytes or Unicode code points) and produce a stream of events which the user can then consume gradually, as needed; they are also sometimes called streaming parsers. The latter consume an input stream and invoke callbacks which were configured when the parser was instantiated; these callbacks correspond to the different kinds of content in the input document.
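
To make the distinction concrete, here is a minimal sketch of the two API shapes (the `Event`, `PullParser`, and `PushParser` names are made up for illustration, not taken from any real library):

```rust
// Hypothetical events for some document format.
enum Event {
    StartElement(String),
    Text(String),
    EndElement,
}

// Pull parser: the caller drives, asking for one event at a time.
struct PullParser { /* input handle + parser state */ }

impl PullParser {
    fn next_event(&mut self) -> Option<Event> {
        // A real implementation resumes from saved state, consumes some
        // input, and returns the next event (or None at end of input).
        None
    }
}

// Push parser: the parser drives, invoking a user-supplied callback.
struct PushParser { /* input handle */ }

impl PushParser {
    fn parse(self, mut on_event: impl FnMut(Event)) {
        // A real implementation would walk the whole input here,
        // calling on_event for each piece of content it recognizes.
        for event in std::iter::empty::<Event>() {
            on_event(event);
        }
    }
}
```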

Note that it is trivial to write a push parser if you have a pull parser at hand, but it is very hard to write a pull parser on top of a push parser without losing "streamingness", that is, without needing to consume the whole input stream beforehand. On the other hand, pull parsers are harder to write, because you need to invent a way to keep the parser's state between pull attempts; usually some kind of state machine is used to hold it.
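
Using the hypothetical types from the sketch above, the pull-to-push direction really is just a loop:

```rust
// Wrapping a pull parser in a push-style API: trivial.
fn push_from_pull(mut parser: PullParser, mut on_event: impl FnMut(Event)) {
    while let Some(event) = parser.next_event() {
        on_event(event);
    }
}
```

Going the other way has no such shortcut, because a push parser wants to run to completion before handing control back to the caller.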

Most document formats (XML, JSON, TOML, YAML, you name it) support a DOM representation, that is, a tree structure (or possibly a graph, as in YAML) which directly encodes the different parts of the input document. It is trivial to write a parser which produces a DOM tree if you have either a pull or a push parser: you need to consume the whole stream to construct a DOM, and both approaches work fine for this. If you only need a DOM representation of the input document, you may combine push parsing with DOM construction, effectively inlining closures for events and constructing DOM objects directly.
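
For example, building a tree out of the hypothetical event stream from earlier might look like this (the `Node` type is again made up):

```rust
// A made-up DOM node type for the Event stream sketched earlier.
enum Node {
    Element { name: String, children: Vec<Node> },
    Text(String),
}

// Consume events until the current element's EndElement (or end of input),
// recursing for nested elements. The same logic works whether the events
// come from a pull parser (as here) or arrive via push callbacks.
fn build_children(parser: &mut PullParser) -> Vec<Node> {
    let mut children = Vec::new();
    while let Some(event) = parser.next_event() {
        match event {
            Event::StartElement(name) => children.push(Node::Element {
                name,
                children: build_children(parser),
            }),
            Event::Text(text) => children.push(Node::Text(text)),
            Event::EndElement => break,
        }
    }
    children
}
```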

After you decide on the kind of API you want to provide, you have different options for the actual implementation of the parser. It mostly depends on the complexity of the grammar of your input documents. This is where terms like recursive descent parsing or recursive ascent parsing come in. There are also other questions you'd need to answer, for example whether you need a separate lexer for your language or not. What you choose here is mostly a matter of personal preference and the particular aspects of your grammar.
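
For a taste of hand-written recursive descent, here is a toy parser for a made-up grammar of single digits, `+`, and parentheses. The defining feature is one function per grammar rule, with the call stack tracking nesting:

```rust
// Toy grammar:  expr := term ('+' term)*
//               term := digit | '(' expr ')'
struct Parser<'a> {
    input: &'a [u8],
    pos: usize,
}

impl<'a> Parser<'a> {
    fn peek(&self) -> Option<u8> {
        self.input.get(self.pos).copied()
    }

    fn expr(&mut self) -> Result<i64, String> {
        let mut value = self.term()?;
        while self.peek() == Some(b'+') {
            self.pos += 1;
            value += self.term()?;
        }
        Ok(value)
    }

    fn term(&mut self) -> Result<i64, String> {
        match self.peek() {
            Some(c @ b'0'..=b'9') => {
                self.pos += 1;
                Ok((c - b'0') as i64)
            }
            Some(b'(') => {
                self.pos += 1;
                let value = self.expr()?;
                if self.peek() == Some(b')') {
                    self.pos += 1;
                    Ok(value)
                } else {
                    Err("expected ')'".to_string())
                }
            }
            _ => Err("expected digit or '('".to_string()),
        }
    }
}
```

With this, `Parser { input: b"1+(2+3)", pos: 0 }.expr()` returns `Ok(6)`.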

There are already lots of libraries in Rust with a hand-written parser inside them, in fact. Rust language features, such as algebraic data types and pattern matching, greatly simplify the task of writing parsers.

Several libraries, for example, are based on some kind of state machine whose input is bytes or characters and whose output is a sequence of events or a DOM tree; this is how (a shameless plug!) xml-rs or RustyXML work, for example. In xml-rs I also use a lower-level lexer to simplify handling of atomic tokens in the parser. In terms of implementation, xml-rs is basically a recursive descent parser with an explicit state machine instead of actual recursion. These are examples of pull parsers.
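
The general shape of that technique looks something like this (a simplification for illustration, not xml-rs's actual code, reusing the hypothetical `Event` type from earlier):

```rust
// Parser state lives in an enum rather than on the call stack, so
// next_event() can resume wherever the previous call left off.
enum State {
    OutsideTag,
    InsideTag,
    Done,
}

struct StreamingParser<I: Iterator<Item = char>> {
    input: I,
    state: State,
}

impl<I: Iterator<Item = char>> StreamingParser<I> {
    fn next_event(&mut self) -> Option<Event> {
        loop {
            match self.state {
                State::OutsideTag => match self.input.next() {
                    Some('<') => self.state = State::InsideTag,
                    Some(c) => return Some(Event::Text(c.to_string())),
                    None => self.state = State::Done,
                },
                State::InsideTag => {
                    // Grossly simplified: read a name up to '>' and
                    // treat every tag as a start tag.
                    let mut name = String::new();
                    loop {
                        match self.input.next() {
                            Some('>') | None => break,
                            Some(c) => name.push(c),
                        }
                    }
                    self.state = State::OutsideTag;
                    return Some(Event::StartElement(name));
                }
                State::Done => return None,
            }
        }
    }
}
```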

Others, like toml-rs, construct DOM trees while continuously consuming their input, without the need to hold intermediate state. This is an example of a push parser which constructs DOM trees directly; it also looks like a recursive descent parser, one which actually uses recursion and the call stack to keep its state.

Anyway, unless you want to write a generic library while keeping some space for future optimizations, you probably don't want to write a parser manually. You should consider using parser combinators or generators first.

8 Likes

Nice post! csv has a relatively simple example of a pull parser.

2 Likes

Wow, I don't know how I forgot about it! CSV is probably one of the simplest formats to learn how to parse, so it is probably a nice starting point.

Eeee I am both honored and scared that you've looked at sassers :slight_smile: I'm very much learning and experimenting myself! I've learned the most from pulldown-cmark and the rustc parser.

I could try writing about the techniques I'm trying and problems I've run into doing this in rust specifically, do you have any particular questions you'd like to see addressed?

I have looked into parser combinators and generators. I do not like them. I'm not trying to make a generic library; I'm trying to write a compiler like pulldown-cmark or Sassers.

Well, under "generic library" I meant libraries like csv, toml-rs and xml-rs - that is, not a part of some application but general-purpose libraries for working with some specific format, as opposed to single-purpose libraries/modules for some specific format in a concrete application. Your case seems to be general purpose enough, so it may make sense to write something custom indeed :slight_smile:

Also, as far as I know, there are no libraries for generating or combining streaming parsers, so if you want one, you probably need to write it yourself anyway. Not that it should be done without consideration: streaming parsers are considerably harder to write than non-streaming ones for relatively complex grammars, and the potential gain may be very small. toml-rs, for example, requires a complete string in order to parse a TOML document, but that is not really a problem in practice, as most TOML documents are small enough to load into memory entirely.

I'll definitely have a look into making a csv parser. Thanks.

It depends on the complexity of what you want to parse. Some parsers can be implemented as regular expressions, if you switch between sets of expressions with a state machine to track the semantics. Python's pygments uses this approach (Rust's pygments). I've built an obfuscator for a proprietary programming language this way: it works great and is extremely simple to debug.
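
A rough sketch of that style in Rust, assuming the regex crate (the states, token kinds, and rules here are invented for illustration):

```rust
use regex::Regex;

#[derive(Clone, Copy, PartialEq)]
enum LexState {
    Code,
    InString,
}

fn lex(input: &str) -> Vec<(String, &'static str)> {
    // One rule set per state: (anchored pattern, token kind, next state).
    let code_rules = [
        (Regex::new(r#"^""#).unwrap(), "string-start", LexState::InString),
        (Regex::new(r"^\d+").unwrap(), "number", LexState::Code),
        (Regex::new(r"^\w+").unwrap(), "ident", LexState::Code),
        (Regex::new(r"^\s+").unwrap(), "space", LexState::Code),
    ];
    let string_rules = [
        (Regex::new(r#"^""#).unwrap(), "string-end", LexState::Code),
        (Regex::new(r#"^[^"]+"#).unwrap(), "string-body", LexState::InString),
    ];

    let mut state = LexState::Code;
    let mut pos = 0;
    let mut tokens = Vec::new();
    'outer: while pos < input.len() {
        let rules: &[_] = if state == LexState::Code {
            &code_rules
        } else {
            &string_rules
        };
        for (re, kind, next) in rules {
            if let Some(m) = re.find(&input[pos..]) {
                // Emit a token, advance, and possibly switch state.
                tokens.push((m.as_str().to_string(), *kind));
                pos += m.end();
                state = *next;
                continue 'outer;
            }
        }
        break; // no rule matched: stop (a real lexer would report an error)
    }
    tokens
}
```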

Regular expressions with callbacks (like Ragel) also suit similar tasks, but Ragel has no Rust target (its implementation depends on the goto operator).

I wrote a series of articles called Let’s build a browser engine! that includes very simple hand-written HTML and CSS parsers, with tutorials to walk you through the code.

Markdown, among other things, requires two passes and infinite lookahead to parse. Unless you're trying to parse a complex language like that, and fast, don't use pulldown-cmark as an example.

In other words, same thing @carlos10cents said.

That is actually exactly what I want to do. Haha

Eh, with your project specifically: why have fns like is_ascii_whitespace? Why not build for Unicode? Why not use iterators like chars()? What was your starting point? How did you organize the rules of the language?

That function, and a few others, I straight-up copied from pulldown-cmark in order to get something working. I've been slowly refactoring and improving these kinds of things as I go, and I plan to support Unicode eventually.

I might switch to using that, but again, I was using pulldown-cmark as an example, and that uses bytes. I see now that it also has a scan_codepoint function that uses chars(), so I will likely switch to something like that soon. I needed to get started with something actually working in order to understand all the issues :smile:
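
For what it's worth, one possible shape for a char-based scanning helper (the name and signature here are made up, not pulldown-cmark's API):

```rust
// Split input at the end of the longest prefix whose chars satisfy pred.
// char_indices() guarantees we only ever split on a char boundary.
fn scan_while(input: &str, pred: impl Fn(char) -> bool) -> (&str, &str) {
    let end = input
        .char_indices()
        .find(|&(_, c)| !pred(c))
        .map_or(input.len(), |(i, _)| i);
    input.split_at(end)
}

// e.g. scan_while("répondez s'il vous plaît", char::is_alphabetic)
//      == ("répondez", " s'il vous plaît")
```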

One neat thing about sass is that, because it was originally written in Ruby and then rewritten in C, there's a test suite external to both implementations called sass-spec. So I'm slowly working through those test cases: the first one is an empty file, the second one a simple case with one rule, etc. I'm a TDD proponent, so it's a lot of: try the next test case (red), do the simplest thing to get the test to pass (green), then refactor to make the code make more sense, have less repetition, etc. (refactor).

Looking through the commits, I've done a few MAJOR reorganizations along the way (like here, here, and here). I've been trying hard to avoid prematurely refactoring things, which means I've been letting messiness stay around longer than I usually would. I feel like it's helped me to make progress, save time that might be spent un-refactoring, and let me learn new techniques when I had enough pain to make me really want to learn them :wink:

You can see where I started embracing Cow, where I learned how to implement Iterator, where I got a bit more comfortable with closures, and where I started using custom Errors and Result everywhere.

I'm using enums a lot to enforce which things behave the same and which behave differently. This works out really nicely with Rust's match statements. I'm also starting to extract little methods that define "character classes" of sorts, like the characters that are valid in CSS units. The rules I've had the most trouble with (so far), and still don't particularly like my solution to, are those around when / is treated as a separator and when it's division (see this gross code).
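
To illustrate the general idea (these names are mine for this post, not sassers's actual code):

```rust
// Character classes as an enum, classified with a single match.
enum CharClass {
    UnitChar,   // letters and '%', which can appear in CSS units like "px"
    Digit,
    Whitespace,
    Other,
}

fn classify(c: char) -> CharClass {
    match c {
        'a'..='z' | 'A'..='Z' | '%' => CharClass::UnitChar,
        '0'..='9' => CharClass::Digit,
        c if c.is_whitespace() => CharClass::Whitespace,
        _ => CharClass::Other,
    }
}
```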

Hope that helps!!

2 Likes