Whitehole - A simple, fast, intuitive parser combinator framework

Hello everyone,

I'm excited to share the whitehole project with you. It's a general purpose parser combinator framework aimed to be simple and fast.

DiscreteTom/whitehole: A simple, fast, intuitive parser combinator framework for Rust.

Any feedback is welcome.

Greetings.

4 Likes

@DiscreteTom The only parser combinator I know and used is nom. I would love to hear how yours is different and especially why you chose to structure it like that. Of course only if you are interested and have the time to whip up a response.

9 Likes

How did you flatten the tuples ? eat('#') + double_hex() + double_hex() + double_hex();

See the Concat trait:

Concat in whitehole::combinator::ops::add - Rust

1 Like

Yes, nom, this project learned a lot from it, a huge thanks to it :heart: , very good lib.

I think here are the key differences:

Pros of whitehole:

  • Less combinators to remember. There are currently only 6 combinators to remember: eat , take , next , till , wrap , recur. It uses methods (I call them decorators) to change the behavior of combinators.
  • Compose combinators using operator overloading. E.g. eat('a') | eat('b') will eat a or b, eat('a') + eat('b') will eat ab, eat('a') * (1..=3) will eat a for 1 to 3 times. I think this is more intuitive.
  • Stateful-able. nom uses functions as combinators, which makes it hard to have a place to store states. This lib allows you to configure an optional custom state to change the parser's behavior. See examples/stateful for an example.
  • Re-usable heap. nom uses functions as combinators, it's also hard to manage heap allocations. Re-allocation may occur on each combinator execution. This lib provides a centrally managed heap so you can re-use it and prevent re-allocation. See the Fold to heap section.

Cons of whitehole:

  • No bit-level support. Only strings and bytes are supported.
  • No built-in error handling solution. Currently I think error handling should be implemented case by case, and introducing it to the framework may slow down the framework.
  • Less support for streaming data. You need to handle partial input by yourself.
  • A little bit tricky to make recursive combinators. In nom, functions are easy to recurse. In this lib, a special combinator recur is needed. See examples/json_parser_with_recur for an example.

This is the first public facing version (though it is already v0.5.0 lol). Any improvement idea is welcome.

5 Likes

if you can parse any arbitrary bytes, I'd call that binary format support.

You're right. Maybe "No bit-level support" is more precise. Updated.

1 Like

Ehi, nice work. :slight_smile: Having used a lot of different parser generators in the last 20 years I have just one comment: include error management. I understand your point but there are kind of informations that are very important for good error reporting that only the library has easy access to, e.g., the position of the error in the character or byte stteam.

4 Likes

Thanks for the suggestion!

Currently the framework has the ability to forward custom informations (via token values or the centrally managed heap) and locate byte ranges of tokens. I think these general purpose abilities are enough for many error handling scenarios, so currently there is no dedicated error handling solution.

If there is a need that can't be solved by current methods and there is a good solution for it, I will definitely add it to the framework :wink: .

Just wrote a benchmark test compared against nom to show the speed of whitehole: DiscreteTom/whitehole-bench: Benchmark tests for whitehole.

Here is the result on my laptop:

lex_json_with_whitehole: lex 3 json files (total 4609769 bytes)
time:   [4.2659 ms 4.3519 ms 4.4500 ms]

lex_json_with_nom: lex 3 json files (total 4609769 bytes)
time:   [14.197 ms 14.456 ms 14.753 ms]

Please feel free to run this benchmark in your own environment.

Disclaimer: I'm not a master of nom, I tried my best to write the nom code as similiar as the whitehole version. If anyone can optimize the nom code, PR is welcome.

1 Like

if you test a parser do a parser not a lexer. It's hard to validate you do the same thing with a lexer, you could just read all input and say "ok".

Thanks for the suggestion. Added a parser benchmark to the bench repo. Result:

parse_json_with_whitehole: parse 3 json files (total 4609769 bytes)
time:   [34.233 ms 34.959 ms 35.772 ms]

parse_json_with_nom: parse 3 json files (total 4609769 bytes)
time:   [31.791 ms 32.512 ms 33.301 ms]

The reason to choose a lexer as the first benchmark is that parsing nested content (e.g. JSON) using recur is currently a low-light of whitehole.

Besides, I want to prevent heap allocation so I can compare the speed of the two frameworks but I can't find a way to prevent heap allocation in nom when using separated_list0, so the parser benchmark includes heap allocation.

(And of course a lexer is easier to write)