Hi everyone!
I am currently writing a post comparing my new parser-combinator library, marser with nom and chumsky. In this post I am comparing code written using the three libraries on concrete example grammars.
Before I share this post I want to make sure that the comparison is fair and that the code I have written in chumsky and nom is idiomatic. I would really appreciate it if someone could look over the code I have written for nom and chumsky and tell me if its ok or I have used these libraries wrong somehow.
You can also see the code here. I have used AI to try and help me make the code more idiomatic, the post is written completely by me though.
Any general feedback about the post is also very welcome!
The question that spins in my head is what is the purpose of the work.
I looked at the code briefly, and it looks like an attempt to reimplement chumsky+ariadne from scratch, but with some weird decisions with unsafe blocks. The parsing-combinators library probably could be written entirely in safe space without the loss of performance. If the perfomance was a goal, comparison benchmark tests would be more appropriate.
Also, I don't see anything innovative, that you cannot do with already existing nom, except that it suggests to use macro rules instead of normal combinator-functions. Macros, in general, slow down compilation time, and make the end codebase lesser transparent.
I know that PEGs are popular, but most of the time we are parsing simple recursive-descending LL(k) languages (often, just LL(1)) without the need of PEG predicates to resolve grammar ambiguities, but PEGs make the entire error-recovery process much more complex.
If it has been done entirely for the learning purposes, that makes sense, but why did you use artificial neural networks then?
Chumsky is better, but personally prefer manual parsing
Hi, thank you for taking the time to look at the post and write this response!
it looks like an attempt to reimplement chumsky+ariadne from scratch
Yes, but more just chumsky. In the post I have forgotten to mention that the rendering or errors is done using annotate-snippets which can be enabled with a feature. I have added the information to the post.
some weird decisions with unsafe blocks
I use unsafe to enable packrat-style caching of parser results which have different types by using type erasure. Sadly I cannot use dyn Any here because Any requires 'static and I want to allow for zero-copy parsing. If you have any suggestions on how to do this without unsafe, I am very open to suggestions. I have tried documenting here why i did this: marser/src/cache.rs at main · ArneCode/marser · GitHub
Also, I don't see anything innovative, that you cannot do with already existing nom, except that it suggests to use macro rules instead of normal combinator-functions.
That is hard hard to argue with because what is innovative is of course subjective. Here are some reasons why I think marser adds something new:
- In other parser combinator libraries you will often need to restructure your grammar or add code to ignore certain values but keep others. Marser tries to make you able to write code that is pretty similar to grammars and add the construction of result values seperately with capture!(grammar_with_binds => result). I think this makes parser code simpler to read and reason about, because it is only grammar, but I am of course biased. Other libraries like peg have done something similar but I think they have less separation between grammar and ast construction and have a lot more macro syntax which as you mentioned makes the code base less transparent
- Marser has a TUI that allows you to debug your parsers, which i think is pretty useful, you can see a screenshot below, left side is parser source code, right side is the file being parsed.
If it has been done entirely for the learning purposes, that makes sense, but why did you use artificial neural networks then?
I have written the core parsing code myself and then used AI for things like this TUI, documentation and docstrings and also for enabling of rendering the error messages using annotate-snippets. Also some macro code. So basically parts that AI can do pretty well and that didnt interest me as much. I have learned a lot while writing the parsing logic, so for that alone I think this project has been worth it.
Sorry for the double response btw, I hit shift enter which on most platforms just adds a new line but here seems to send the reply 
Well, if the parser uses memoization, it is not zero-allocated anyway. I recommend using simple index-based arenas such as slab or slotmap, or something similar. It's much easier to reason in terms of indexes in already preallocated memory when you are working with graphs, and removes the majority of unsafe bloat from the code.
thats true, but there is a difference between zero copy parsing and zero allocation. This article defines zero copy parsing like this:
A zero-copy parser is a piece of code that can transform the received data into its structured form, without having to copy its contents into new buffers during this procedure.
Lets say you want to parse an identifier like a variable name, you could either parse it into an owned String by copying from the input, or you could return a reference to the input as &str and not have to copy. The second option will be faster and use less memory which is why it is usually preferred.
The reason that I think I need unsafe code is that I need an index-based container that can store values of different types. As far as I can see slab / slotmap only allow for storage of one type. There are libraries that provide storage of different types but they require that the types are 'static. This means that the values cannot borrow from the input and thus zero-copy parsing is not possible
.
I have thought about adding arena allocation or something similar in the past, but when profiling the library it looked like only about 15% of the time was spent allocating so speed wise it did not seem worth it for me. Maybe in the future though.