Idea: Semantic Whitespace Parsers for `nom`

Hi all! I've been using nom a lot recently for parsing "programming-language-esque" things, and I think I've finally come up with a principled way of handling whitespace. Note: this will only really work if you're parsing "whitespace-insensitive" languages (though these languages still require whitespace in some places, like around keywords).

(See the attached gist for the full definition and some examples.)

The main idea is to get yourself asking questions like:

Is whitespace allowed to appear after/before/around this thing?

Is whitespace required to appear after/before/around this thing?

Is it required, or allowed for whitespace to appear exactly here?
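
To make the allowed/required distinction concrete, here is a minimal dependency-free sketch. It uses plain functions over `&str` rather than nom's actual combinator API, and the names `ws_allowed`, `ws_required`, and `keyword` are made up for illustration; they are not taken from the gist. "Allowed" whitespace never fails, while "required" whitespace fails when nothing was consumed.

```rust
/// Whitespace is allowed here: skip zero or more whitespace characters.
/// This never fails; it just returns the rest of the input.
fn ws_allowed(input: &str) -> &str {
    input.trim_start()
}

/// Whitespace is required here: succeed only if at least one whitespace
/// character was consumed.
fn ws_required(input: &str) -> Option<&str> {
    let rest = input.trim_start();
    if rest.len() < input.len() {
        Some(rest)
    } else {
        None
    }
}

/// Hypothetical keyword parser: the keyword must be followed by required
/// whitespace, so `letx` is not read as `let` followed by `x`.
fn keyword<'a>(kw: &str, input: &'a str) -> Option<&'a str> {
    let rest = input.strip_prefix(kw)?;
    ws_required(rest)
}
```

With these, `keyword("let", "let x")` succeeds and leaves `x`, while `keyword("let", "letx")` fails, which is the "required after a keyword" case mentioned above.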

I have yet to do much testing with this, but I have been able to refactor the parser for my pet project and it at least looks a lot nicer. I'd appreciate any feedback y'all have on this, and if people like it, maybe we could make it into a crate! Thanks!

Looks nice!

To make it general purpose, do you think it'd be worth separating out the logic for comments, or otherwise allowing configuration of what constitutes a comment? Right now the code looks limited to languages which use // comments. To be honest, though, I'm not sure what the best way to make that configurable would be...

Yeah, all the other parsers defined here depend on the comment parser, so it would be nice to be able to customize it.

Option 1: Traits

Maybe that means that instead of a series of nested modules, this crate would need to use a trait:

trait WsParser {
    // The only required item: what a comment looks like in your language.
    fn comment();
    fn allowed_here() { /* default impl */ }
    fn allowed_before() { /* default impl */ }
    fn allowed_after() { /* default impl */ }
    fn allowed_around() { /* default impl */ }
}

struct Space;

impl WsParser for Space {
    fn comment() { /* your custom impl here */ }
}

// Use it like this:

I feel like this doesn't read very smoothly though. Plus, you'd have to rely on the user defining a struct with a name that reads well.
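
As a rough illustration of the trait idea, here is a dependency-free sketch. The signatures are assumptions (the post leaves them out): each function takes the remaining input as `&str` and returns what is left, instead of using nom's parser types. `required_here`, `line_comment`, and `HashSpace` are hypothetical names added for the example.

```rust
/// Hypothetical trait: implementors only say what a comment looks like.
trait WsParser {
    /// Try to consume one comment at the start of `input`, returning the
    /// remaining input on success. Must consume something when it succeeds,
    /// otherwise `allowed_here` would loop forever.
    fn comment(input: &str) -> Option<&str>;

    /// Whitespace and comments are allowed here (zero or more); never fails.
    fn allowed_here(input: &str) -> &str {
        let mut input = input;
        loop {
            let trimmed = input.trim_start();
            match Self::comment(trimmed) {
                Some(rest) => input = rest,
                None => return trimmed,
            }
        }
    }

    /// Whitespace or a comment is required here (at least one).
    fn required_here(input: &str) -> Option<&str> {
        let rest = Self::allowed_here(input);
        if rest.len() < input.len() {
            Some(rest)
        } else {
            None
        }
    }
}

/// Helper: consume an end-of-line comment starting with `prefix`.
/// The newline itself is left for `allowed_here` to trim as whitespace.
fn line_comment<'a>(prefix: &str, input: &'a str) -> Option<&'a str> {
    let body = input.strip_prefix(prefix)?;
    Some(body.find('\n').map_or("", |i| &body[i..]))
}

/// A language with `//` end-of-line comments.
struct Space;

impl WsParser for Space {
    fn comment(input: &str) -> Option<&str> {
        line_comment("//", input)
    }
}

/// A language with `#` end-of-line comments.
struct HashSpace;

impl WsParser for HashSpace {
    fn comment(input: &str) -> Option<&str> {
        line_comment("#", input)
    }
}
```

A call site then looks like `Space::allowed_here(input)`, so how well it reads does depend on what the user names their struct.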

Option 2: Statically Configurable

Maybe you could just pass in the end-of-line comment signifier string as a config option in Cargo.toml somehow. Like you could specify "this language I'm parsing uses // to begin end-of-line comments".
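
One mechanism Cargo does offer for compile-time switches is feature flags, so a sketch of this option could look like the following. The feature name `hash-comments` and the `LINE_COMMENT` constant are hypothetical, and this only chooses between a fixed set of markers rather than accepting an arbitrary string.

```rust
// Assumed (hypothetical) entry in the whitespace crate's Cargo.toml:
//
//   [features]
//   hash-comments = []

/// The end-of-line comment marker, chosen at compile time.
#[cfg(feature = "hash-comments")]
const LINE_COMMENT: &str = "#";
#[cfg(not(feature = "hash-comments"))]
const LINE_COMMENT: &str = "//";

/// Consume one end-of-line comment, returning the rest of the input
/// (starting at the newline, or empty at end of input).
fn comment(input: &str) -> Option<&str> {
    let body = input.strip_prefix(LINE_COMMENT)?;
    Some(body.find('\n').map_or("", |i| &body[i..]))
}
```

One caveat with this approach: Cargo unifies features across the dependency graph, so two crates in the same build that want different comment markers would end up sharing one setting.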


Neither of these options allows the user to specify multi-line comments, so they are in no way a complete solution.