Fault tolerant parsing with nom

I'm writing a parser for cron expressions with nom. There are several crates that already do this but I want to make it tolerant for input errors. For example, with expression 5 4 *- * * I want to record the error from parsing the days of the month and continue parsing the months and days of the week instead of failing completely.

I have a basic version of the parser working here but I can't figure out how to make the fault tolerant part work. Obviously in cron_expression() when the first parsing error is encountered, it's propagated by the ? operator and the complete parser fails but I don't see another way to get to the remaining input to pass it to the next parser.

The usual solution is to have a separate "error" node type in your AST, and upon encountering a syntax error, produce an instance of it as if it were a normal parse result, rather than returning an error.

4 Likes

I was just thinking the same but how do I access the remaining input string to pass to the next parser? I'm currently trying to split the input with the separated_list1() parser and the separately parsing the fields but that feels a bit weird with nom.

It's not clear what you mean here? In nom you return the value and the remaining input on successful parse, so you are defining what the remaining input is when you recover and return the "error value" as a success, so you don't need to do anything special elsewhere.

This might look like (untested!):

let field = alt((
  map(terminated(u32, space0), Field::Present),
  // Recover: discard the invalid char 
  value(anychar, Field:: Invalid),
));
let fields = tuple((field, field, field, field field));
let parser = map(fields, Cron);

but it depends on what your error recovery looks like.

1 Like

return the "error value" as a success, so you don't need to do anything special elsewhere

That's a good idea I hadn't considered.

I made some changes to the code so that each Entry of a field can be either a valid entry or an error. For example, when parsing 5 1,! * * * I want to parse hour 1 and an error for the invalid !.

The updated code is here. I added some unit tests for parsing the invalid entries and they all pass, crfirming that the parser returns an Ok value wrapping the error. But when I try to parse a full expression that contains an invalid entry the full parser fails instead of the returning the the individual parsing failures.

The limitation of a parser combinator like nom is that if parsing fails, it is difficult to determine why it failed or how to proceed.

As others have pointed out, you can extend the grammar to make some originally invalid input valid, and then check after parsing to catch errors. The problem is that such extension of the grammar only cover a very limited number of possible parsing errors, while making your grammar more complex, with extra maintenance and runtime cost. This is really only feasible if there are certain errors that you have identified as important to be forgiven. Or, if you can extend the grammar in a way that does not increase complexity. Either way, only limited number of parsing errors will be covered.

The deeper problem is that with invalid input, the intention quickly becomes ambiguous. For example, if the input is "** ", was the intention "* " or "* * "? It may be better to refuse than to interpret the input incorrectly.

Finally, my knowledge of chron expressions is a bit rusty, but aren't they simple enough that using parser combinator may be a bit of an overkill? Maybe I misremember, but can't we just split into lines, then split each line by white space and them for most columns just check if they are "*" or number?

My end goal is to make a crontab.guru clone. My first version of the parser I wrote like you described by splitting on whitespaces and commas. This works fine but I was hoping that nom would make it easier to locate different parts of the cron expression in the input string so that I can recreate the links that highlight them in the input field. It still might be a bit overkill though.

If you just want the error location, that's a different issue! The default nom::error::Error type includes the remaining input, and you can replace it with whatever you're you want that meets the trait to track more information, but that's getting pretty fiddly. You can also annotate and even handle errors with combinators (that's what alt does, for example), it's just that the built in ones won't try to continue from the error location but from the previous input, because it's far easier to reason about, and it doesn't tie them to a specific error type.

Remember also that nom parsers are essentially just Fn(input: I) -> Result<(T, I), E>, and a combinator is just a function that returns a parser, so you define whatever behavior you like in edge cases, it's just that most of the time you can write it more succinctly with the existing combinators. If you want error recovery at several different levels of granularity, there's no problem with that either: handle the cases you know how to handle locally (not a number), locally, handle larger errors (not enough fields in a line) further up, and so on.

I want to mention, that nom isn't the only parser-combinator library available, and that chumsky is a very good alternative when it comes to parsing human-writable inputs, as it actually has focus on error recovery.

This is a fair point and for inputs that are too far from a valid one I'll indeed need to reject the input completely. What I'm hoping to achieve though is to continue parsing when a specific entry of a field is invalid. This is what I tried to do in the entry parser:

fn entry<T>(input: &str) -> IResult<&str, MaybeEntry<T>, VerboseError<&str>>
where
    T: FromStr,
{
    match alt((step, map(range, Entry::Range), map(value, Entry::SingleValue), all))(input) {
        Ok((input, entry)) => Ok((input, MaybeEntry::Entry(entry))),
        Err(nom::Err::Error(error)) | Err(nom::Err::Failure(error)) => Ok((input, MaybeEntry::Error(error))),
        _ => unreachable!(),
    }
}

fn field<T>(input: &str) -> IResult<&str, Field<T>, VerboseError<&str>>
where
    T: FromStr,
{
    map(separated_list0(tag(","), entry::<T>), |entries| Field { entries })(input)
}

There are two problem with this which I haven't been able to solve so far:

  1. When all the parsers in the alt() fail, entry() correctly returns an Ok value with my error but the offending input is not consumed so the following parsers fail on it as well eventually causing the full parser to fail.
  2. On some invalid input like 1- * * * * the entry parser incorrectly parses it as an Entry::SingleValue, leaving the - which again causes the full parser to fail.

I feel like what I want to do is take all the characters that belong to the current entry (i.e. everything before the , that separates it from the next entry, a whitespace that separate the field from the next field, or the end of the input) apply the parsers that parse the entry and correct any invalid input into a separate enum variant. I'm failing to see how to use such an approach with nom though and I'm not sure if this top-down approach even fits well with nom.

That approach should be fine with separated_list on alt, eg:

separated_list(char(','), alt((
  valid_entry, // happy path
  success(Entry::INVALID), // or map(rest, ...) if you want the invalid content
)))

But again, exactly what behavior you're after is the actual problem, recovery is difficult because you're trying to guess what the user intended from ambiguous input. If you want to handle and recover from the user missing a comma at the end of an entry, you would need to handle the comma parser failing, and decide where to put that information in the parsed result.

That might instead look like:

alt((
  value(char(','), None), // no issue
  value(eof, None), // also fine
  success(Some(Error::MissingComma)),
))

And you might either want that in the entry parser, or in the entry list parser, perhaps a fold_many(pair(entry, entry_sep), || ...)

I tried something similar to this with rest earlier but I can't get this to work. I indeed need to consume the invalid input so it doesn't cause the next parser to fails as well. The problem with rest is that it consumes all the remaining input, leaving nothing for the other parsers (which then in turn "succeed" with invalid empty inputs). For example, when parsing 1 1-10/2,# * * * it should be possible to parse the 1 for minutes, the 1-10/2 for hours and the invalid # for hours. With my current implementation Entry::Invalid captures the full # * * *:

    CronSchedule {
            minutes: Field {
                entries: [
                    SingleValue(
                        1,
                    ),
                ],
            },
            hours: Field {
                entries: [
                    Step(
                        Bounded(
                            1,
                            10,
                        ),
                        2,
                    ),
                    Invalid(
                        "# * * *",
                    ),
                ],
            },
            days_of_month: Field {
                entries: [
                    Invalid(
                        "",
                    ),
                ],
            },
            months: Field {
                entries: [
                    Invalid(
                        "",
                    ),
                ],
            },
            days_of_week: Field {
                entries: [
                    Invalid(
                        "",
                    ),
                ],
            },
        }

For more complex invalid inputs I'm fine with just recording the entry as invalid, capturing the invalid input and moving on. If there is a comma missing between entries then I want to parse it as one entry which will probably be invalid.

Ah, yes, I misunderstood separated_list, it doesn't separate first (not all inputs support taking). You probably want something with map_parser() like many0(map_parser(take_while(|c| c != ','), parse_entry)), though that doesn't handle parsing the comma itself. You may also prefer the Parser methods like and_then.

I wonder if in this case I'm beter off manually splitting the input string on spaces and commas and then parsing the individual entries with nom parsers, or parsing manually altogether. Splitting first and then parsing seems to go against the parser combinator paradigm.

If it was, they wouldn't have map_parser! But give it a try, it should at worst be educational, and you can easily mix and match, you just need to have the signature fn (input: &str) -> IResult<&str, T> for whatever T you're parsing, and you can use it like any other nom parser, and you can use nom parsers just by calling them, they return Ok((remaining_input, value)) on success.

I must be coming close. I came up with this but for some reason it fails:

fn field<T>(input: &str) -> IResult<&str, Field<T>>
where
    T: FromStr,
{
    map(
        many1(terminated(
            take_till(|c| c == ',' || c == ' ').and_then(alt((entry, invalid))),
            opt(tag(",")),
        )),
        |entries| Field { entries },
    )(input)
}

The idea is to take all characters till a comma (the end of the entry) or a whitespace (the end of the field) is encountered and then apply the entry parser with the invalid fallback. To remove the commas I'm using the terminated combinator but since there's no comma after the last entry of a field the parser is wrapped in an opt. The spaces between fields should be parsed by terminated combinator in

fn cron_expression(input: &str) -> IResult<&str, CronSchedule> {
    let (input, (minutes, hours, days_of_month, months, days_of_week)) = tuple((
        terminated(field::<u32>, space1),
        terminated(field::<u32>, space1),
        terminated(field::<u32>, space1),
        terminated(field::<u32>, space1),
        terminated(field::<u32>, space0),
    ))(input)?;

    Ok((
        input,
        CronSchedule {
            minutes,
            hours,
            days_of_month,
            months,
            days_of_week,
        },
    ))
}

But when running this on for example 5,6 1 * * * it gives this error:

Err(
    Error(
        Error {
            input: " 1 * * *",
            code: Many1,
        },
    ),
)

The code suggests it's the many1 parser that failed but I'm not sure why.

That output has successfully consumed 5,6, but it's only reporting an error from many1, which has the following error cases:

  • The parser it wraps errored (Err::Error) without any items being parsed, but then you would both get that error instead (for E=Error) and no input would be consumed.
  • The parser it wraps failed (Err::Failure), but you're seeing Err:: Error.
  • The parser it wraps succeeded, but didn't consume any input, because that would be an infinite loop.

So it has to be that last one.

The take_till will currently split that input into the inner inputs of 5, 6, 1, *, *, * before applying the inner parser, that will not presumably not succeed with entry, but invalid is consuming the 5 and 6, and the opt(tag(",")) the comma, but then it can't make progress after that.

You instead want to nest take_till calls with each delimiter separately to get the behavior of splitting each level separately, and you should never successfully parse nothing: many will continue until the inner parser errors, so you don't need to specially handle the end there.

Thinking about it now, you probably want separated_list(tag(","), take_till(...).and_then(...))?