Parsing a string until a matching string (nom)

I'm learning to use the nom parsing library, and would like to:

Parse a string, such as "regular string but then a file.rar" up until "file.rar", such that it would return Ok("file.rar","regular string but then a ").

I have a parser for parsing 'file.rar', below.

fn rar_ext(s: &str) -> IResult<&str, &str> {
    preceded(multispace0, terminated(alphanumeric1, tag(".rar")))(s)
}

Any thoughts? I can't think of how one would do this with regex, either.

The regex for this would be ^(.*\s+)(.*\.rar), assuming you want to parse the last whitespace-separated component as the filename.

3 Likes

For the output you want, you'd need take_until and not terminated.

preceded(multispace0, take_until(tag(".rar")))

I'm not sure this does it. I just want to get the "plain" string before the file name, and then the file name (without extension).

Also, I don't believe one can use tag()inside take_until, but it could be implemented like take_until(".rar"). Though again, I'm not sure that this fits the bill

I managed to solve this...rather verbosely using insight from This Guide. The issue is parsing a string of plain text until a parser works.
Here is My Solution though I feel like there must be a way to handle this better, it will work.

Insight here would still be greatly appreciated.

In short, here is the bit I'm using to handle 'plain' strings with parsable strings inside:

let output = Vec::new();
while !input.is_empty() {
        let mut found_match = false;
        for (current_index, _) in input.char_indices() {
            match either(&input[current_index..]) {
                Ok((new_input, result)) => {
                    let leading_text = &input[0..current_index];
                    if !leading_text.is_empty() {
                        output.push(Token::Plain(leading_text))
                    }
                    output.push(result);
                    found_match = true;
                    input = new_input;
                    break;
                }
                Err(err) => {}
            }
        }
        if !found_match {
            output.push(Token::Plain(input));
            break;
        }
    }

I made a small improvement to optimize the search while parsing: Rust Playground

The main contribution here is that the parser tries not to re-parse characters more than once. The only case where it's necessary is for backtracking when an extension is found.

1 Like

It seems like the real implementation that I 'want' is to have a parser that works like take_until() but matches until a parser returns true on a substring, while iterating through "words" (chars surrounded by whitespace).

I made a combinator out of the code implemented above, but I'm a bit lost on how to best return an error from it (using ErrorKind::Fail here, to get it to compile. Any more insight from more capable 'nommers' than me? I feel like I'm kind of fumbling around

fn parse_plaintext_till(
    mut until: impl FnMut(&str) -> IResult<&str, Token>,
) -> impl FnMut(&str) -> IResult<&str, (Token, Token)> {
    move |input: &str| {
        for (current_index, _) in input.char_indices() {
            match until(&input[current_index..]) {
                Ok((new_input, result)) => {
                    let leading_text = &input[0..current_index];
                    return Ok((new_input, (result, Token::Plain(leading_text))));
                }
                Err(err) => {}
            }
        }
        Err(nom::Err::Error(nom::error::Error {
            input,
            code: nom::error::ErrorKind::Fail,
        }))
    }
}

That still looks like it's doing unnecessary parsing. You don't want to use a for loop if you can avoid it.

for i in input { &input[..i] } I believe technically runs in quadratic time.

You don't want to iterate over the range twice, just scan for the pattern you are interested in and return the token that matches it.

In my example, I had to backtrack to perfectly match the output of your initial parser. But if that is not a requirement you can remove the backtracking by matching on space/period characters with take_until() or take_till().

I also don't know why you would need to return an error, since the parser appears to match potentially every input. (Except maybe empty string? That decision is yours to make.)

Not by itself, as slicing is O(1) – it's simply 1 or 2 pointer arithmetic operations, after all.

However, you are right that this looks suspicious in general, since the rest of the loop body might still turn out to take time proportional to the length of the subslice.

2 Likes

Correct, the slicing itself is a constant time operation, but the assumption is that the slice is iterated inside until somewhere.