Regex workaround

moveax41h · September 15, 2018, 6:22am

I have a very long regex expression that I am trying to evaluate against some strings. I'm trying to check like 10 different regexes so I placed each one in a separate named capture group using (?exp)|<?exp) notation. I'm using a ton of different characters like |, ?, [, ], {, }, etc.. Somewhere along the way the regex engine is having a problem parsing this. This tells me that rust is successfully parsing the string and submitting it to the regex engine. Additionally, I've checked the regex on regex101.com and it works properly.

My other option is to use loops and create separete Regex::new()s for each separeate expression rather than have one big regex string. However, I thought that using one big regex string would be faster because I need to scan millions of lines of text and I'd rather submit the regex once per line rather than loop over like 15 different regexes for each line if that makes sense. I'm not entirely sure my hypothesis is correct but it seems so.

Do you have any suggestions for troubleshooting this or workarounds? Thanks.

krdln · September 15, 2018, 8:55am

Have you tried using RegexSet?

BurntSushi · September 15, 2018, 10:12am

I'm having a hard time understanding the problem you're facing. Could you please provide a program that reproduces the bad behavior along with any necessary input? What is the actual output? What is the expected output?

lambda · September 15, 2018, 4:44pm

Generally the best way to get solve a problem like this, whether on your own or when asking on a forum, is to try and reduce the problem down to a minimal example that demonstrates the problem. Take your big complicated regex, and keep trimming parts down until the problem goes away; then add that part back, and trim down the rest.

It is a little bit unclear from your question if the problem you are having is with compiling the regex (Regex::new("...").unwrap()), or with matching the text. If the problem is with matching, then also try to cut down the input to a small example that you think should match, but which doesn't.

This may be enough for you to solve the problem on your own. If not, it will give you something small enough that you can ask about it here.

For example, after this process, you should wind up with an example something like the following:

extern crate regex;
use regex::Regex;

fn main() {
    let re = Regex::new("(ab*)|(ba*)|(ca*)|(da*)").unwrap();
    for cap in re.captures_iter("abracadabra") {
        println!("Captures: {} {} {} {}", &cap[1], &cap[2], &cap[3], &cap[4]);
    }
}

Then your question might be something like:

I ran the above code, and got the following error:

thread 'main' panicked at 'no group at index '2'', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-1.0.4/src/re_unicode.rs:1019:32

To which the answer would be that only one of those capture groups actually matched anything in the string, so you have to check which capture groups actually contain a match with cap.get() or cap.iter() instead of just indexing with cap[i].

As a note on your question about regex101.com, each regular expression engine has slightly different syntax and behavior. That site does not support the Rust regular expression engine. The one on that site that is probably the closest is the golang one, since both the golang and Rust regular expression engines and syntax derived some inspiration from RE2, and both use finite automaton based engines rather than backtracking engines (which limit the features, such as not supporting back-references, but improve the worst-time performance).

@krdln's suggestion, of using a RegexSet, is a much cleaner way of matching multiple regular expressions simultaneously than combining them yourself. It is essentially equivalent to combing then yourself with capture groups, but you don't have to go to the extra effort of adding the extra syntax for capture groups.

And one final note, is that if you are having trouble with the syntax and your regex contains any backslashes, check to see if you're using raw strings; otherwise, you'll have to double up all of the backslashes. r"foo\.bar" can be easier to read and verify than "foo\\.bar".

moveax41h · September 16, 2018, 2:56am

lambda, this has got to be one of the best replies I've ever read to a coding problem. Thank you so much! I'm going to rework this and try again taking all of this into account.

I agree with you in simplifying the problem... Even just for debugging purposes, splitting up my checks is the smartest approach because if nothing else, it'll let me narrow down which specific regex string is causing the prorblem rather than having to try and find the parse error in a ridiculously huge regex. Thanks again!!

trentj · September 16, 2018, 4:47am

I just want to respond to this:

This is not, in general, true. It does depend on exactly how your regex engine works, but consider a pattern like /apple|pear/, where you only expect one of the alternatives to match. A standard regex will go character by character through the string trying to match each part of the pattern at each index. So for the string pineapple it will try to match /apple/ to pineapple and fail, then try to match /pear/ to pineapple and fail, and start back at ineapple with /apple/ and so on. After it fails to match anything starting in pine, which could take a very long time if you have many or complicated alternatives, it will finally try to match /apple/ to apple and succeed. Whereas if you first tried to match /apple/ by itself, you would have succeeded in (less than) half the time, and you never would have had to worry about /pear/ at all.

A regex can utilize various clever strategies to improve performance, and I don't know to what extent the regex crate implements any of those, so don't take the above example as authoritative for "how things work in Rust". I just wanted to illustrate that alternation within a single regex can be significantly slower than matching several regexes one at a time. (Incidentally, this also demonstrates why, when matching multiple regexes of which only one is expected to succeed, you should put the one that is most likely to succeed first.)

RegexSet, from my understanding, is used in a different situation: where you have several expressions and you want to find all matches. I guessed from your question that you have a bunch of expressions and you want to find the one match, in which case RegexSet is probably fairly slow. But if performance is important, you shouldn't take my word for it; definitely benchmark it and measure the difference.

BurntSushi · September 16, 2018, 11:48am

trentj:

This is not, in general, true. It does depend on exactly how your regex engine works, but consider a pattern like /apple|pear/ , where you only expect one of the alternatives to match. A standard regex will go character by character through the string trying to match each part of the pattern at each index. So for the string pineapple it will try to match /apple/ to pineapple and fail, then try to match /pear/ to pineapple and fail, and start back at ineapple with /apple/ and so on. After it fails to match anything starting in pine , which could take a very long time if you have many or complicated alternatives, it will finally try to match /apple/ to apple and succeed. Whereas if you first tried to match /apple/ by itself, you would have succeeded in (less than) half the time, and you never would have had to worry about /pear/ at all.

This is probably how a backtracking engine will work. The regex crate is based on finite automata, which does not work like this and matches all of the alternations at once.

There are some subtleties to this, for example, there is actually a bounded backtracking engine in the regex crate, but it is only used for small regexes on small inputs. Also, when doing a linear time NFA simulation via the PikeVM, the number of alternations does impact match time because the simulation still needs to follow the alternations. However, it's still much more efficient than the backtracking approach because it prunes unmatchable paths as it moves forward.

When the DFA is used, and after the part of the DFA for the alternations has been constructed, then it has your typical "one byte moves to the next transition" cost model.

That's correct. RegexSet does look for all of the matches if they are requested (so matches but not is_match). Benchmarking is good.

In any case, this honestly sounds like a problem with just parsing a regex, so I wouldn't worry too much about the intricacies of the regex engine here. @moveax41h If you're getting an error, does it tell you where the parser error is? It should. Could you please show your code with your regex?

Topic		Replies	Views
Regex match issues help	4	1575	January 12, 2023
Request for review: easy extraction of parts of string using regexes code review	5	982	March 4, 2021
PSA: regex got a lazy DFA. it's fast. report problems!	3	1846	January 12, 2023
Help with Regex in Rust help	3	309	September 6, 2023
Problem with regex capture?. What am I missing? help	5	3797	September 5, 2020

Regex workaround

Related Topics