Regex match issues


#1

I am running into an issue trying to extract data from some text using the regex crate.

Here is the regex that I am trying to get to work:

let re = Regex::new(r"(?P<re>(-?\d{1,}(\.\d{1,})?\s){4}?re)|
                        (?P<h>(-?\d{1,}(\.\d{1,})?\s){2}?m\s((-?\d{1,}(\.\d{1,})?\s){2}?l\s){3}?h)").unwrap();

If I make two separate regexes, one for re and one for h, each finds the right information, but when they are joined into a single regex only the re data is found. I have tested my regex in Python; see https://regex101.com/r/pN8vH1/4 for an example.

Does my regex need to be tweaked, or is this an issue with the Rust regex implementation? Any help will be appreciated. I am not a regex expert and have been pulling my hair out for the past few days trying to get to the bottom of this.


#2

It looks like it’s because you have whitespace in your regex (the line break and indentation) between the | and the beginning of the (?P<h>...) capture group. If you remove that whitespace, it matches fine:

extern crate regex;
use regex::Regex;

fn main() {
    let re = Regex::new(r"(?P<re>(-?\d{1,}(\.\d{1,})?\s){4}?re)|(?P<h>(-?\d{1,}(\.\d{1,})?\s){2}?m\s((-?\d{1,}(\.\d{1,})?\s){2}?l\s){3}?h)").unwrap();
    let data = r"18 708.96 568.56 0.48 re f 299.28 36 0.48 673.44 re f 18 645.12 274.56 0.48 re f 18 374.16 274.56 0.48 re f 18 206.64 274.56 0.48 re f 309.6 530.16 274.56 0.48 re f 309.6 374.16 274.56 0.48 re f q 1 0 0 1 0 0 cm BT -0.0074 Tc 9.940797 0 0 9.96 30 747.12 Tm /Tc1 1 Tf [ (Le) -10 (o) -14 (n) -2 (e) -10 ( ) 1 (T) 12 (i) -7 (m) 10 (i) -7 (n) -2 (g) -2 ( ) -11 (a) -10 (n) -2 (d) -14 ( ) -11 (R) -4 (e) -10 (s) -5 (u) -2 (l) -7 (t) -7 (s) -5 ( ) -11 (S) -7 (e) -10 (r) -12 (v) -2 (i) -7 (c) -10 (e) -10 (s) -5 ( ) -11 (- ) -11 (C) -4 (o) -14 (n) -2 (t) -7 (r) -12 (a) -10 (c) -10 (t) -7 (o) -14 (r) -12 ( ) -11 (Li) -7 (c) -10 (e) ] TJ ET Q q 1 0 0 1 0 0 cm BT -0.0051 Tc 9.940797 0 0 9.96 242.0404 747.12 Tm /Tc1 1 Tf [ (ns) -2 (e) -15363 (H) -7 (y) 12 (-) 2 (T) 51 (e) -8 (k\') 6 (s) -2 ( ) -9 (M) -9 (E) -10 (E) -10 (T) -10 ( ) -9 (M) -9 (A) 5 (N) -7 (A) 5 (G) -7 (E) -10 (R) -2 ( ) -9 ( ) -9 (P) -17 (a) -8 (ge) -8 ( ) -9 (1) ] TJ ET Q q 1 0 0 1 0 0 cm BT -0.0007 Tc 9.940797 0 0 9.96 200.5204 735 Tm /Tc2 1 Tf [ (SU) -3 (N) -3 (Y) 94 (A) -3 (C) -3 ( ) -4 (O) -7 (ut) -6 (do) -8 (o) -8 (r) 9 ( ) 8 (T) 75 (r) -3 (a) -8 (c) -3 (k) 12 ( ) -4 (&) -1 ( ) -4 (F) -5 (ie) -3 (ld ) -4 (C) -3 (ha) -8 (m) 24 (pio) -8 (ns) 2 (hips) ] TJ ET Q q 1 0 0 1 0 0 cm BT -0.0007 Tc 9.940797 0 0 9.96 221.2804 722.76 Tm /Tc2 1 Tf [ (SU) -3 (N) -3 (Y) 33 ( ) -4 (B) -10 (r) 9 (o) -8 (c) -3 (k) 12 (po) -8 (r) -3 (t) -6 ( ) -4 ( ) -4 (-) -6 ( ) -4 (5) -8 (/1) -8 (/2) -8 (0) -8 (1) -8 (5) -8 ( ) -4 (t) -6 (o) -8 ( ) -4 (5) -8 (/2) -8 (/2) -8 (0) -8 (1) -8 (5) ] TJ ET Q Q q 290.52 720.24 m 321.48 720.24 l 321.48 709.44 l 290.52 709.44 l h W n";

    for caps in re.captures_iter(data) {
        println!("{:?} {:?}", caps.name("re"), caps.name("h"));
    }
}

Gives me:

Some("18 708.96 568.56 0.48 re") None
Some("299.28 36 0.48 673.44 re") None
Some("18 645.12 274.56 0.48 re") None
Some("18 374.16 274.56 0.48 re") None
Some("18 206.64 274.56 0.48 re") None
Some("309.6 530.16 274.56 0.48 re") None
Some("309.6 374.16 274.56 0.48 re") None
None Some("290.52 720.24 m 321.48 720.24 l 321.48 709.44 l 290.52 709.44 l h")
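
To make the whitespace point concrete, here is a small sketch that uses only the standard library, so it says nothing about how the regex crate itself behaves. The two-branch pattern is a hypothetical, shortened stand-in for the one in the question, and the function names are made up for illustration. It shows that a raw string literal keeps the line break and indentation as part of the pattern text, and that concat! is one possible way to split a long pattern across source lines without adding any whitespace.

```rust
/// Hypothetical shortened pattern, written across two source lines inside
/// a single raw string literal. The line break and the indentation are
/// kept verbatim, so they become part of the pattern text.
fn multiline_pattern() -> &'static str {
    r"(?P<re>\d+\sre)|
      (?P<h>\d+\sh)"
}

/// The same two branches as separate raw string literals. concat! joins
/// them at compile time, so nothing ends up between `|` and `(?P<h>`.
fn joined_pattern() -> &'static str {
    concat!(r"(?P<re>\d+\sre)|", r"(?P<h>\d+\sh)")
}

fn main() {
    // Debug formatting makes the embedded newline and spaces visible.
    println!("{:?}", multiline_pattern());
    println!("{:?}", joined_pattern());
}
```

Passing the result of joined_pattern() to Regex::new would be equivalent to writing the whole pattern on one line, as in the program above.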

Hmm. Actually, I realize that the rust-playground crate I use for testing one-off things like this was using an older version of regex. I updated it so I could demonstrate the (?x) flag, which makes whitespace in the pattern insignificant, so you can add extra whitespace to make the regular expression more readable without affecting matching. With the newer version, the regex still matches, but it doesn’t capture the h group properly:

Some("18 708.96 568.56 0.48 re") None
Some("299.28 36 0.48 673.44 re") None
Some("18 645.12 274.56 0.48 re") None
Some("18 374.16 274.56 0.48 re") None
Some("18 206.64 274.56 0.48 re") None
Some("309.6 530.16 274.56 0.48 re") None
Some("309.6 374.16 274.56 0.48 re") None
None None

Bisecting my regex version dependency suggests that this changed between versions 0.1.30 and 0.1.32 (0.1.31 was yanked, so it is no longer available).


#3

Here is a slightly more minimized example that still demonstrates the bug:

    let re = Regex::new(r"(?P<re>(\d{1,}(\.\d{1,})?\s){3}?re)|(?P<h>((\d{1,}(\.\d{1,})?\s){2}?l\s){3}?h)").unwrap();
    let data = r"290.52 720.24 m 321.48 720.24 l 321.48 709.44 l 290.52 709.44 l h";

If you change the {3} to a {2} in the re group, you get a partial capture in the h group, and if you change it to {1} (or remove it entirely), you get the full capture. I would keep minimizing it in order to submit a good bug report, but I don’t have the time right now. This looks like a bug to me, so I’d recommend minimizing it a little further and filing an issue.


#4

Thanks @lambda for narrowing it down. I filed an issue: https://github.com/rust-lang-nursery/regex/issues/129