Matching a regex at a particular index?

I’d like to get the captures of all named groups at a particular index.

.captures_read_at() is close but:

  • I can’t anchor the match to the start index. That is, if there is no match at the index, the search continues for a match at a higher index.
  • I can’t retrieve named captures.
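
The difference described in the first bullet can be shown with plain substring matching instead of regexes. This is a minimal std-only sketch; `find_from` and `match_at` are hypothetical helper names and not part of any regex crate.

```rust
/// Unanchored search: returns the first occurrence of `needle` at or after
/// byte offset `start`. If nothing matches at `start`, the search moves on.
fn find_from(haystack: &str, needle: &str, start: usize) -> Option<usize> {
    haystack.get(start..)?.find(needle).map(|i| start + i)
}

/// "Sticky" match: succeeds only if `needle` begins exactly at `start`.
fn match_at(haystack: &str, needle: &str, start: usize) -> Option<usize> {
    haystack.get(start..)?.starts_with(needle).then_some(start)
}
```

With the haystack "_x x" and the needle "x", `find_from` at index 2 skips the space and reports index 3, while `match_at` at index 2 reports no match. The second behaviour is what a sticky flag provides.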

Some languages (e.g. JavaScript) support the flag y (or sticky) for this kind of thing.

My use case is tokenization, but with more complex patterns than supported by logos (for some tokens, I want to extract what’s inside them).

Related discussion from 2018. I’m hoping that something has changed since then.

I think Regex::captures_at is what you actually want, at least modulo sticky / end-of-last-match-anchor (\G) / some strictly-consecutive-matches mode.

On anchoring the match to the start index: the only way to do this with the regex crate API is to add ^ to your regex pattern.

On retrieving named captures: you can, it just isn't convenient. You'll need to use Regex::capture_names to build a map from name to index. (This is what Captures does internally.)

It looks like that’s not supported by (e.g.) fancy_regex::Regex.

There, ^ only matches if pos is 0.

I take it back, captures_at still isn't sufficient for all cases.

For a regex that can start with a context-sensitive assertion (like \b), you would need to match \A(?:original_regex) on the substring, and then run the original regex again with captures_read_at to confirm (that second search is not anchored and may run on past the index). Example.

So I think there's currently no efficient way in the most general case without a captures_at_exactly or similar.

If you don't have such leading context sensitivity, I believe you can use the approach of running the anchored version on a substring. (E.g. remove the \b in my playground and the two runs will agree; you can use "naive anchoring".)
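
To see why a leading \b makes the substring approach unreliable, here is a minimal std-only sketch. It uses a hand-written, ASCII-only approximation of the \b check rather than the regex crate; `is_word_byte` and `is_word_boundary` are hypothetical helpers written for illustration.

```rust
/// ASCII approximation of a regex "word" character (letters, digits, underscore).
fn is_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// Approximates `\b` at byte offset `pos`: true when exactly one of the
/// neighbouring bytes is a word byte. The ends of the haystack count as
/// non-word, so the result depends on what precedes `pos` in the haystack.
fn is_word_boundary(haystack: &str, pos: usize) -> bool {
    let bytes = haystack.as_bytes();
    let before = pos
        .checked_sub(1)
        .and_then(|i| bytes.get(i))
        .map_or(false, |&b| is_word_byte(b));
    let after = bytes.get(pos).map_or(false, |&b| is_word_byte(b));
    before != after
}
```

For the text "ab", offset 1 is not a word boundary in the full haystack, but offset 0 of the substring "b" is one, because slicing discards the preceding 'a'. An anchored match on the substring can therefore succeed where a match at that index in the full text should not, which is why the confirming second run is needed.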

...for the Regex crate.


@rauschma, if you're using fancy_regex (which has different maintainers), it looks like they support the \G anchor. I think that will do what you want; at least, sticky looked to me like the same thing as adding \G to the beginning of your regex. (It's not on the playground, so it's not easy for me to check right now.)

I also have no idea how they implemented it (e.g. efficiency-wise).

Brilliant, thanks! That does indeed work:

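use fancy_regex::Regex;

fn main() {
    // \G anchors each match at the position passed to captures_from_pos.
    let re = Regex::new(r"\G((?P<emph>_)|(?P<ex>x))").unwrap();
    let text = "_x x";

    let mut index = 0usize;
    loop {
        match re.captures_from_pos(text, index).expect("regex execution failed") {
            // No token starts exactly at `index`: stop tokenizing.
            None => break,
            Some(captures) => {
                // Group 0 is the whole match and is always present.
                let m = captures.get(0).unwrap();
                println!("TOKEN: {}", m.as_str());
                // Continue right after the token just consumed.
                index = m.end();
            }
        }
    }
}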

Desired output (thanks to \G):

TOKEN: _
TOKEN: x

Output without \G:

TOKEN: _
TOKEN: x
TOKEN: x

Right. See: regex_automata::hybrid::dfa::Config.
