Detect Regex conflict

Hello !

I am trying to make a lexer than can be built at runtime, and I am looking for a specific functionality: the ability to tell if 2 regexes conflict with each other.

I am using regex, and I would like to be able to tell if 2 regexes r1 and r2 can match the same expression. I can't find it in the regex crate, but maybe there is a helper that implements that ?

Is it possible to rephrase your problem at a higher level than this? If so, it might be the case that you don't need to actually determine precise equality.

The conventional way to do this is to compile each regex into a DFA, minimize it and then compare the resulting DFAs themselves for equality. If they compare equal, then you can conclude that they match the same things.

While the regex crate is indeed based on finite automata, it does not currently ever compile a regex to a full DFA due to the practical problems with such things. (It may in the future, but only in a subset of cases in a constrained way that sidesteps the practical issues.) Therefore, the regex crate cannot perform this kind of comparison for you.

The regex-automata does however compile regexes into full DFAs and even provides minimization. It also provides access to the lower level DFA state transitions. It does not implement equality (perhaps it should), but it would in theory be possible to do this yourself. Of course, this comes with all the disadvantages of building full DFAs. (Which are documented extensively in the crate docs.)

2 Likes

I believe they want to ask "is the intersection of these two regular expressions empty?".

Exactly !

I am using the logos crate and I'm very happy with it, but it only allows making lexers at compile time, and I need to build one at runtime (based on some user input).

logos uses a kind of token disambiguation to decide which token to parse next. If this is not enough, it will refuse to compile and the user will be asked to provide custom priorities.
Based on this, I assume it is able to detect when two regexes with the same priority might overlap, so I wanted to do the same thing.

(I could go and try to read logos's code, but I was hoping for an easy answer :stuck_out_tongue:)

For example, [a-zA-Z]+ and [a-z]+ both match "abc" (and have the same priority in logos), so I would like to raise an error.

Not an up-front approach, but could you just defer this decision to lexing time? I.e. try to match the next token with each regex; if more than one of them match, then output an ambiguity error.

That's kind of what I am doing at the moment, but I am not exactly satisfied. First because I think it is much better to be warned ahead of time (a kind of static analysis vs runtime error thing... and I'm here using Rust, so :grin:).
But also I would like to make it easy to switch to logos when possible, and it would be pretty awkward to have the lexer stop building because it wasn't detecting a conflict :slightly_frowning_face:

I think I am close to a solution with regex-automata (using the transducer feature):

use fst::Automaton;
use regex_automata::RegexBuilder;

pub fn regex_overlap(pattern1: &str, pattern2: &str) -> bool {
    let mut builder = RegexBuilder::new();
    builder.minimize(true);

    let re1 = builder.build(pattern1).unwrap();
    let re2 = builder.build(pattern2).unwrap();
    let dfa1 = re1.forward();
    let dfa2 = re2.forward();

    let intersection = dfa1.intersection(dfa2);
    intersection.can_match(&intersection.start())
}

This does not work, but I feel like it could :slightly_smiling_face:

My guess is it has something to do with anchoring the beginning or end. Do you have some test cases?

Looking at the source code, this will always return true unless one of the starting states of dfa1 and dfa2 is a dead state.

IMO the interface exposed by regex_automata is not enough, you need something that exposes the possible transations from a certain state, then use that to build the actual intersection DFA and check if that has any final state.

1 Like

It does expose the possible transitions from a state via the DFA trait: regex_automata::DFA - Rust

The fst::Automaton trait does as well (which is only implemented when the transducer feature is enabled): fst::Automaton - Rust

2 Likes

It does seem to work !

use regex_automata::{RegexBuilder, DFA};
use std::collections::HashSet;

/// Returns `true` if the two regexes overlap
pub fn regex_overlap(pattern1: &str, pattern2: &str) -> bool {
    let mut builder = RegexBuilder::new();
    builder.minimize(true).anchored(true);

    let re1 = builder.build(pattern1).unwrap();
    let re2 = builder.build(pattern2).unwrap();

    let dfa1 = re1.forward();
    let dfa2 = re2.forward();

    let mut visited_states = HashSet::new();
    let mut to_process = vec![(dfa1.start_state(), dfa2.start_state())];
    visited_states.insert((dfa1.start_state(), dfa2.start_state()));
    if dfa1.is_match_state(dfa1.start_state()) && dfa2.is_match_state(dfa2.start_state()) {
        return true;
    }
    while let Some(current_state) = to_process.pop() {
        for input in 0..=u8::MAX {
            let next_state = (
                dfa1.next_state(current_state.0, input),
                dfa2.next_state(current_state.1, input),
            );
            if dfa1.is_match_state(next_state.0) && dfa2.is_match_state(next_state.1) {
                return true;
            }
            if !visited_states.contains(&next_state) {
                visited_states.insert(next_state);
                to_process.push((
                    dfa1.next_state(current_state.0, input),
                    dfa2.next_state(current_state.1, input),
                ));
            }
        }
    }

    false
}

#[test]
fn overlapping_regexes() {
    let pattern1 = r"[a-zA-Z]+";
    let pattern2 = r"[a-z]+";
    assert!(regex_overlap(pattern1, pattern2));
    let pattern1 = r"a*";
    let pattern2 = r"b*";
    assert!(regex_overlap(pattern1, pattern2));
    let pattern1 = r"a*bba+";
    let pattern2 = r"b*aaab+a";
    assert!(regex_overlap(pattern1, pattern2));
    let pattern1 = r" ";
    let pattern2 = r"\s";
    assert!(regex_overlap(pattern1, pattern2));
    let pattern1 = r"[A-Z]+";
    let pattern2 = r"[a-z]+";
    assert!(!regex_overlap(pattern1, pattern2));
    let pattern1 = r"a";
    let pattern2 = r"b";
    assert!(!regex_overlap(pattern1, pattern2));
    let pattern1 = r"a*bba+";
    let pattern2 = r"b*aaabbb+a";
    assert!(!regex_overlap(pattern1, pattern2));
    let pattern1 = r"\s+";
    let pattern2 = r"a+";
    assert!(!regex_overlap(pattern1, pattern2));
}

Unfortunately every possible transition in u8 has to be tested, so it might not always perform well. But that's not that big an issue for me (and well, this test still completes in <0.1s in debug on my machine, so good enough).

1 Like

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.