'generic' regex?

Is there an easy way to do regex over arbitrary types, rather than just u8/char ? crates.io: Rust Package Registry shows two libraries, but neither seems to have many downloads.

By generic regex, I mean the following problem. (Here, 'Token' has nothing to do with the `'token' in lexing/parsing).

pub enum Token { ... }

We can now have literals x: Token, a class x: HashSet<Token> or fn(Token) -> bool.

So we can define something like:

pub enum R { // regex over token
  Class(HashSet<Token> or fn(Token) -> bool),
  Seq(Box<R>, Box<R>)

then we can ask questions of the form: given v: Vec<Token> and r: R, does r match v ? (and if so, what are the capture groups?).

Question: is there a library to do this easily ?


1 Like

While I see the advantages of having a generic Regular Expression type, it seems like what you are describing might be more fit for full-on parser. To my knowledge nom parser combinators are general enough to apply to most types.

In theory, I agree with you.

In practice, I do not know what algorithm nom uses, nor the full class of languages it parses, but generally, increased expressivity comes at the price of increased time or memory usage (compared to regex).

If such a crate as you asked for exists, you also wouldn't know the cost tradeoffs or risks without researching deeper[1] -- it differs depending on the regex flavor and features used. Some discussion here and where it links to.

  1. unless they cared deeply enough about the topic to document it like BurntSushi has ↩ī¸Ž

I believe:

  • state of the art for regex is (1) regex -> NFA, (2) don't do NFA -> DFA [can cause exponential blowup], and (3) instead, use a bitvec to track the set of states we are in for NFA, ... then run through the input.

I believe the total running time here is O(M * N) where N = len(input), M = # of NFA states = O(length of regex in strings).

I believe the memory usage is O(M), O(M) for storing the NFA, and just marching through it.

  • In contrast, for something like packrat parsing (I'm not saying nom uses this; but many general parsers do), the space usage ends up being something like O(M * N), due to the need of storing the entire state (so instead of doing back tracking, we an just do dynamic programming / lookup the cell).

My point here is: although I agree with you research is always needed, there is a significant difference between the memory usage of the state of the art for regex vs packrat parsing.

I'm not going to take the time to continue this side-bar beyond this comment, but my experience is the other way around. Most regex engines I've encountered, including the built-in or go-to engines for most programming languages, sport features like backtracking that make exponential blowup possible.

At any rate, "it's regex therefore the costs are self-evident" isn't true without more qualifiers.

  1. You can check whether a NFA (non deterministic finite automata) accepts/rejects a string s WiTHOUT backtracking. This is done by maintaining the set of all possible states the NFA could be while processing s[0 .. i], then moving on to s[0 .. i+1]. See, for example: automata - NFA string acceptance with loop of epsilon transitions - Computer Science Stack Exchange

  2. The Regex template in my original question can be translated into NFA.

  3. I will concede there may be extensions of regex that does require backtracking. However, what I described above: literal, seq, +, *, ?, can be converted to NFA.

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.