JavaScript regular expression

I want a regular expression API that is as compliant as possible with JavaScript's RegExp and that implements std::str::pattern::Pattern, with a few allowances:

  • \u{...} without needing to specify u or v (unicodeSets) flags
  • \p{...} without u or v flags

I've read the MDN documentation:

However, I've never used the v flag. Is there source code for a well-working RegExp implementation in any language that I could "easily" port to Rust?

Can you say why you want this? And other than it not being exactly compatible with ecmascript regexes, why doesn't the regex crate work for you?

And also, why doesn't the regress crate work for you?

regex currently lacks some features, which could be added, but I'm not sure regex is implemented in a compliant way. I've used regex a little, including for a replace with a callback that receives the captures.

I got interested in regress, but I discovered it was the author's first "Rust" project and I also saw it doesn't implement Pattern. With Pattern, it could be used with str's native methods...

AFAIK regex uses different syntax for backreferences too. I just want to expose a regular expression API in a framework that is the same as JS (compatible, I mean).

Oops! I meant capture groups in that case:

JS uses (?<Name>x)

Err, I guess I'll really end up using regex and wrapping it in my own type. I don't think JS compatibility matters for my framework after all.

So first of all, regress is certainly not ridiculousfish's first project. I have no idea where you got that from. ridiculousfish is a credible and trustworthy open source contributor.

Secondly, I'm the author of regex. I understand it isn't compatible with ECMAScript. What I'm asking you is why you need compatibility with ECMAScript regexes.

Thirdly, the Pattern trait is nightly only. Implementing it doesn't give you any new expressive power. It's at best a mild convenience. Not using a regex engine and implementing your own just because of the Pattern trait doesn't strike me as the wisest of decisions. :slight_smile:
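To illustrate the "mild convenience" point, here is a minimal sketch on stable Rust. Everything in it is hypothetical: `JsLikeRegex` is a made-up wrapper name, and its "engine" is only a literal substring search standing in for whatever real regex engine would be wrapped. The idea is just that inherent methods such as `re.find(text)` cover what `text.find(re)` would offer through Pattern.

```rust
/// Hypothetical wrapper type. The "engine" here is only a literal
/// substring search, a stand-in for a real regex engine.
struct JsLikeRegex {
    literal: String,
}

impl JsLikeRegex {
    fn new(literal: &str) -> Self {
        // An empty literal would match everywhere without consuming input,
        // which this toy `split` does not handle.
        assert!(!literal.is_empty(), "empty pattern not supported in this sketch");
        Self { literal: literal.to_string() }
    }

    /// Byte range (start, end) of the first match, if any.
    fn find(&self, haystack: &str) -> Option<(usize, usize)> {
        haystack
            .find(self.literal.as_str())
            .map(|start| (start, start + self.literal.len()))
    }

    /// Pieces of `haystack` between matches, similar in spirit to `str::split`.
    fn split<'h>(&self, haystack: &'h str) -> Vec<&'h str> {
        let mut parts = Vec::new();
        let mut rest = haystack;
        while let Some((start, end)) = self.find(rest) {
            parts.push(&rest[..start]);
            rest = &rest[end..];
        }
        parts.push(rest);
        parts
    }
}

fn main() {
    let re = JsLikeRegex::new("--");
    // Called as `re.find(text)` rather than `text.find(&re)`; no nightly needed.
    println!("{:?}", re.find("x--y"));
    println!("{:?}", re.split("a--b--c"));
}
```

The only thing lost compared to a Pattern impl is the argument order at the call site.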

Fourthly, the regex crate supports both (?<name>re) and (?P<name>re) syntaxes.

Right. It probably doesn't. It usually doesn't. Not always. If you don't have a specific need to be ECMAScript compliant, then just use the regex crate.

I saw that, isn't it right?

I see, then that crate was built when Pattern was still unstable? Anyway, one of my framework's crates already builds only on nightly though.

I guess it's their first Rust code, but definitely not first project. It's a high quality crate with a good implementation.

Pattern is still unstable and AFAIK has no path to stability. Again, I'd recommend staying away from it. It will just make code harder to migrate to stable Rust if and when you do that in the future. You really should only be using unstable Rust features if you have an extremely compelling reason to do so. Using the Pattern trait is not a compelling reason.

Many developers work with multiple languages, e.g. a Rust backend with HTML/JS user interface. Having distinct syntaxes in each language can get pretty confusing.

There's pretty much never been one universal Regex syntax or set of capabilities. It's not a JS or Rust or crate specific situation.

Well, then please don't add yet another flavor to the pile. It makes writing code harder.

Nobody here is proposing to add a new flavor.............

Besides, there are only two popular specifications for regex that I'm aware of. POSIX and ECMAScript. Both have significant shortcomings.

PCRE is another widely used standard. Perhaps not officially standardized somewhere, but built into PHP, Perl, Apache, Nginx, R, sed and Grep, a Python library exists, so certainly a de-facto standard.

It's not a standard or specified anywhere. And PCRE and Perl have a large swath of differences, despite the fact that PCRE is called "Perl compatible." Moreover, PCRE, POSIX and ECMAScript all fundamentally share the most important downside: their worst-case search times are exponential because none of them implement regular languages. They all require some feature (like back-references or look-around) that nobody knows how to implement efficiently. (EREs in POSIX don't require back-references, but I believe most implementations provide them, which further shows that specifications are not enough to give you a guarantee of uniformity.)
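To make the exponential worst case concrete, here is a toy sketch, not taken from any real engine. It assumes a tiny made-up pattern language (`Item`, `pathological` and `backtrack` are hypothetical names) and counts the calls a naive backtracking matcher makes on the classic pattern of n optional `a`s followed by n required `a`s, matched against n `a`s.

```rust
// Toy pattern element for this sketch only.
#[derive(Clone, Copy)]
enum Item {
    Lit(char), // matches exactly one occurrence of the char
    Opt(char), // matches the char zero or one time, like `a?`
}

// Builds n optional 'a's followed by n required 'a's, plus a text of n 'a's.
fn pathological(n: usize) -> (Vec<Item>, Vec<char>) {
    let mut items = vec![Item::Opt('a'); n];
    items.extend(vec![Item::Lit('a'); n]);
    (items, vec!['a'; n])
}

// Anchored backtracking match: the whole text must be consumed.
// `steps` counts every call so the growth can be observed.
fn backtrack(items: &[Item], text: &[char], steps: &mut u64) -> bool {
    *steps += 1;
    match items.split_first() {
        None => text.is_empty(),
        Some((Item::Lit(c), rest)) => {
            text.first() == Some(c) && backtrack(rest, &text[1..], steps)
        }
        Some((Item::Opt(c), rest)) => {
            // Greedy: first try to consume the char, then fall back to skipping.
            (text.first() == Some(c) && backtrack(rest, &text[1..], steps))
                || backtrack(rest, text, steps)
        }
    }
}

fn main() {
    for n in [4, 8, 12, 16] {
        let (items, text) = pathological(n);
        let mut steps = 0;
        let matched = backtrack(&items, &text, &mut steps);
        println!("n = {n:2}: matched = {matched}, steps = {steps}");
    }
}
```

The only successful path is the one where every optional item is skipped, and the greedy matcher tries that one last, so the step count roughly doubles each time n grows by one.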

And grep does not have PCRE. It has POSIX regexes. Some implementations of grep have PCRE that you can opt into, such as GNU grep. But GNU grep has three distinct flavors of regexes built into it. (BREs, EREs and PCRE.)

The only reason you see PCRE as a "de facto" standard is because its implementation is re-used in several places. It's not because people have re-implemented it in various places.

None of PCRE, POSIX or ECMAScript provide the requirements necessary to implement regex engines that aren't susceptible to ReDoS. You could implement EREs strictly, but oops, that rules out environments that require UTF-16 such as Java and C# because none of POSIX can support UTF-16.

There is only one reasonable solution at this point: don't assume that all regex flavors are the same. Thankfully, most popular regex flavors share a lot more similarities than differences.

You seem to also be missing my most important point: nobody here has suggested adding a new regex flavor. The only suggestions have been for re-using existing flavors.

$ man grep | grep -A2 -- '-P,'
       -P, --perl-regexp
              Interpret PATTERNS as Perl-compatible regular expressions (PCREs). [...]

Obviously, this qualifies as "built into grep".

In the open source world, an implementation is even better than a committee-standard. Source code means there is no need to re-implement anything. Just link and enjoy proven code with guaranteed compatibility. Behavior is clearly defined, it's well documented, widely adopted, developers know it, developers use it, great!

Regarding the overall post: that's exactly what I feared. First you say you don't want to invent yet another flavor, then you find lots of excuses why every existing flavor would be flawed. Only possible outcome is, well, another flavor.

Maybe you could revive one of the abandoned PCRE wrappers, or create and maintain a new one.

You quoted me out of context. I also said, "Some implementations of grep have PCRE that you can opt into, such as GNU grep."

ReDoS isn't an "excuse." And you're wildly misreading what I said. I didn't say "every existing flavor was flawed." I said that PCRE and the only two popular specifications for regex (POSIX and ECMAScript) had fundamental flaws.

The RE2 flavor was released in 2010. The regex crate and the standard library regexp package in Go both closely adhere to the syntax of RE2. But just like Perl and PCRE aren't the same, RE2 and the regex crate aren't the same either. The RE2 flavor does not require features that nobody knows how to implement efficiently, and thus, it is a good mitigation against ReDoS.
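As a rough illustration of why a flavor without such features can avoid the blow-up, here is a toy sketch of a state-set simulation. This is not how RE2 or the regex crate are actually implemented; it assumes a tiny made-up pattern language (`Item`, `pathological`, `close` and `simulate` are hypothetical names) containing only literals and optional literals, and it tracks all reachable pattern positions at once instead of backtracking.

```rust
// Toy pattern element for this sketch only.
#[derive(Clone, Copy)]
enum Item {
    Lit(char), // matches exactly one occurrence of the char
    Opt(char), // matches the char zero or one time, like `a?`
}

// Builds n optional 'a's followed by n required 'a's, plus a text of n 'a's.
fn pathological(n: usize) -> (Vec<Item>, Vec<char>) {
    let mut items = vec![Item::Opt('a'); n];
    items.extend(vec![Item::Lit('a'); n]);
    (items, vec!['a'; n])
}

// An optional item may be skipped without consuming input. Skips only move
// forward by one position, so a single left-to-right pass is enough.
fn close(items: &[Item], set: &mut [bool], steps: &mut u64) {
    for (i, item) in items.iter().enumerate() {
        *steps += 1;
        if set[i] && matches!(item, Item::Opt(_)) {
            set[i + 1] = true;
        }
    }
}

// Anchored match. `current[i]` means "items[..i] can match the text read so far".
fn simulate(items: &[Item], text: &[char], steps: &mut u64) -> bool {
    let mut current = vec![false; items.len() + 1];
    current[0] = true;
    close(items, &mut current, steps);
    for &ch in text {
        let mut next = vec![false; items.len() + 1];
        for (i, item) in items.iter().enumerate() {
            *steps += 1;
            if !current[i] {
                continue;
            }
            let c = match *item {
                Item::Lit(c) | Item::Opt(c) => c,
            };
            if c == ch {
                next[i + 1] = true;
            }
        }
        close(items, &mut next, steps);
        current = next;
    }
    current[items.len()]
}

fn main() {
    for n in [4, 8, 12, 16] {
        let (items, text) = pathological(n);
        let mut steps = 0;
        let matched = simulate(&items, &text, &mut steps);
        println!("n = {n:2}: matched = {matched}, steps = {steps}");
    }
}
```

Each input character costs work proportional to the pattern size, so the total is bounded by pattern length times text length no matter how the optional items line up. Back-references do not fit into this kind of simulation, which is the trade-off the flavor makes.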

I personally feel like this conversation with you has no back-and-forth at all, so I'm going to end it here and mute this thread.

Once again, nobody here has suggested adding a new flavor.

I maintain the pcre2 crate wrapper for use inside of ripgrep.

I just wrote an RE parser in Rust, though it was an exercise and has never been used by anyone but me. It works, and the frontend/backend design allows adding different pattern engines (it currently has one that is pretty standard Emacs regexp and one that is a new design that seems easier to me). It's at GitHub - russellyoung/regexp-rust: Second Rust project: regular expression searcher. The source code is all there, though as a first (well, second) Rust project it could probably be improved.
