Rust regex replace_all slower than PHP regex preg_replace_callback, how to optimize?

https://stackoverflow.com/questions/58580753/rust-regex-replace-all-slower-than-php-regex-preg-replace-callback-how-to-optim

PHP 5.5 times faster than Rust in the example below.

Am I doing something fundamentally wrong?

To me it seems that the regex engine in Rust is simply slower than it is in PHP.

PHP code:

$html = file_get_contents('/path_to/test.html');

global $c_id;
$c_id = 0;

echo 'len with comments: ', strlen($html), "\n";

$time_start = microtime(true);

$html = preg_replace_callback('/<!--(.*?)-->/s', function($matches) {
    global $c_id;
    $c_id++;
    return str_replace($matches[1], $c_id, $matches[0]);
}, $html);

echo (microtime(true) - $time_start), " seconds for removing comments.\n";

echo 'len without comments: ', strlen($html), "\n";

Rust code:

use regex::Regex;
use std::io::prelude::*;
use std::fs::File;

fn main() {
    let mut file = File::open("/path_to/test.html").expect("Unable to open the file");
    let mut html = String::new();
    file.read_to_string(&mut html).expect("Unable to read the file");
    let mut c_id: usize = 0;

    println!("len with comments: {}", html.len());

    let start = PreciseTime::now();

    let re = Regex::new(r"(?s)<!--(.*?)-->").unwrap();
    html = re.replace_all(&html, |captures: &regex::Captures| {
        c_id += 1;
        captures[0].replace(&captures[1], &c_id.to_string())
    }).to_string();

    println!("{} seconds for removing comments.", start.to(PreciseTime::now()));

    println!("len without comments: {}", html.len());
}

Rust dependencies:

regex = "1"
time = "*"

Results

PHP:

len with comments: 76968
0.00019717216491699 seconds for removing comments.
len without comments: 76622

Rust:

len with comments: 76968
PT0.001093365S seconds for removing comments.
len without comments: 76622

Thanks!

1 Like

Building a regex is an expensive operation that you should do once. You can then use that regex as many times as you need, and it will likely outperform php in replacing all the captures.

So measure the replace_all called many times using the same regex. (On possibly different inputs)

1 Like

Try with regex::bytes
to_string() to into_owned()
A larger sample since your times are low.
Make sure the php code isn't using prebuilt regex (no idea how) or take it out from rust time.

1 Like

One more thing, php doesn't try and make the fastest regex matcher without the study flag, so that it can be used in one off cases like this wothout giving up performance, because analyzing a regex is expensive.

Rust does do this analysis to try and make the most efficient regex possible.

Also, preg* caches all regexes that it uses in a thread local, so calling it multiple times won't recompile the regex.

A more fair benchmark would run replace_all a number of times to amortize the cost of compiling the regex, or just time replace_all by compiling the regex beforehand.

1 Like

What I suspect is that in this case PHP is "interpreting" the regex whereas Rust is "compiling" it, and the one-time overhead of initializing/compiling the regex in Rust makes it slower than the PHP interpretation when only measuring once for the entire process.

Rule 1 of microbenchmarking: you aren't measuring what you think you're measuring.

If you really only want to do it once and want Rust to win this microbenchmark is to use regex_atomata to precompile the regex before running the binary. (Disclaimer: may not be perfectly translatable as regex has some additional non-regex optimizations like literal optimizations.)

But what you really need is to not parse html with regex. You may think that this removes all HTML comments, but you'd be wrong. The only way to truly know how a browser parses a potentially adversarial snippet is to actually give it to the browser.

7 Likes

I have measured the regex compiling and it takes very little time. If I start the timer after initializing the regex the result is the following:

PT0.000794847S seconds for removing comments.
compared to
PT0.001093365S seconds for removing comments.

not bad, but still very slow.

Thanks for your reply. I'm very new to Rust, can you share a piece of code to do replacements using regex_automata?

As the performance guide states, asking for capturing groups uses the slowest code in the regex engine: regex/PERFORMANCE.md at master · rust-lang/regex · GitHub

There really isn't much you can do here other than to stop using capturing groups. In this case, you can actually do that because the beginning and end of every match have a fixed size.

I have ideas on how to make capturing groups faster, but it isn't happening any time soon. If you're desperate, then use the pcre2 crate.

6 Likes

Thanks Andrew for your input!

I have noticed that compiling the regex takes about the same time as PHP to complete the task. So even speeding up captures is not going to come close to PHP results. I'm new to Rust and I was surprised that in Rust the regexes are compiled during run-time. Are there any solutions to pre-compile them?

Thanks! I have tried using regex::bytes and I can see some improvements (from 0.001093365 down to 0.000670200). However after measuring the regex compilation during the run-time I have found out that it takes about the same time as PHP to complete the task.

PCRE compiles regexes at runtime too, so I'm not sure why you're surprised that Rust's regex library would do the same.

Are there any solutions to pre-compile them?

No, unless your want to use regex-automata like another poster suggested. You could try prefixing your regex with (?-u) to actually make your regexes equivalent. (Rust's regexes are Unicode aware by default. PCRE is not.) But honestly, at those times, we are talking about pretty small potatoes. You haven't shared what problem you're trying to solve yet, so it's not even clear whether regex compilation is an important factor to focus on. There may be a 4-5x difference in compiles times here, but the absolute difference is overall pretty small, unless you're compiling a lot of regexes in a loop. But your example code has a static regex, so such a thing is not necessary.

Also, it seems like you ignored my suggestion to avoid using capturing groups at all. Why is that?

4 Likes

I have seen several people benchmarking rust forget to compile in release mode. Did you compile in release mode?

2 Likes

Sorry Andrew for not responding to that part. This regex is just a demonstration. I'm not using it in a loop. I have also tried with regex::bytes and it reduced the time, but still it doesn't compete with PHP yet which indeed uses pcre2 with jit. I'll try to do the same with pcre2 in rust, but the interface to it is really low level, it seems I have to read the PHP source codes to see how it uses pcre2 library functions and do the same in rust. If you have some code snipped, I'll be happy if you share it.

My understanding is that compiling a regex results in an object in memory which is later being used by regex functions. So why the compiler cannot keep a serialized version of that object and use it later on run-time. Please correct me if I'm wrong.

Thanks!

Yes. Thanks!

1 Like

There used to be a macro for that, but it was nightly only and it was removed.

The interface is pretty much identical to what's in regex. It even has an explicit option for enabling the JIT: pcre2::bytes::RegexBuilder - Rust

The only thing that's really missing from the pcre2 crate is a replace_all method. You should be able to write that yourself.

PCRE2 does the same thing.

Because it's hard and I have a finite number of hours to dedicate to open source work.

The absolute execution time difference in your initial post is less than 1 millisecond. If you aren't doing this in a tight loop, then this difference shouldn't matter. But you haven't said what the actual problem you're trying to solve is, so it's hard to say with any certainty.

4 Likes

I appreciate your input. I'll try to use pcre2 and share the results.

Basically I'm running 100s of different regexes and doing replacements on captures and this is just an example. I'm learning Rust by converting a PHP code into Rust and of course I want the end result to be very fast (at least 10x).

Just remember that running a regex isn't PHP, you are calling out to PCRE2, which is a C library JIT Regex Engine, that's not even remotely in the world of PHP. ^.^

But still, rust can do it all in-rust with some changes, or just calling out to the pcre2 crate to use pcre2 itself would be fine as well.

3 Likes

Correct, but this should be at least at the same speed and not slower. I presumed earlier that it should be faster, because Rust should do the regex compilation during compile time and not during run-time like PHP does. Turns out it is not the case currently and I'll be happy to get the same speed level. I'm trying now with pcre2 crate. Thanks!

1 Like

I have new results with pcre2::bytes::Regex.

Sorry if I have used novice skills in my code and it is not so pleasant to read.

use time::PreciseTime;
use pcre2::bytes::RegexBuilder;
use std::io::prelude::*;
use std::fs::File;

fn main() {
    let mut file = File::open("/path_to/test.html").expect("Unable to open the file");
    let mut html = String::new();
    file.read_to_string(&mut html).expect("Unable to read the file");
    let mut c_id: usize = 0;

    println!("len with comments: {}", html.len());

    let start = PreciseTime::now();

    let re = RegexBuilder::new().dotall(true).build(r"<!--(.*?)-->").unwrap();

    //println!("{} seconds for compiling regex.", start.to(PreciseTime::now()));

    let html_bytes = html.into_bytes();

    let mut it = re.captures_iter(&html_bytes).peekable();
    if it.peek().is_some() {
        let mut new:Vec<u8> = Vec::with_capacity(html_bytes.len());
        let mut last_match = 0;
        for cap_res in it {
            match cap_res {
                Ok(cap) => {
                    let m = cap.get(0).unwrap();
                    //println!("m: {} -> {}", m.start(), m.end());
                    new.extend_from_slice(&html_bytes[last_match..m.start()]);

                    c_id += 1;
                    //let new_slice = ??? replace cap[1] with c_id in cap[0] ???
                    let new_slice = format!("<!--{}-->", c_id).into_bytes();

                    new.extend_from_slice(&new_slice);
                    last_match = m.end();
                },
                Err(_) => {}
            }
        }
        new.extend_from_slice(&html_bytes[last_match..]);

        let html = String::from_utf8(new).unwrap();
        println!("{} seconds for removing comments.", start.to(PreciseTime::now()));
        println!("len without comments: {}", html.len());
    }
}

Result:

len with comments: 76968
PT0.000484981S seconds for removing comments.
len without comments: 76622

EDIT: Ah, forgot to enable JIT.

let re = RegexBuilder::new().dotall(true).jit(true).build(r"<!--(.*?)-->").unwrap();

Now I finally get the same speed:

Result:

len with comments: 76968
PT0.000193024S seconds for removing comments.
len without comments: 76622

I appreciate your time.

EDIT2: Interesting observation:

PT0.000047746S seconds for compiling regex without jit.

PT0.000102026S seconds for compiling regex with jit.

In theory if the compilation time can be removed from run-time of Rust it can result to 0.000090998 sec, which will be 2x faster than in PHP.

2 Likes