Cascade of regex replacements


#1

Is the following code the recommended/most efficient way to apply a cascade of regular expression replacements? In particular, should I worry about the calls to .into_owned() (can they lead to allocations which could otherwise be avoided?) or not?

extern crate regex;

use regex::{Regex};

fn main() {
    let re1 = Regex::new("foo").unwrap();
    let re2 = Regex::new("bar").unwrap();
    let re3 = Regex::new("baz").unwrap();

    let mut string = "foo".to_string();
    string = re1.replace(&string, "bar").into_owned();
    string = re2.replace(&string, "baz").into_owned();
    string = re3.replace(&string, "qux").into_owned();

    println!("{}", string);
}

#2

I’d probably use shadowing instead of calling into_owned():

let string = "foo".to_string();
let string = re1.replace(&string, "bar");
let string = re2.replace(&string, "baz");
let string = re3.replace(&string, "qux");

In this particular example, you’re going to get allocations no matter what since there’s replacement in each case - in that case, into_owned() is a nop because the Cow already has an owned string. However, if you had replacements that did not fire, then this would cause unnecessary allocations.


#3

The only way to answer this question is with a benchmark on real data. There are too many variables, such as the nature of your regexes, the number of them and the size of your corpus. For example:

  • If you only need to search and replace literals, you don’t need regexes at all, and can do a single pass over your corpus using aho-corasick's find method. You’ll need to roll your own replace method, but if you look at the implementation in regex, you’ll find that it isn’t that hard and isn’t much code.
  • If you have actual regexes and not just literals, and they all have non-overlapping matches whose replacements don’t influence subsequent replacement behavior, then you can join all of your regexes together, e.g., (re1)|(re2)|(re3)|...|(reN) and do a single replace_all. Each replacement will cost O(n) to discover which group matched.

There maybe other approaches, but if you’re concerned about performance, then you really need to describe what you’re actual workload is.


#4

That’s exactly what I was worried about, thanks for making that clear! Cow is a new beast to me, so I’m still in the process of getting the hang of it :slight_smile: Shadowing seems like the way to go then.

One small follow-up question though, if I may: in the general case, some of these replacements can be conditional. Does the following look about right for achieving that, or is there perhaps some more terse/idiomatic syntax (avoiding the else { string } part)?

let string = "foo".to_string();
let string = re1.replace(&string, "bar");
// insert sensible condition below
let string = if true {
    re2.replace(&string, "baz")
} else {
    string
};
let string = re3.replace(&string, "qux");

#5

Sorry, I should have made it clearer that I’m not interested in performance on a particular data set, but simply in how to make this particular approach as efficient as possible :slight_smile: You’re absolutely right that for some tasks on some types of data, there will be more efficient approaches than cascading regexes.

But sometimes, it’s not worth it to look into these optimizations and it’s quicker to use a regex cascade as a simple, readable and quite general solution that immediately springs to mind. And given this constraint, I’m interested in how to implement it as efficiently as possible (especially avoiding unnecessary allocations), as shown in @vitalyd’s reply .

Thanks for the link and suggestions though, food for thought!


#6

Yeah, it looks right to me.


#7

OK, one last (hopefully) clarification: if I need to put multiple replacements within a conditional scope, I suppose there’s no other way than to declare all the necessary intermediary bindings in the surrounding scope, is there? I.e. this:

extern crate regex;

use regex::{Regex};

fn main() {
    let re1 = Regex::new("foo").unwrap();
    let re2 = Regex::new("bar").unwrap();
    let re3 = Regex::new("baz").unwrap();
    let re4 = Regex::new("qux").unwrap();

    let string = "foo".to_string();
    let string = re1.replace(&string, "bar");
    // if dummy1 and dummy2 are not declared here,
    // they don't live long enough
    let (dummy1, dummy2);
    let string = if true {
        dummy1 = re2.replace(&string, "baz");
        dummy2 = re3.replace(&dummy1, "qux");
        re4.replace(&dummy2, "FOO!")
    } else {
        string
    };

    println!("{}", string);
}

#8

Yeah, that’s right. If dummy1 actually causes a replacement, then it becomes an owned variant; dummy2 may then reference it (if no replacement happens for dummy2). In turn, the re4 replacement may reference dummy2, and you cannot return out of that scope because dummy1 will get dropped while still borrowed.


#9

Great, thanks so much for all your help and advice!