String.replace performance


#1

Hello rustaceans,

I’m playing with Rust to optimize bits of Racket code that are too slow; in particular, I’m dealing with multiple search/replace operations on text. Since Racket’s performance on this particular operation is very low, I have written a simple FFI wrapper that performs the search/replace in Rust.

And indeed, it is much faster, dropping from around 140ms to 11ms using Rust (on a 2.5GHz i7 MB Pro). However, I remain curious: the code performs 161 search/replace operations on a rather short text fragment (110KB), and I’m surprised it is that slow (I was naively expecting something below 1ms). I have also written a version using Clojure and Apache Commons Lang’s StringUtils.replaceEachRepeatedly, but so far it either indicates that I’m very, very bad at Clojure, or that it is much slower (500ms per iteration).

I have uploaded everything needed to test the code on GitHub ( https://github.com/octplane/replace-rust-benchmark ) and would appreciate some feedback from the community on:

  • my naivety around code speed (and probably the fact that replacing text is much harder to do than I thought)
  • the possibility of optimising the code even more.
  • anything enlightening about this.

Thank you for this,
Oct.


#2

You’re making 161 calls to String.replace. Each call has to scan the entire 110KB string for matches, in addition to allocating and building a new result string. That means Rust is processing your text at 161*110KB/11ms = 1.6GB/s, which seems reasonable.
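A quick check of that arithmetic (treating KB as 1024 bytes; the exact figure depends on that convention):

```rust
fn main() {
    let bytes_scanned: u64 = 161 * 110 * 1024; // 161 full passes over a 110 KB string
    let seconds = 0.011;                       // 11 ms
    let gb_per_s = bytes_scanned as f64 / seconds / 1e9;
    println!("{:.2} GB/s", gb_per_s);          // roughly 1.6 GB/s
}
```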

If 11ms is too slow, you’ll want to implement your code so it only makes one pass over the input string, and only builds one output string. You can use Rust’s regex crate to make a regex which matches any one of your search terms, then call its replace_all method with a replacer that looks up the correct substitution for each term you find. This should be much faster, assuming there aren’t a huge number of matches. Note that this could return different results from your current code if matches ever overlap or interact with each other.
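The one-pass idea can be sketched without any crates. This is a deliberately naive std-only illustration (the helper name and the per-byte pattern probe are mine, not from the thread); the regex crate’s replace_all would do the scanning far more efficiently:

```rust
// Naive single-pass multi-pattern replace: one scan over the input,
// one output String. At each position, the first pattern that matches
// wins, which is one way the result can differ from 161 sequential
// String::replace calls when patterns overlap or interact.
fn replace_all_once(pairs: &[(&str, &str)], input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut i = 0;
    'outer: while i < input.len() {
        for &(from, to) in pairs {
            if input[i..].starts_with(from) {
                out.push_str(to);
                i += from.len();
                continue 'outer;
            }
        }
        // No pattern matched here: copy one character through.
        let ch = input[i..].chars().next().unwrap();
        out.push(ch);
        i += ch.len_utf8();
    }
    out
}

fn main() {
    let pairs = [("cat", "dog"), ("blue", "red")];
    println!("{}", replace_all_once(&pairs, "a blue cat")); // a red dog
}
```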

(Edit: use Regex.replace_all, not Regex.replace.)


#3

I’d second @yotann’s recommendation.

If your query strings are just literals, then you might consider circumventing regexes altogether and using the aho-corasick crate. Aho-Corasick will give you more control. (I can elaborate on this if you like; there are several trade-offs at play, and they are at least partially related to the number of query strings you have.) Also note that the aho-corasick crate doesn’t have any replace functionality built into it, but you could write one yourself relatively easily. Finally, aho-corasick is subject to the same overlapping caveat that @yotann mentioned for regexes. It does have a find_overlapping search, but squaring that with a single-pass replacement is a bit tricky and probably depends on the specific constraints of your problem.


#4

Thanks for your suggestions. Indeed, I never thought to compute the actual bandwidth involved. I’ll experiment with both methods, as the regexp idea had already floated through my mind. I’m not sure my patterns actually overlap. They shouldn’t much, and if they do, I should be able to find a workaround.

Thanks for pointing me to the aho-corasick crate! I’ll definitely read more about it.


#5

Hello!

I’ve decided to test the aho-corasick crate, with great results. Because I’m not using any regexps, I figured this would probably be more elegant than building a giant piped regexp.

running 3 tests
test aho_corasick_replacer ... bench:   1,831,796 ns/iter (+/- 316,943)
test classic_replacer      ... bench:   9,134,042 ns/iter (+/- 1,280,736)
test corasick_finder       ... bench:           1 ns/iter (+/- 0)

FWIW, aho-corasick gives the same result on my test case, which means I have no overlapping replacements.

And finally, for the most curious ones:

fn creplace_(keys: &Vec<&str>, values: &Vec<&str>, target: String) -> String {
    let aut = AcAutomaton::new(keys);
    let mut target_end = 0;
    let mut o: Vec<String> = vec![];

    // Walk the non-overlapping matches in order, copying the text
    // between matches and the substitution for each match.
    for x in aut.find(&target) {
        o.push(target[target_end..x.start].into());
        o.push(values[x.pati].into());
        target_end = x.end;
    }
    o.push(target[target_end..].into());
    o.join("")
}

Can we make this code even faster? Probably by giving up some copies. My rust is rusty so I’ll probably keep this code unless some smart fellow has a proposition.

Rest of the code and benchmarkable code lives at https://github.com/octplane/replace-rust-benchmark .

Thanks for your assistance!
Oct.


#6

If search time dwarfs automaton construction, you can probably get a nice speed boost, at the cost of memory usage, by using let aut = AcAutomaton::new(keys).into_full().

Other than that, the big change I’d make is to push into one giant String instead of a Vec<String>. That will avoid needing to join at the end and will also avoid the intermediate String copies. For example, you can replace o.push(values[x.pati].into()) with o.push_str(&values[x.pati]). Similarly for the other o.push calls.
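The shape of that change can be sketched like this. The match list here is a hand-rolled stand-in for the automaton’s output, and the helper name is hypothetical; in the real code the (start, end, pattern index) triples would come from aut.find(&target):

```rust
// Accumulate directly into one String instead of a Vec<String>:
// no intermediate String allocations and no final join.
fn creplace_sketch(values: &[&str], target: &str, matches: &[(usize, usize, usize)]) -> String {
    let mut out = String::with_capacity(target.len());
    let mut target_end = 0;
    for &(start, end, pati) in matches {
        out.push_str(&target[target_end..start]); // text before the match
        out.push_str(values[pati]);               // the substitution
        target_end = end;
    }
    out.push_str(&target[target_end..]);          // trailing text
    out
}

fn main() {
    // One match: "cat" at bytes 2..5, replaced by values[0].
    println!("{}", creplace_sketch(&["dog"], "a cat!", &[(2, 5, 0)])); // a dog!
}
```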


#7

Also, you probably want to take target: &str instead of target: String. (For example, right now, your benchmark includes the cost of an additional copy of the entire target string.)
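A minimal illustration of the difference (the helper name is made up): taking &str borrows the caller’s buffer, whereas taking String forces the caller to hand over, or clone, the whole target.

```rust
// Borrowing the target: no copy of the string contents is made,
// and the caller keeps ownership of its buffer.
fn process(target: &str) -> usize {
    target.len()
}

fn main() {
    let target = String::from("the 110KB text lives here");
    let n = process(&target); // cheap borrow, not a copy
    println!("{} bytes, target still usable: {}", n, target.len());
}
```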


#8

Thanks for all the ideas:

  • search time does not dwarf automaton construction, unless I keep extra memory around to store the automaton between FFI calls.
  • refactoring to use one large String instead of a Vec, and taking an &str, does not change much:
test benchs::aho_corasick_replacer ... bench:   1,810,289 ns/iter (+/- 389,654)

The benchmark runs too fast to be precise enough. However, the code is more elegant and has been integrated with its Racket parent server :slight_smile:

[RUST] Prepared data in 0ms <-- this is the measured time inside the Rust C interop code.
[Fri, 7 Oct 2016 22:45:02 +0200] mutation took 2.14013671875ms  <-- this is the time measured by Racket outside the FFI.

Here we see the FFI cost that is now emerging from the Racket call. Before I started this optimization, the whole thing was taking around 160ms.

thanks!