Fast removing chars from string

#1
  1. My current code looks something like this:

        let s = s.replace("(", "");
        let s = s.replace(")", "");
        let s = s.replace(",", "");
        let s = s.replace("\"", "");
        let s = s.replace(".", "");
        let s = s.replace(";", "");
        let s = s.replace(":", "");
        let s = s.replace("\'", "");
  1. This code is executed for every word/token in a 30GB archive. As a result, speed is important. What is the fastest way (without unreasonable code complexity) to do this?
#2
let s = s.replace(&['(', ')', ',', '\"', '.', ';', ':', '\''][..], "");
4 Likes
#3

@DanielKeep beat me to it, but here’s the link to why


Explanation:
str (and T where T: Deref<str>) has a function named replace (Which you have used already) and its argument for the pattern to look for it is P: Pattern<'a>, and looking at what it’s been implemented for, it seems that &'a [char] would fit the problem at hand. (Look at the description)

2 Likes
#4

(By photo-finish and an untrimmed beard.)

1 Like
#5

How often do these occur in the string? If it’s a lot (many replacements per string -> many moves to shuffle all the later chars backwards), consider something like string.chars().filter(..).collect(). Benchmark, but you might find this one-pass model is worth considering.

Personally, I’d just use Regex until it became more than two problems

[Edit: or even better, :arrow_down:]

2 Likes
#6

You could do it in place (no allocation) like:

s.retain(|c| !r#"(),".;:'"#.contains(c));
7 Likes
#7

And effectively @cuviper’s method is the fastest, almost twice as fast in a crude benchmark.
Playground

2 Likes
#8

That is true for Debug builds, if you try Release the replace() method seems to be fastest (however the difference between them in Release is much much smaller)

Debug:

Replace 593.80961ms
Filter 540.27808ms
Retain 238.693095ms

Release:

Replace 4.68341ms
Filter 6.784266ms
Retain 16.817564ms
5 Likes
#9

Huh. Cool mystery

The in-place version is dirtying the same pages its reading from, so cache contention mumble mumble?

1 Like
#10

Issue resolved. Thanks @DanielKeep , @OptimisticPeach , @dcarosone , @cuviper , @vishusandy for the comprehensive + exhaustive suggestions!

#11

What is retain() doing under the hood that is different from chars().filter().collect()? I feel like the retain src code makes sense but can’t figure out what filter does under the surface.

#12

There’s probably some overhead moving between char and UTF-8 bytes to scan my string. You can change that to the same predicate as the filter example:

data.retain(|x| {!['(', ')', ',', '\"', '.', ';', ':', '\''].contains(&x)});

It’s still a little slower for some reason, but closer:

Replace 6.274342ms
Filter 7.116179ms
Retain 8.411801ms

edit: the benchmark is probably too crude. I just ran it again and got:

4.449713ms
6.453952ms
5.984093ms
5 Likes