Improving the performance of this splitting code

I'm porting an application from Python to Rust. It had this regex in it: ([{}\[\]<>|=&'#*;:/\\\"\-!\n]). Hand-rolled, it converts to the following Rust code:

pub fn split_wikitext(text: &str) -> Vec<&str> {
    let mut pieces = Vec::new();
    let mut last_end = 0;
    const SPLITS: [char; 20] = [
        '{', '}', '[', ']', '<', '>', '|', '=', '&', '\'', '#', '*', ';', ':',
        '/', '\\', '"', '-', '!', '\n',
    ];
    for (i, ch) in text.char_indices() {
        if SPLITS.contains(&ch) {
            if last_end != i {
                pieces.push(&text[last_end..i]);
            }
            let ch_end = i + ch.len_utf8();
            pieces.push(&text[i..ch_end]);
            last_end = ch_end;
        }
    }
    if last_end < text.len() {
        pieces.push(&text[last_end..]);
    }
    pieces
}

Python has slightly different regex capture semantics than the regex crate, and I don't care whether empty strings appear in the output Vec, although I'd prefer not to need an extra pass to eliminate them. Profiling shows that this function is one of the hotspots in my code (216 samples, about 10% of execution time, with 11 samples spent in RawVec::grow_one, probably from all the pushing), mainly because I'm throwing massive strings at it and it has to go through every character one at a time.

So I'm wondering: is there a more optimal way to do this?

First, I don’t think the last_end..i case is needed; last_end..ch_end would cover both the last_end != i and last_end == i cases. That could at least reduce the amount of pushing, and maybe having one fewer branch in the loop would help. Edit: ugh, never mind, I’m being silly. The entire point is to split; for some reason I was thinking of String::push_str and not Vec::push.

One optimization I can think of is using CharIndices::offset instead of char::len_utf8. Additionally, I wondered if for_each would be better than repeatedly calling next, but CharIndices uses the default implementation of Iterator::for_each, so there’s probably no difference.
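
A sketch of the offset-based loop, in case it's unclear (same behavior as the original; only the end-index computation changes):

pub fn split_wikitext_offset(text: &str) -> Vec<&str> {
    const SPLITS: [char; 20] = [
        '{', '}', '[', ']', '<', '>', '|', '=', '&', '\'', '#', '*', ';', ':',
        '/', '\\', '"', '-', '!', '\n',
    ];
    let mut pieces = Vec::new();
    let mut last_end = 0;
    let mut iter = text.char_indices();
    while let Some((i, ch)) = iter.next() {
        if SPLITS.contains(&ch) {
            if last_end != i {
                pieces.push(&text[last_end..i]);
            }
            // After next() yields (i, ch), offset() is the byte index just
            // past ch, i.e. the same value as i + ch.len_utf8().
            let ch_end = iter.offset();
            pieces.push(&text[i..ch_end]);
            last_end = ch_end;
        }
    }
    if last_end < text.len() {
        pieces.push(&text[last_end..]);
    }
    pieces
}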

Are you not allowed to use a function from the standard library? This is typically covered by std. For Rust, the following function seems close to what you need:

str - Rust

I just tried it out in the Rust Playground. The behavior is different. In particular, OP wants to keep the separators.

From the provided examples it seems at least similar, and when utmost performance is required, it might make sense to study the source code of the std implementation. But of course finding a good solution ourselves is also an interesting exercise. And comparing performance against a regex-crate solution might be interesting too.

My first idea for optimizations was to avoid the linear search for split characters in

if SPLITS.contains(&ch) {

In Nim one would typically use a hash set for this, and Claude actually also recommends optimizing away this linear search; it suggests using matches!(ch, …), which might be fast.
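
Something like this, presumably (a sketch; the helper name is mine):

// The matches! macro expands to a single match over the listed characters,
// which the compiler can lower to range checks or a jump table.
fn is_split_char(ch: char) -> bool {
    matches!(
        ch,
        '{' | '}' | '[' | ']' | '<' | '>' | '|' | '=' | '&' | '\'' | '#' | '*'
            | ';' | ':' | '/' | '\\' | '"' | '-' | '!' | '\n'
    )
}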

Yes. This is part of a tokenizer, and those separators are needed.

And comparing performance to a solution with regex lib might be also interesting.

The current solution is already an order of magnitude faster than regex, so it's quite an improvement.
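
For context, a regex-crate version that keeps the separators would look roughly like this (a reconstruction for comparison, not my exact code):

use regex::Regex;

pub fn split_wikitext_regex(text: &str) -> Vec<&str> {
    // Same character class as the original Python regex.
    let re = Regex::new(r#"[{}\[\]<>|=&'#*;:/\\"\-!\n]"#).unwrap();
    let mut pieces = Vec::new();
    let mut last_end = 0;
    for m in re.find_iter(text) {
        if last_end != m.start() {
            pieces.push(&text[last_end..m.start()]);
        }
        pieces.push(m.as_str()); // keep the separator itself
        last_end = m.end();
    }
    if last_end < text.len() {
        pieces.push(&text[last_end..]);
    }
    pieces
}

In real code you'd compile the Regex once (e.g. in a std::sync::LazyLock) rather than on every call.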

For whatever it’s worth, in Rust I’d prefer a Token enum over bare string slices. If the tokenizer parses things into an enum anyway, it would likely be easier to match the separators immediately.
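
Roughly this shape (a sketch; the variant names are made up, and is_split_char is the helper sketched above):

#[derive(Debug)]
enum Token<'a> {
    Separator(char), // one of the 20 split characters
    Text(&'a str),   // a run of non-separator text
}

fn tokenize(text: &str) -> Vec<Token<'_>> {
    // Same scanning loop as split_wikitext, but producing tokens directly.
    let mut tokens = Vec::new();
    let mut last_end = 0;
    for (i, ch) in text.char_indices() {
        if is_split_char(ch) {
            if last_end != i {
                tokens.push(Token::Text(&text[last_end..i]));
            }
            tokens.push(Token::Separator(ch));
            last_end = i + ch.len_utf8();
        }
    }
    if last_end < text.len() {
        tokens.push(Token::Text(&text[last_end..]));
    }
    tokens
}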

Source: wikitext · wikicode-parser · GalStar / mwbot · GitLab

Sadly I didn't have enough foresight to use an enum from the start, so instead I use an enum plus a weird dynamically typed hashmap. For some reason the performance hit from that isn't terrible. But any given separator can refer to up to 3 or 4 different TokenTypes, so I can't really resolve them immediately.

That is interesting. I remember a different case a few months ago where I guessed that a regex solution would be slow, and it was actually close to hand-written code.

So I would suggest trying the matches!() solution instead of linearly searching whether a char is a split character. Or trying the other solution suggested by Claude, using the following approach:

// Alternative: If you need maximum performance and only ASCII split chars
pub fn split_wikitext_fast(text: &str) -> Vec<&str> {
    // Lookup table for ASCII characters (0-127)
    static LOOKUP: [bool; 128] = {
        let mut table = [false; 128];
        let splits = b"{}[]<>|=&'#*;:/\\\"-!\n"; // the 20 split bytes
        let mut i = 0;
        while i < splits.len() {
            table[splits[i] as usize] = true;
            i += 1;
        }
        table
    };
    // ...same loop as split_wikitext, testing `ch.is_ascii() && LOOKUP[ch as usize]`

Sorry, I have no idea how this actually compares to matches!().

Here are a couple of split_inclusive-based alternatives, which I did not benchmark.
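
For illustration, one way such an alternative might look (an untested sketch, reusing the is_split_char helper from above):

pub fn split_wikitext_inclusive(text: &str) -> Vec<&str> {
    let mut pieces = Vec::new();
    // split_inclusive keeps each matched separator at the end of its chunk.
    for chunk in text.split_inclusive(is_split_char) {
        // Chunks from split_inclusive are never empty.
        let last = chunk.chars().next_back().unwrap();
        if is_split_char(last) {
            let cut = chunk.len() - last.len_utf8();
            if cut > 0 {
                pieces.push(&chunk[..cut]); // text before the separator
            }
            pieces.push(&chunk[cut..]); // the separator itself
        } else {
            // Only the final chunk can lack a trailing separator.
            pieces.push(chunk);
        }
    }
    pieces
}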

One very easy change you can make here is to pre-reserve capacity in the Vec based on the approximate number of entries you expect.

let mut pieces = Vec::with_capacity(text.len() / 2);

When growing a Vec one element at a time, the storage will periodically need to be re-allocated and the existing content copied into the new storage. Reserving capacity up-front avoids or reduces this.

In a quick benchmark using the large.txt file this reduced the time to parse this file 100 times from ~240ms to ~200ms.

Since all the split characters are ASCII, ChatGPT suggested a trick (with some corrections from me): form a u128 mask with the bit positions of all split characters set to 1, then shift the mask right by the actual character value so that the corresponding bit lands at the LSB, where it can be tested. This further reduced the runtime to ~170ms.

fn is_split_byte(b: u8) -> bool {
    // Bits 0..=127 correspond to ASCII bytes.
    const MASK: u128 = (1u128 << b'{')
        | (1u128 << b'}')
        | (1u128 << b'[')
        | (1u128 << b']')
        | (1u128 << b'<')
        | (1u128 << b'>')
        | (1u128 << b'|')
        | (1u128 << b'=')
        | (1u128 << b'&')
        | (1u128 << b'\'')
        | (1u128 << b'#')
        | (1u128 << b'*')
        | (1u128 << b';')
        | (1u128 << b':')
        | (1u128 << b'/')
        | (1u128 << b'\\')
        | (1u128 << b'"')
        | (1u128 << b'-')
        | (1u128 << b'!')
        | (1u128 << b'\n');

    // Membership test
    ((MASK >> b) & 1) != 0
}

fn split_wikitext_mask(text: &str) -> Vec<&str> {
    let mut pieces = Vec::with_capacity(text.len() / 2);
    let mut last_end = 0;
    for (i, ch) in text.char_indices() {
        if (ch as u32) < 128 && is_split_byte(ch as u8) {
            if last_end != i {
                pieces.push(&text[last_end..i]);
            }
            let ch_end = i + ch.len_utf8();
            pieces.push(&text[i..ch_end]);
            last_end = ch_end;
        }
    }
    if last_end < text.len() {
        pieces.push(&text[last_end..]);
    }
    pieces
}

I benched everything with criterion (rustc 1.89.0, M4 Pro):

File: large.txt, Length: 1346265 characters
Percentage of special characters: 18.32%

File: longest_article.txt, Length: 591209 characters
Percentage of special characters: 12.01%

File: main_page.txt, Length: 3029 characters
Percentage of special characters: 26.18%

Text: 43900 characters, no special characters

The mask version seems almost consistently faster (large.txt looks close, but that is probably statistical noise).
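
For anyone wanting to reproduce this, the criterion harness can be as small as the following (a sketch; split_wikitext_mask is from the earlier post):

use criterion::{criterion_group, criterion_main, Criterion};

fn bench_split(c: &mut Criterion) {
    let text = std::fs::read_to_string("large.txt").unwrap();
    c.bench_function("mask/large", |b| b.iter(|| split_wikitext_mask(&text)));
}

criterion_group!(benches, bench_split);
criterion_main!(benches);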

There is another easy allocation-related win I forgot to mention. If you are calling this function many times with different inputs, reuse the output buffer across calls (clearing it with Vec::clear) rather than allocating a new Vec each time. On macOS, all memory gets zeroed when deallocated, and there is overhead to doing this (source).
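
A sketch of the reuse pattern (the _into name is mine; is_split_byte is from my earlier post):

pub fn split_wikitext_into<'a>(text: &'a str, pieces: &mut Vec<&'a str>) {
    pieces.clear(); // drop old contents but keep the allocation
    let mut last_end = 0;
    for (i, ch) in text.char_indices() {
        if (ch as u32) < 128 && is_split_byte(ch as u8) {
            if last_end != i {
                pieces.push(&text[last_end..i]);
            }
            let ch_end = i + ch.len_utf8();
            pieces.push(&text[i..ch_end]);
            last_end = ch_end;
        }
    }
    if last_end < text.len() {
        pieces.push(&text[last_end..]);
    }
}

Used like this:

let mut pieces = Vec::with_capacity(1 << 16);
for text in &documents {
    split_wikitext_into(text, &mut pieces);
    // consume `pieces` before the next iteration
}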

On the comparison I posted above, this reduced the time for 100 scans of large.txt from ~170ms to ~160ms.

I had no real idea how fast the matches!() macro actually is compared to contains, or compared to the external bitvec crate. So I asked GPT-5 to create a criterion benchmark for these. I have not yet studied the generated code; I will do that this evening. But I think the output is quite interesting: the performance of contains and matches! is very close, the LUT with bool elements seems to be a bit faster than a bit test in a 128-bit word, and the bitvec crate seems to be the fastest. I had actually been told that bitvec is very fast in release mode, but it took me some time to remember the crate name.

stefan@hx90 /tmp/split-wikitext-bench $ cargo bench
   Compiling serde v1.0.228
   Compiling split-wikitext-bench v0.1.0 (/tmp/split-wikitext-bench)
   Compiling ciborium v0.2.2
   Compiling tinytemplate v1.2.1
   Compiling criterion v0.5.1
    Finished `bench` profile [optimized] target(s) in 6.47s
     Running unittests src/lib.rs (target/release/deps/split_wikitext_bench-5bad593f450eb448)

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running benches/split.rs (target/release/deps/split-9c6094a7c24e53a7)
Gnuplot not found, using plotters backend
contains/reps_4         time:   [3.4161 µs 3.4510 µs 3.5001 µs]
contains/reps_32        time:   [29.342 µs 29.389 µs 29.444 µs]
Found 3 outliers among 60 measurements (5.00%)
  3 (5.00%) high mild
contains/reps_256       time:   [199.19 µs 199.35 µs 199.52 µs]
Found 1 outliers among 60 measurements (1.67%)
  1 (1.67%) high severe
contains/reps_1024      time:   [974.82 µs 976.52 µs 978.40 µs]
Found 1 outliers among 60 measurements (1.67%)
  1 (1.67%) high mild

matches/reps_4          time:   [3.4842 µs 3.4926 µs 3.5007 µs]
Found 6 outliers among 60 measurements (10.00%)
  4 (6.67%) high mild
  2 (3.33%) high severe
matches/reps_32         time:   [31.767 µs 31.938 µs 32.171 µs]
matches/reps_256        time:   [213.51 µs 214.80 µs 216.76 µs]
Found 9 outliers among 60 measurements (15.00%)
  9 (15.00%) high severe
matches/reps_1024       time:   [956.44 µs 965.68 µs 972.20 µs]

ascii_lut/reps_4        time:   [2.3515 µs 2.3554 µs 2.3613 µs]
Found 2 outliers among 60 measurements (3.33%)
  2 (3.33%) high severe
ascii_lut/reps_32       time:   [16.575 µs 16.636 µs 16.685 µs]
ascii_lut/reps_256      time:   [133.75 µs 134.08 µs 134.34 µs]
Found 1 outliers among 60 measurements (1.67%)
  1 (1.67%) high mild
ascii_lut/reps_1024     time:   [554.31 µs 555.68 µs 557.53 µs]
Found 10 outliers among 60 measurements (16.67%)
  4 (6.67%) low mild
  6 (10.00%) high mild

u128_mask/reps_4        time:   [2.5915 µs 2.5933 µs 2.5951 µs]
Found 4 outliers among 60 measurements (6.67%)
  3 (5.00%) high mild
  1 (1.67%) high severe
u128_mask/reps_32       time:   [18.538 µs 18.560 µs 18.579 µs]
Found 3 outliers among 60 measurements (5.00%)
  2 (3.33%) high mild
  1 (1.67%) high severe
u128_mask/reps_256      time:   [149.40 µs 149.61 µs 149.81 µs]
Found 1 outliers among 60 measurements (1.67%)
  1 (1.67%) high mild
u128_mask/reps_1024     time:   [616.70 µs 618.38 µs 620.58 µs]
Found 10 outliers among 60 measurements (16.67%)
  6 (10.00%) low mild
  4 (6.67%) high mild

bitvec/reps_4           time:   [1.9632 µs 1.9655 µs 1.9683 µs]
Found 2 outliers among 60 measurements (3.33%)
  2 (3.33%) high mild
bitvec/reps_32          time:   [13.481 µs 13.533 µs 13.576 µs]
bitvec/reps_256         time:   [108.33 µs 108.81 µs 109.20 µs]
Found 1 outliers among 60 measurements (1.67%)
  1 (1.67%) high mild
bitvec/reps_1024        time:   [456.89 µs 458.43 µs 460.36 µs]
Found 5 outliers among 60 measurements (8.33%)
  4 (6.67%) low mild
  1 (1.67%) high mild

This is genius!
