Use &[u8] as &[char] assuming ASCII?

Currently I am trying to optimize skim (which is a fuzzy finder) for large chunks of inputs.

One of the bottlenecks is the conversion from String to Vec<char>. When consuming around 500M input(with around 5M lines) the conversion process takes around 50% of the time (3.8s out of 6.5s).

The code I use:

let chars: Vec<char> = text.chars().collect();

Not sure if it is caused by chars() that is slow for checking UTF8 characters or collect() that allocates additional memories.

Thus one optimization could be checking if the input bytes are all ASCII characters and reuse the original string as Vec<char> to avoid allocation. For example:

let chars = if text.as_bytes().is_ascii() {
    // how to reuse string as Vec<char> without calling `.chars()`?
} else {
    text.chars().collect()
};

So the problem is how could I cast a &[u8] to &[char] without additional allocation?

Thanks in advance!

char is 32 bits wide, so the memory layout of &[u8] isn’t compatible with &[char].

In general, it’s probably faster to operate on the incoming bytes directly rather than converting to Vec<char>. Hard to say more without knowing more about the application, though.

1 Like

Can your code work over iterator over chars? If yes, you can do something like this:

fn do_stuff(d: impl Iterator<Item=impl Into<char>>) {
    // ..
}

fn foo(text: &str) {
    if text.as_bytes().is_ascii() {
        do_stuff(text.as_bytes().iter().cloned())
    } else {
        do_stuff(text.chars())
    };
}

Given a pattern "rt", the application will try to (fuzzy) match the input "rust". And since the input may contain UTF8 characters the matching function is expecting &[char].

@comex I see that &[u8] could not be used directly as &[char]. I’ll try to create some wrappers over it.

@newpavlov Unfortunately iterators won’t work in my case since the matching function will access characters by index. But still, great idea!

text.chars().collect()

what if you try to use preallocation?

let chars = text.chars();
let mut v: Vec<char> = Vec::with_capasity(chars.len());
v = chars.collect();

But I’m thinking collect does the same.

Consider treating both the pattern and input as &[u8], in the case where the pattern is all ASCII. Multi-byte codepoints in UTF-8 will never contain any ASCII bytes, so there’s no risk of an ASCII pattern incorrectly matching a non-ASCII character.

However, this might require writing a slightly different version of the matching algorithm for this case.

2 Likes

Does that even work? You are overwriting the v variable with another instance of a vec.

:man_facepalming: I was in sleeping mode
Also


let mut v = Vec::with_capacity(text.len());
for c in text.chars() {
    v.push(c);
}

Or even easier

let mut v = vec![];
v.extend(text.chars());

You could reserve the exact len, as virtuos86 did, but in the code for extend, there is this line:

if len == self.capacity() {
    let (lower, _) = iterator.size_hint();
    self.reserve(lower.saturating_add(1));
}

which means, that it will extend itself, based on the size_hint, which should give the correct amount of chars back

Speaking as someone who works in this area quite a bit, I’d like to repeat @mbrubeck’s advice: you almost certainly do not want to be working with a Vec<char>. Instead, you should work with a &str or possibly even a &[u8]. (I kind of imagine you need to do the latter anyway if you’re intent is to search things like file paths, which aren’t guaranteed to be valid UTF-8.)

If you need the ability to slice codepoints, then you should report byte indices at valid UTF-8 boundaries, and then you can slice a &str or a &[u8] in constant time. This is how the regex library works.

If you have specific questions about why this strategy is problematic, I’d be happy to try and answer them.

7 Likes

Seems that collect had reserve the capacity with iterator.size_hint() right? And my rough test showed that it did not make much difference.

1 Like