I'm currently trying to optimize skim (a fuzzy finder) for large inputs.
One of the bottlenecks is the conversion from String to Vec<char>. When consuming around 500 MB of input (around 5M lines), the conversion takes around 50% of the total time (3.8 s out of 6.5 s).
The code I use:
let chars: Vec<char> = text.chars().collect();
I'm not sure whether the cost comes from chars() decoding UTF-8 or from collect() allocating additional memory.
One optimization could be to check whether the input bytes are all ASCII and reuse the original string's bytes to avoid the allocation. For example:
let chars = if text.as_bytes().is_ascii() {
    // how to reuse the string as Vec<char> without calling `.chars()`?
} else {
    text.chars().collect()
};
So the question is: how can I cast a &[u8] to a &[char] without an additional allocation?
char is 32 bits wide, so the memory layout of &[u8] isn't compatible with &[char].
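To make the layout point concrete, and to show the closest safe alternative (which still allocates, but skips the UTF-8 decoding), here's a small sketch; the variable names are just for illustration:

```rust
fn main() {
    // A char is 4 bytes and a u8 is 1, so a &[u8] can never be
    // reinterpreted as a &[char] without conversion.
    assert_eq!(std::mem::size_of::<char>(), 4);
    assert_eq!(std::mem::size_of::<u8>(), 1);

    // For all-ASCII input you can still skip UTF-8 decoding by widening
    // each byte to a char; note this still allocates a new Vec.
    let text = "rust";
    let chars: Vec<char> = if text.is_ascii() {
        text.bytes().map(char::from).collect()
    } else {
        text.chars().collect()
    };
    assert_eq!(chars, vec!['r', 'u', 's', 't']);
}
```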
In general, it's probably faster to operate on the incoming bytes directly rather than converting to Vec<char>. Hard to say more without knowing more about the application, though.
Given a pattern "rt", the application will try to (fuzzy) match it against the input "rust". And since the input may contain non-ASCII UTF-8 characters, the matching function expects &[char].
@comex I see that &[u8] could not be used directly as &[char]. I'll try to create some wrappers over it.
@newpavlov Unfortunately iterators won't work in my case since the matching function will access characters by index. But still, great idea!
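One possible shape for such a wrapper (a hedged sketch, not skim's actual code: the type and method names here are made up) is an enum that keeps the raw bytes when the input is pure ASCII and only falls back to Vec<char> otherwise, while still offering indexed access:

```rust
// Hypothetical wrapper: indexed char access that avoids the Vec<char>
// allocation when the input is pure ASCII.
enum CharView<'a> {
    Ascii(&'a [u8]),
    Unicode(Vec<char>),
}

impl<'a> CharView<'a> {
    fn new(text: &'a str) -> Self {
        if text.is_ascii() {
            CharView::Ascii(text.as_bytes())
        } else {
            CharView::Unicode(text.chars().collect())
        }
    }

    // Index-based access, as the matching function requires.
    fn get(&self, i: usize) -> Option<char> {
        match self {
            CharView::Ascii(bytes) => bytes.get(i).map(|&b| b as char),
            CharView::Unicode(chars) => chars.get(i).copied(),
        }
    }

    fn len(&self) -> usize {
        match self {
            CharView::Ascii(bytes) => bytes.len(),
            CharView::Unicode(chars) => chars.len(),
        }
    }
}

fn main() {
    let v = CharView::new("rust"); // ASCII: no allocation
    assert_eq!(v.get(2), Some('s'));

    let w = CharView::new("héllo"); // non-ASCII: falls back to Vec<char>
    assert_eq!(w.len(), 5);
    assert_eq!(w.get(1), Some('é'));
}
```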
Consider treating both the pattern and input as &[u8], in the case where the pattern is all ASCII. Multi-byte codepoints in UTF-8 will never contain any ASCII bytes, so there's no risk of an ASCII pattern incorrectly matching a non-ASCII character.
However, this might require writing a slightly different version of the matching algorithm for this case.
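A minimal sketch of what that byte-level variant could look like, assuming a plain subsequence-style fuzzy match (skim's real scoring algorithm is more involved):

```rust
// Subsequence ("fuzzy") match where the pattern is known to be ASCII.
// In UTF-8, every byte of a multi-byte codepoint is >= 0x80, so an ASCII
// pattern byte can never falsely match inside a non-ASCII character.
fn fuzzy_match_ascii(pattern: &[u8], input: &[u8]) -> bool {
    debug_assert!(pattern.is_ascii());
    let mut rest = input.iter();
    // Each pattern byte must appear, in order, somewhere in the input.
    pattern.iter().all(|&p| rest.any(|&b| b == p))
}

fn main() {
    assert!(fuzzy_match_ascii(b"rt", b"rust"));
    // "r\u{e9}t" is "rét"; the 0xC3 0xA9 bytes are skipped harmlessly.
    assert!(fuzzy_match_ascii(b"rt", "r\u{e9}t".as_bytes()));
    assert!(!fuzzy_match_ascii(b"rx", b"rust"));
}
```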
Speaking as someone who works in this area quite a bit, I'd like to repeat @mbrubeck's advice: you almost certainly do not want to be working with a Vec<char>. Instead, you should work with a &str or possibly even a &[u8]. (I kind of imagine you need to do the latter anyway if your intent is to search things like file paths, which aren't guaranteed to be valid UTF-8.)
If you need the ability to slice codepoints, then you should report byte indices at valid UTF-8 boundaries, and then you can slice a &str or a &[u8] in constant time. This is how the regex library works.
If you have specific questions about why this strategy is problematic, I'd be happy to try and answer them.