Use &[u8] as &[char] assuming ASCII?

lotabout · September 2, 2018, 12:45am

Currently I am trying to optimize skim (which is a fuzzy finder) for large chunks of inputs.

One of the bottlenecks is the conversion from String to Vec<char>. When consuming around 500M input(with around 5M lines) the conversion process takes around 50% of the time (3.8s out of 6.5s).

The code I use:

let chars: Vec<char> = text.chars().collect();

Not sure if it is caused by chars() that is slow for checking UTF8 characters or collect() that allocates additional memories.

Thus one optimization could be checking if the input bytes are all ASCII characters and reuse the original string as Vec<char> to avoid allocation. For example:

let chars = if text.as_bytes().is_ascii() {
    // how to reuse string as Vec<char> without calling `.chars()`?
} else {
    text.chars().collect()
};

So the problem is how could I cast a &[u8] to &[char] without additional allocation?

Thanks in advance!

comex · September 2, 2018, 1:23am

char is 32 bits wide, so the memory layout of &[u8] isn't compatible with &[char].

In general, it's probably faster to operate on the incoming bytes directly rather than converting to Vec<char>. Hard to say more without knowing more about the application, though.

newpavlov · September 2, 2018, 2:01am

Can your code work over iterator over chars? If yes, you can do something like this:

fn do_stuff(d: impl Iterator<Item=impl Into<char>>) {
    // ..
}

fn foo(text: &str) {
    if text.as_bytes().is_ascii() {
        do_stuff(text.as_bytes().iter().cloned())
    } else {
        do_stuff(text.chars())
    };
}

lotabout · September 2, 2018, 12:49pm

Given a pattern "rt", the application will try to (fuzzy) match the input "rust". And since the input may contain UTF8 characters the matching function is expecting &[char].

@comex I see that &[u8] could not be used directly as &[char]. I'll try to create some wrappers over it.

@newpavlov Unfortunately iterators won't work in my case since the matching function will access characters by index. But still, great idea!

virtuos86 · September 3, 2018, 2:34am

text.chars().collect()

what if you try to use preallocation?

let chars = text.chars();
let mut v: Vec<char> = Vec::with_capasity(chars.len());
v = chars.collect();

But I'm thinking collect does the same.

mbrubeck · September 3, 2018, 3:19am

Consider treating both the pattern and input as &[u8], in the case where the pattern is all ASCII. Multi-byte codepoints in UTF-8 will never contain any ASCII bytes, so there's no risk of an ASCII pattern incorrectly matching a non-ASCII character.

However, this might require writing a slightly different version of the matching algorithm for this case.

hellow · September 3, 2018, 12:11pm

Does that even work? You are overwriting the v variable with another instance of a vec.

virtuos86 · September 4, 2018, 2:47am

I was in sleeping mode
Also


let mut v = Vec::with_capacity(text.len());
for c in text.chars() {
    v.push(c);
}

hellow · September 4, 2018, 6:17am

Or even easier

let mut v = vec![];
v.extend(text.chars());

You could reserve the exact len, as virtuos86 did, but in the code for extend, there is this line:

if len == self.capacity() {
    let (lower, _) = iterator.size_hint();
    self.reserve(lower.saturating_add(1));
}

which means, that it will extend itself, based on the size_hint, which should give the correct amount of chars back

BurntSushi · September 4, 2018, 1:44pm

Speaking as someone who works in this area quite a bit, I'd like to repeat @mbrubeck's advice: you almost certainly do not want to be working with a Vec<char>. Instead, you should work with a &str or possibly even a &[u8]. (I kind of imagine you need to do the latter anyway if you're intent is to search things like file paths, which aren't guaranteed to be valid UTF-8.)

If you need the ability to slice codepoints, then you should report byte indices at valid UTF-8 boundaries, and then you can slice a &str or a &[u8] in constant time. This is how the regex library works.

If you have specific questions about why this strategy is problematic, I'd be happy to try and answer them.

lotabout · September 4, 2018, 2:07pm

Seems that collect had reserve the capacity with iterator.size_hint() right? And my rough test showed that it did not make much difference.

Topic		Replies	Views
Is there a missing API for optimal &[u8] -> Result<String> conversion?	10	868	November 26, 2019
String implementation community	8	858	September 21, 2022
Iterating over chars from readers help	5	860	January 12, 2023
Is there another way of indexing a String rather than converting it to bytes?	30	1173	November 17, 2020
Builtin for Vec<u8> to String	5	1950	July 16, 2020

Use &[u8] as &[char] assuming ASCII?

Related Topics