Fastest way to count number of characters beginning a string

Hi there!

I'm currently writing a program where, at one point, I need to count the whitespace characters at the beginning of a string. Some of them are Unicode whitespace, which can be checked with char::is_whitespace, and I also have to make sure none of them is a \n.

Here is what I came up with:

let mut len = 0;
let mut chars = input.source.chars().peekable();

while matches!(chars.peek(), Some(c) if c.is_whitespace() && *c != '\n')
{
    len += chars.next().unwrap().len_utf8();
}

This is not a very elegant solution, and it's probably not very performant either, as I call .peek() in addition to .next(). I also tried:

let mut len = 0;
let mut chars = input.source.chars();

loop {
    match chars.next() {
        Some(c) => {
            if c.is_whitespace() && c != '\n' {
                len += c.len_utf8();
            } else {
                break;
            }
        }

        None => break,
    }
}

This will be a bit faster but still not great.

So, what is the fastest (and ideally elegant) way to solve this problem?

Thanks in advance for your help!

I don't know about fastest (how are you measuring that? Are you running in release mode, etc.?), but the cleanest way I can think of is:

fn count_whitespace_bytes_at_start(input: &str) -> usize {
    input
        .chars()
        .take_while(|ch| ch.is_whitespace() && *ch != '\n')
        .map(|ch| ch.len_utf8())
        .sum()
}

There's no reason to use peekable here, since char is Copy.
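For what it's worth, the same byte count can also be computed with trim_start_matches, using the length difference before and after trimming. A std-only sketch (the function name is just one I made up):

```rust
// Byte count of leading whitespace (excluding '\n'), via trim_start_matches:
// the bytes removed from the front are exactly the leading whitespace bytes.
fn leading_whitespace_bytes(input: &str) -> usize {
    input.len()
        - input
            .trim_start_matches(|c: char| c.is_whitespace() && c != '\n')
            .len()
}

fn main() {
    // Two ASCII spaces (1 byte each) + U+00A0 NBSP (2 bytes in UTF-8).
    assert_eq!(leading_whitespace_bytes("  \u{a0}x"), 4);
    // '\n' is deliberately excluded, so nothing is trimmed here.
    assert_eq!(leading_whitespace_bytes("\nfoo"), 0);
    println!("ok");
}
```

Whether this beats the take_while version is something you'd have to benchmark; it's mainly a readability alternative.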

EDIT: Whoops, yes, as @kpreid said, this is counting bytes, not characters. I just assumed that was what you wanted based on the existing code and forgot about the title :sweat_smile: I've edited my answer to make that a bit clearer in case other people find this thread later.

By using len_utf8() here, you're not counting characters, you're counting bytes. If you want to do that, then you can use char_indices() to find the index of the first non-whitespace.

fn count_whitespace_at_start(input: &str) -> usize {
    input
        .char_indices()
        .find(|(_, ch)| !(ch.is_whitespace() && *ch != '\n'))
        .map(|(i, _)| i)
        .unwrap_or_else(|| input.len())
}

Unfortunately, this isn't any shorter, but it might be faster (I haven't checked).

If you actually want characters and not bytes, then

    input
        .chars()
        .take_while(|ch| ch.is_whitespace() && *ch != '\n')
        .count()

should do.

That's exactly what I was looking for, thanks!

For performance comparisons, I just look at how the algorithm works, and if it's simple enough, I look directly at the generated assembly.

Given the way Rust iterators work, this should be fast. I'll check, but it definitely seems like an ideal solution!

Sorry, the title was indeed confusing: I meant to count the number of bytes that these characters take, in order to slice the input directly without going through an iterator. Hence the .len_utf8().
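In other words, with the byte count in hand the rest of the input can be sliced directly. A small sketch (the helper name is hypothetical); since the count is a sum of len_utf8 values, the slice boundary always lands on a char boundary and can't panic:

```rust
// Hypothetical helper: byte length of the leading whitespace (excluding '\n').
fn leading_ws_bytes(input: &str) -> usize {
    input
        .chars()
        .take_while(|c| c.is_whitespace() && *c != '\n')
        .map(char::len_utf8)
        .sum()
}

fn main() {
    let input = "  \thello";
    let len = leading_ws_bytes(input);
    // len is a sum of len_utf8 values, so input[len..] is always a valid
    // char-boundary slice and cannot panic.
    let rest = &input[len..];
    assert_eq!(len, 3);
    assert_eq!(rest, "hello");
    println!("{rest}");
}
```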

You might want to test how this loop compares, performance-wise, to a regex. The regex has the downside that you'd need to count the characters separately, but finding the end of the initial run of whitespace might be quite fast.

(Mostly I’m curious where regexes stand here performance-wise; I’m not really convinced that they’d actually be faster in this case.)

Here’s a playground with a demo of how it could work: Rust Playground. If you measure this implementation, make sure you don't count the one-time initialization of the Regex, e.g. by calling Lazy::force(&WHITESP_NO_NEWLINE) at some earlier point.

Thanks, I can definitely check this out! :slight_smile:

But a regex will necessarily be slower algorithmically, as it needs to do exactly the same work (iterating through the string and counting the whitespace characters) plus some extra work (interpreting the regex, building a match object from the result).

The only way it could be faster is if it uses CPU-specific optimizations such as SIMD (which is entirely possible). So I'll check whether that's the case :wink:

Disclaimer: I haven't measured anything.

Regex does have SIMD support, but generally it helps a lot more with throughput than with latency. So if you're testing on short strings (and that represents your actual workload), then a regex is unlikely to improve matters.

Usually "interpreting the regex" is a one-time cost that is amortized away. Not always, but usually.

Another really interesting potential approach (which will only start paying off on long strings) would be to find the index of the first non-whitespace character (with whatever strategy) and then use bytecount::num_chars.

Basically, bytecount should be the fastest way to count the number of chars in a slice. If you can find the index of the first undesired char meaningfully faster without doing full UTF-8 decoding, counting the chars in a second pass could be faster.
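For reference, the counting half of that idea can be sketched in plain std: in UTF-8, the number of chars equals the number of non-continuation bytes, which is essentially what bytecount::num_chars computes (minus the SIMD). The function name here is made up:

```rust
// Two-pass sketch: first find the byte index where the leading whitespace
// ends, then count the chars in that prefix by counting bytes that are NOT
// UTF-8 continuation bytes (pattern 10xxxxxx).
fn leading_whitespace_chars(input: &str) -> usize {
    let end = input
        .char_indices()
        .find(|(_, c)| !(c.is_whitespace() && *c != '\n'))
        .map(|(i, _)| i)
        .unwrap_or(input.len());
    input.as_bytes()[..end]
        .iter()
        .filter(|&&b| (b & 0xC0) != 0x80)
        .count()
}

fn main() {
    // Two NBSPs (2 bytes each) + one ASCII space = 3 chars of whitespace.
    assert_eq!(leading_whitespace_chars("\u{a0}\u{a0} x"), 3);
    assert_eq!(leading_whitespace_chars("abc"), 0);
    println!("ok");
}
```

As written, the first pass still does full UTF-8 decoding, so this only pays off if that pass can be replaced by something faster.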

However, with an inclusion set as mixed as Unicode's \p{WHITESPACE}, I doubt that's reasonably possible; it would likely require handwritten SIMD-friendly code that has to be reoptimized every time the set is updated. In simpler words: not really worth it.

Even if it were a single simple character, you'd still need to iterate over the string to find the first non-matching one. And given that you have to iterate anyway, just incrementing a counter while doing so has about the lowest possible cost, as far as I know.

The real performance question is finding the first non-whitespace character, and I don't think that can be done any faster than the way @17cupsofcoffee showed.

Which is honestly fast enough. For context: I'm currently building a parser generator library, and since most programs in most languages contain a lot of whitespace, the parser needs to be extremely fast.

You could consider using logos for the lexer; it handles tokenization faster than anything else I've seen and is, imho, the benchmark to beat.

I didn't know about that, very interesting!

Thank you for the link, I'll check it out :slight_smile:

One thing to watch out for with regex is that, by default, it compiles about 1 MB of Unicode data into your binary: regex - Rust

Thanks I'll keep that in mind. Although I'm not worried about binary size as the final program will contain a parser + typechecker + runtime + compiler + debugger, so it will be pretty heavy even with 1 MB less ^^"