chars().count() returns different values depending on normalization

rtbo · May 9, 2025, 3:21pm

chars().count() is supposed to return the number of characters in a Unicode string.
I found out that, depending on the Unicode normalization form, this is not always the case.

See Rust Playground
I get accents as separate characters.

Is this on purpose? (If so, why?)
How do I always get a consistent count of 8 characters for the "kérosène" string?

tczajka · May 9, 2025, 3:31pm

Yes, it's on purpose. chars returns Unicode codepoints. Accents can be encoded as separate codepoints.
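A minimal sketch of the difference, using only the standard library. The helper name `scalar_count` is made up for illustration, and the two literals are assumed to match the NFC and NFD spellings of "kérosène" from the question:

```rust
/// Counts Unicode scalar values, which is what `chars().count()` reports.
fn scalar_count(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    // Precomposed (NFC) spelling: é and è are single code points.
    let composed = "k\u{00e9}ros\u{00e8}ne";
    // Decomposed (NFD) spelling: a plain e followed by a combining accent.
    let decomposed = "ke\u{0301}rose\u{0300}ne";

    println!("composed:   {} chars, {} bytes", scalar_count(composed), composed.len());
    println!("decomposed: {} chars, {} bytes", scalar_count(decomposed), decomposed.len());
}
```

Both strings render the same on screen, but the decomposed one carries two extra code points for the accents.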

You may want the unicode-segmentation crate.

khimru · May 9, 2025, 7:34pm

Because that's the only thing you can provide on many OSes without a pile of tables that would turn a “Hello, World” program into a multi-megabyte binary.

Rust has decided not to go that route, while Swift, e.g., decided that it's a sensible choice since iOS provides the needed services in the OS itself (and it produces huge binaries on GNU/Linux as a result).

derspiny · May 9, 2025, 8:44pm

Consider the string "É", in the abstract.

It has two representations in Unicode:

"\u{00c9}", using the Unicode character Latin Capital Letter E with Acute (U+00C9), and
"\u{0045}\u{0301}", using the Unicode characters Latin Capital Letter E (U+0045), and Combining Acute Accent (U+0301).

These two strings both contain a single grapheme cluster - an E with an accent - but they do not contain the same number of characters. chars() gives you character count, and the second string contains two.
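To make the two representations visible, here is a small sketch that lists the code points of each string. The helper `code_points` is a hypothetical name, not something from the standard library; note that U+0045 is the capital E, while U+0065 would be the lowercase one:

```rust
/// Returns the code points of `s` formatted as "U+XXXX" strings.
fn code_points(s: &str) -> Vec<String> {
    s.chars().map(|c| format!("U+{:04X}", c as u32)).collect()
}

fn main() {
    let precomposed = "\u{00c9}"; // Latin Capital Letter E with Acute
    let combining = "\u{0045}\u{0301}"; // Capital E + Combining Acute Accent

    println!("{:?}", code_points(precomposed));
    println!("{:?}", code_points(combining));

    // `==` on str compares the encoded bytes, so the two spellings
    // of the same grapheme cluster are not equal.
    println!("equal: {}", precomposed == combining);
}
```

The comparison at the end is the practical reason normalization matters: without it, equivalent strings do not compare equal.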

Normalizing either string under NFC will give you the single-character version. Normalizing either string under NFD will give you the two-character version. That's the point of normalization - to remove some of the redundancy built into Unicode and make sure that all equivalent strings encode equally.

If you want to get a grapheme cluster count - which is what you probably want when asking for a function foo such that foo("kérosène") == 8 regardless of normalization - then you need to use external libraries, as this isn't built into Rust. I use this one for this purpose. If you don't want to use a separate library, you can approximate grapheme cluster count by counting characters under NFC, with the caveat that graphemes with no single-character representation will result in a count larger than the number of actual grapheme clusters in the string.
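For illustration only, here is a much cruder std-only approximation than either option above: skip code points in the Combining Diacritical Marks block (U+0300 to U+036F) while counting. This is not real grapheme segmentation and not NFC counting; it ignores other combining blocks, emoji sequences, Hangul jamo and so on, and the function name `approx_grapheme_count` is made up. It only happens to be enough for Latin text with common accents:

```rust
/// Rough grapheme estimate: counts chars that are NOT in the
/// Combining Diacritical Marks block (U+0300..=U+036F).
/// Assumption: the input only uses combining marks from that block.
fn approx_grapheme_count(s: &str) -> usize {
    s.chars()
        .filter(|c| !('\u{0300}'..='\u{036F}').contains(c))
        .count()
}

fn main() {
    let composed = "k\u{00e9}ros\u{00e8}ne"; // NFC spelling
    let decomposed = "ke\u{0301}rose\u{0300}ne"; // NFD spelling

    println!("{}", approx_grapheme_count(composed));
    println!("{}", approx_grapheme_count(decomposed));
}
```

For anything beyond a quick check, a proper grapheme segmentation library is the safer choice.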

rtbo · May 9, 2025, 9:35pm

Thanks all.
It is even clearly stated in the documentation. char - Rust

For my purpose I will use unicode-normalization to convert strings to the NFC form.

steffahn · May 9, 2025, 9:41pm

FYI, it’s also stated in the documentation of .chars():

Returns an iterator over the chars of a string slice.

As a string slice consists of valid UTF-8, we can iterate through a string slice by char. This method returns such an iterator.

It’s important to remember that char represents a Unicode Scalar Value, and might not match your idea of what a ‘character’ is. Iteration over grapheme clusters may be what you actually want. This functionality is not provided by Rust’s standard library, check crates.io instead.

