chars().count() returns a different value depending on normalization

chars().count() is supposed to return the number of characters in a Unicode string.
I found out that, depending on the Unicode normalization form, that is not always the case.

See Rust Playground
I get accents as separate characters.

Is this on purpose? (If so, why?)
How do I always get a consistent count of 8 characters for the "kérosène" string?


Yes, it's on purpose. chars returns Unicode codepoints. Accents can be encoded as separate codepoints.
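As a minimal sketch of that behaviour (the helper name scalar_count is made up for illustration, and the two literals are assumed to correspond to the precomposed and decomposed spellings of the string from the question):

```rust
/// Counts Unicode scalar values (Rust `char`s), which is what
/// `chars().count()` reports. `scalar_count` is a hypothetical helper name.
fn scalar_count(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    // Precomposed spelling: é (U+00E9) and è (U+00E8) are single code points.
    let composed = "k\u{00E9}ros\u{00E8}ne";
    // Decomposed spelling: a plain 'e' followed by a combining accent
    // (U+0301 acute, U+0300 grave).
    let decomposed = "ke\u{0301}rose\u{0300}ne";

    println!("composed:   {}", scalar_count(composed)); // 8
    println!("decomposed: {}", scalar_count(decomposed)); // 10
}
```

Both literals render as the same word, but the second one carries two extra code points for the accents.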

You may want the unicode-segmentation crate.


Because that's the only thing you can provide on many OSes without a pile of tables that would turn a “Hello, World” program into a multi-megabyte binary.

Rust decided not to go that route. Swift, for example, decided that since iOS provides the needed services in the OS itself, relying on them is a sensible choice (and, as a result, it produces huge binaries on GNU/Linux).


Consider the string "É", in the abstract.

It has two representations in Unicode:

  • "\u{00c9}", using the Unicode character Latin Capital Letter E with Acute (U+00C9), and
  • "\u{0045}\u{0301}", using the Unicode characters Latin Capital Letter E (U+0045) and Combining Acute Accent (U+0301).

These two strings both contain a single grapheme cluster - an E with an accent - but they do not contain the same number of characters. chars() gives you the character (code point) count, and for the second string that count is two.
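To see the difference concretely, here is a small std-only sketch. The helper name code_points is made up for illustration. Note that U+0045 is the capital E, while U+0065 would be the lowercase one.

```rust
/// Lists the code points of a string in U+XXXX notation.
/// `code_points` is a hypothetical helper, not part of the standard library.
fn code_points(s: &str) -> Vec<String> {
    s.chars().map(|c| format!("U+{:04X}", c as u32)).collect()
}

fn main() {
    let precomposed = "\u{00C9}"; // Latin Capital Letter E with Acute
    let decomposed = "\u{0045}\u{0301}"; // Capital E + Combining Acute Accent

    println!("{:?}", code_points(precomposed)); // ["U+00C9"]
    println!("{:?}", code_points(decomposed)); // ["U+0045", "U+0301"]

    // `==` on &str compares the UTF-8 bytes, so the two spellings differ
    // even though they display as the same letter.
    println!("equal: {}", precomposed == decomposed); // false
    println!("bytes: {} vs {}", precomposed.len(), decomposed.len()); // 2 vs 3
}
```

The comparison result is why normalizing both sides to the same form matters before comparing or counting.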

Normalizing either string under NFC will give you the single-character version. Normalizing either string under NFD will give you the two-character version. That's the point of normalization - to remove some of the redundancy built into Unicode and make sure that all equivalent strings encode equally.

If you want to get a grapheme cluster count - which is what you probably want when asking for a function foo such that foo("kérosène") == 8 regardless of normalization - then you need to use external libraries, as this isn't built into Rust. I use this one for this purpose. If you don't want to use a separate library, you can approximate grapheme cluster count by counting characters under NFC, with the caveat that graphemes with no single-character representation will result in a count larger than the number of actual grapheme clusters in the string.
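For illustration only, here is a rough std-only approximation. It is a different technique from the NFC counting described above: instead of normalizing, it skips code points that fall in the main combining-mark blocks. The function names are made up, and this is not real grapheme segmentation (emoji sequences, Hangul jamo, and marks outside the listed blocks are all counted wrongly), so treat it as a sketch rather than a replacement for a proper library.

```rust
/// Rough check for a combining mark. Only the main combining blocks are
/// covered; this is an assumption made to keep the sketch short.
fn is_combining_mark(c: char) -> bool {
    matches!(
        c as u32,
        0x0300..=0x036F     // Combining Diacritical Marks
            | 0x1AB0..=0x1AFF // Combining Diacritical Marks Extended
            | 0x1DC0..=0x1DFF // Combining Diacritical Marks Supplement
            | 0x20D0..=0x20FF // Combining Diacritical Marks for Symbols
            | 0xFE20..=0xFE2F // Combining Half Marks
    )
}

/// Counts every `char` that is not a combining mark from the blocks above.
/// `approx_grapheme_count` is a hypothetical name.
fn approx_grapheme_count(s: &str) -> usize {
    s.chars().filter(|c| !is_combining_mark(*c)).count()
}

fn main() {
    let composed = "k\u{00E9}ros\u{00E8}ne";
    let decomposed = "ke\u{0301}rose\u{0300}ne";

    // Both spellings give the same count with this approximation.
    println!("{}", approx_grapheme_count(composed)); // 8
    println!("{}", approx_grapheme_count(decomposed)); // 8
}
```

For accented Latin text like the example this gives the same answer in either normalization form. Anything more complex needs actual grapheme cluster segmentation.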


Thanks all.
It is even clearly stated in the documentation. char - Rust

For my purpose I will use unicode-normalization to convert strings to the NFC form.


FYI, it’s also stated in the documentation of .chars():

Returns an iterator over the chars of a string slice.

As a string slice consists of valid UTF-8, we can iterate through a string slice by char. This method returns such an iterator.

It’s important to remember that char represents a Unicode Scalar Value, and might not match your idea of what a ‘character’ is. Iteration over grapheme clusters may be what you actually want. This functionality is not provided by Rust’s standard library, check crates.io instead.
