Should you really use `.chars()` for characters in a string?

It is well known that .len() does not tell you the number of characters in a string, but rather the number of bytes (or should I say code units?) it uses in UTF-8 encoding. For counting characters, .chars().count() is recommended instead.
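For instance (a minimal illustration; the precomposed character U+00E9 is just an arbitrary non-ASCII example):

fn main() {
    let s = "\u{e9}"; // é, U+00E9 LATIN SMALL LETTER E WITH ACUTE
    assert_eq!(s.len(), 2);           // two bytes in UTF-8
    assert_eq!(s.chars().count(), 1); // one Unicode scalar value
}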

However, .chars() iterates over Unicode scalar values, of which it might take several to make up a single character. It seems that .graphemes(true) (from the crate unicode-segmentation), which iterates over grapheme clusters, better corresponds to what you'd think of as characters.

Questions:

  1. Isn't .graphemes(true) what you should typically use instead of .chars()?
  2. When do you actually need .chars() except when working with Unicode particulars?
  3. Should .chars() really have been about grapheme clusters all along?

use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let s = concat!(
        "a",        // U+0061 LATIN SMALL LETTER A 
        "\u{030a}"  // U+030A COMBINING RING ABOVE
    );

    // A grapheme cluster is essentially a character.
    // <https://www.unicode.org/glossary/#grapheme_cluster>
    println!(
        "{} grapheme cluster: {}",
        s.graphemes(true).count(),
        s
    );

    // A Unicode scalar value is essentially a Unicode code point.
    // <https://www.unicode.org/glossary/#unicode_scalar_value>
    println!(
        "{} Unicode scalar values: {:?}",
        s.chars().count(),
        s.chars().collect::<Vec<char>>()
    );

    println!(
        "{} bytes: {:02x?}",
        s.len(),
        s.as_bytes()
    );
}

(Playground)

Output:

1 grapheme cluster: å
2 Unicode scalar values: ['a', '\u{30a}']
3 bytes: [61, cc, 8a]

My understanding is that it's considered undesirable for the standard library to bundle all the UCD data required to implement segmentation into grapheme clusters. (One use of str::chars would be as a primitive on top of which iteration over grapheme clusters, words, etc. can be built; a rough sketch of that idea follows.)
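As a sketch of that idea (this is not how unicode-segmentation actually works, and it is nowhere near real UAX #29 segmentation; it only glues combining marks from the U+0300..=U+036F block onto the preceding character):

// Toy clustering built on top of str::chars: attach combining diacritical
// marks (U+0300..=U+036F only) to the character before them. Real grapheme
// segmentation needs the full UCD property tables and the UAX #29 rules.
fn naive_clusters(s: &str) -> Vec<String> {
    let mut clusters: Vec<String> = Vec::new();
    for c in s.chars() {
        let is_combining = ('\u{0300}'..='\u{036f}').contains(&c);
        if is_combining && !clusters.is_empty() {
            // Append the mark to the cluster started by the previous char.
            clusters.last_mut().unwrap().push(c);
        } else {
            // Start a new cluster.
            clusters.push(c.to_string());
        }
    }
    clusters
}

fn main() {
    // "a" + U+030A COMBINING RING ABOVE ends up in a single cluster.
    assert_eq!(naive_clusters("a\u{030a}"), vec!["a\u{030a}".to_string()]);
    println!("{:?}", naive_clusters("a\u{030a}bc"));
}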


.chars() is very unfortunately named. These are not characters, but Unicode code points (strictly speaking, scalar values). A code point may represent multiple characters at once (digraphs, ligatures), or one character, or a fragment of a character (combining marks), or a fragment of a fragment of a character (emoji ZWJ sequences), or a control code that is not even displayable (directionality controls, language tags, variation selectors, the object replacement character).
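To make the mismatch concrete (a small sketch using unicode-segmentation; the family emoji is just one arbitrary ZWJ sequence):

use unicode_segmentation::UnicodeSegmentation;

fn main() {
    // U+1F468 MAN, U+200D ZWJ, U+1F469 WOMAN, U+200D ZWJ, U+1F467 GIRL
    let family = "\u{1f468}\u{200d}\u{1f469}\u{200d}\u{1f467}";
    assert_eq!(family.graphemes(true).count(), 1); // displayed as one "character"
    assert_eq!(family.chars().count(), 5);         // five scalar values
    assert_eq!(family.len(), 18);                  // eighteen bytes in UTF-8
}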

So yes, at the very least it should use some other name (Go calls them runes) to steer people looking for "characters" away from code points.

Grapheme clusters are closer to being "characters", but human writing systems are complex, and not all of them even have an equivalent of a "character" as we know it from Latin scripts (a fun example is Hangul, which has letters, but the letters are not written as separate characters: they are composed into syllable blocks). What you should use very much depends on what you're trying to do, and how worried you are about producing nonsense presentation/orthography/editing of non-Latin text.
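For example (again a small sketch using unicode-segmentation; the decomposed jamo spelling of one syllable is an arbitrary choice):

use unicode_segmentation::UnicodeSegmentation;

fn main() {
    // Decomposed jamo: U+1112 HIEUH + U+1161 A + U+11AB NIEUN,
    // canonically equivalent to the precomposed syllable 한 (U+D55C).
    let syllable = "\u{1112}\u{1161}\u{11ab}";
    assert_eq!(syllable.graphemes(true).count(), 1); // one syllable block
    assert_eq!(syllable.chars().count(), 3);         // three letters (jamo)
}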

Rust's situation is a pragmatic compromise. Full, proper Unicode support is very complex, requires a lot of algorithms (many of them locale-specific), and it's not easy to use them properly. At the other extreme, working only with raw bytes is terrible. So Rust's code-point-level support is not as bad as bytes, and not as expensive as full Unicode support.

