Should you really use `.chars()` for characters in a string?

It is well known that .len() does not tell you the number of characters in a string, but rather the number of bytes (or should I say code units?) it uses in UTF-8 encoding. For counting characters, .chars().count() is recommended instead.
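For instance (a minimal illustration; the precomposed character U+00E9 is just an arbitrary non-ASCII example):

fn main() {
    let s = "\u{e9}"; // é, U+00E9 LATIN SMALL LETTER E WITH ACUTE
    assert_eq!(s.len(), 2);           // two bytes in UTF-8
    assert_eq!(s.chars().count(), 1); // one Unicode scalar value
}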

However, .chars() iterates over Unicode scalar values, of which it might take several to make up a single character. It seems that .graphemes(true) (from the crate unicode-segmentation), which iterates over grapheme clusters, better corresponds to what you'd think of as characters.

Questions:

  1. Isn't .graphemes(true) what you should typically use instead of .chars()?
  2. When do you actually need .chars() except when working with Unicode particulars?
  3. Should .chars() really have been about grapheme clusters all along?

use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let s = concat!(
        "a",        // U+0061 LATIN SMALL LETTER A 
        "\u{030a}"  // U+030A COMBINING RING ABOVE
    );

    // A grapheme cluster is essentially a character.
    // <https://www.unicode.org/glossary/#grapheme_cluster>
    println!(
        "{} grapheme cluster: {}",
        s.graphemes(true).count(),
        s
    );

    // A Unicode scalar value is essentially a Unicode code point.
    // <https://www.unicode.org/glossary/#unicode_scalar_value>
    println!(
        "{} Unicode scalar values: {:?}",
        s.chars().count(),
        s.chars().collect::<Vec<char>>()
    );

    println!(
        "{} bytes: {:02x?}",
        s.len(),
        s.as_bytes()
    );
}

(Playground)

Output:

1 grapheme cluster: å
2 Unicode scalar values: ['a', '\u{30a}']
3 bytes: [61, cc, 8a]

My understanding is that it's considered undesirable for the standard library to bundle all the UCD data required to implement segmentation into grapheme clusters. (One use of str::chars would be as a primitive on top of which iteration over grapheme clusters, words, etc. can be built; a rough sketch of that idea follows.)
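As a sketch of that idea (this is not how unicode-segmentation actually works, and it is nowhere near real UAX #29 segmentation; it only glues combining marks from the U+0300..=U+036F block onto the preceding character):

// Toy clustering built on top of str::chars: attach combining diacritical
// marks (U+0300..=U+036F only) to the character before them. Real grapheme
// segmentation needs the full UCD property tables and the UAX #29 rules.
fn naive_clusters(s: &str) -> Vec<String> {
    let mut clusters: Vec<String> = Vec::new();
    for c in s.chars() {
        let is_combining = ('\u{0300}'..='\u{036f}').contains(&c);
        if is_combining && !clusters.is_empty() {
            // Append the mark to the cluster started by the previous char.
            clusters.last_mut().unwrap().push(c);
        } else {
            // Start a new cluster.
            clusters.push(c.to_string());
        }
    }
    clusters
}

fn main() {
    // "a" + U+030A COMBINING RING ABOVE ends up in a single cluster.
    assert_eq!(naive_clusters("a\u{030a}"), vec!["a\u{030a}".to_string()]);
    println!("{:?}", naive_clusters("a\u{030a}bc"));
}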


.chars() is very unfortunately named. These are not characters, but Unicode code points (strictly speaking, scalar values). A code point may represent multiple characters at once (digraphs, ligatures), or one character, or a fragment of a character (combining marks), or a fragment of a fragment of a character (emoji ZWJ sequences), or a control code that is not even displayable (directionality controls, language tags, variation selectors, the object replacement character).
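To make the mismatch concrete (a small sketch using unicode-segmentation; the family emoji is just one arbitrary ZWJ sequence):

use unicode_segmentation::UnicodeSegmentation;

fn main() {
    // U+1F468 MAN, U+200D ZWJ, U+1F469 WOMAN, U+200D ZWJ, U+1F467 GIRL
    let family = "\u{1f468}\u{200d}\u{1f469}\u{200d}\u{1f467}";
    assert_eq!(family.graphemes(true).count(), 1); // displayed as one "character"
    assert_eq!(family.chars().count(), 5);         // five scalar values
    assert_eq!(family.len(), 18);                  // eighteen bytes in UTF-8
}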

So yes, at the very least it should use some other name (Go calls them runes) to steer people looking for "characters" away from code points.

Grapheme clusters are closer to being "characters", but human writing systems are complex, and not all of them even have an equivalent of a "character" as we know it from Latin scripts (a fun example is Hangul, which has letters, but the letters are not written as separate characters: they are composed into syllable blocks). What you should use very much depends on what you're trying to do, and how worried you are about producing nonsense presentation/orthography/editing of non-Latin text.
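For example (again a small sketch using unicode-segmentation; the decomposed jamo spelling of one syllable is an arbitrary choice):

use unicode_segmentation::UnicodeSegmentation;

fn main() {
    // Decomposed jamo: U+1112 HIEUH + U+1161 A + U+11AB NIEUN,
    // canonically equivalent to the precomposed syllable 한 (U+D55C).
    let syllable = "\u{1112}\u{1161}\u{11ab}";
    assert_eq!(syllable.graphemes(true).count(), 1); // one syllable block
    assert_eq!(syllable.chars().count(), 3);         // three letters (jamo)
}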

Rust's situation is a pragmatic compromise. Full, proper Unicode support is very complex, requires a lot of algorithms (many of them locale-specific), and it's not easy to use them properly. At the other extreme, working only with raw bytes is terrible. So Rust's code-point-level support is not as bad as bytes, and not as expensive as full Unicode support.

