String implementation

jbe · June 23, 2022, 12:08pm

Here is an example of how several codepoints may represent a single glyph:

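fn main() {
    // U+1F46E POLICE OFFICER, U+200D ZERO WIDTH JOINER, U+2642 MALE SIGN
    let m = "\u{1F46E}\u{200D}\u{2642}";
    // U+1F46E POLICE OFFICER on its own
    let x = "\u{1F46E}";
    // U+1F46E POLICE OFFICER, U+200D ZERO WIDTH JOINER, U+2640 FEMALE SIGN
    let f = "\u{1F46E}\u{200D}\u{2640}";
    println!("{m} ({} chars and {} bytes)", m.chars().count(), m.len());
    println!("{x} ({} chars and {} bytes)", x.chars().count(), x.len());
    println!("{f} ({} chars and {} bytes)", f.chars().count(), f.len());
}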

Output:

👮‍♂ (3 chars and 10 bytes)
👮 (1 chars and 4 bytes)
👮‍♀ (3 chars and 10 bytes)

(Depending on your font, you might need to look closely to see the difference.)

If we cut one of these strings at the codepoint boundaries, we get this:

fn main() {
    let m = "\u{1F46E}\u{200D}\u{2642}";
    println!("{m} ({} chars and {} bytes)", m.chars().count(), m.len());
    // `char` implements `Display`, so each one can be printed directly
    // without encoding it into a scratch buffer first.
    let mut chars = m.chars();
    println!("First `char`: '{}'", chars.next().unwrap());
    println!("Second `char`: '{}'", chars.next().unwrap());
    println!("Third `char`: '{}'", chars.next().unwrap());
}

Output:

👮‍♂ (3 chars and 10 bytes)
First `char`: '👮'
Second `char`: '‍'
Third `char`: '♂'

Here the second `char` is the zero-width joiner (U+200D), which is invisible when printed on its own.
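Since the joiner does not show up when printed, it can help to list the code points numerically instead. A minimal sketch using only the standard library; the helper name `code_points` is made up for this illustration:

```rust
/// Hypothetical helper: formats every `char` of `s` as "U+XXXX".
fn code_points(s: &str) -> Vec<String> {
    s.chars().map(|c| format!("U+{:04X}", c as u32)).collect()
}

fn main() {
    let m = "\u{1F46E}\u{200D}\u{2642}";
    // Prints the three code points, including the invisible U+200D.
    println!("{:?}", code_points(m));
}
```

This makes the middle `char` visible as `U+200D` even though the terminal renders nothing for it.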

There are other weird characters like that, such as optional (soft) hyphens, byte-order marks (BOM), combining accents, and so on.
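Accents are an easy case to try out. The following sketch (standard library only; the helper name `char_and_byte_counts` is hypothetical) compares "é" written as one precomposed code point with "é" written as a plain "e" followed by a combining acute accent:

```rust
/// Hypothetical helper: returns (number of `char`s, number of UTF-8 bytes).
fn char_and_byte_counts(s: &str) -> (usize, usize) {
    (s.chars().count(), s.len())
}

fn main() {
    let precomposed = "\u{E9}"; // U+00E9 LATIN SMALL LETTER E WITH ACUTE
    let decomposed = "e\u{301}"; // 'e' followed by U+0301 COMBINING ACUTE ACCENT
    for s in [precomposed, decomposed] {
        let (chars, bytes) = char_and_byte_counts(s);
        println!("{s} ({chars} chars and {bytes} bytes)");
    }
    // `==` on `str` compares bytes, so the two spellings are not equal.
    println!("equal as strings: {}", precomposed == decomposed);
}
```

Both strings usually render as the same glyph, yet they differ in `char` count, byte length, and equality.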

My point is: processing Unicode is always difficult and requires a lot of care. Indexing by "Unicode Scalar Values" or codepoints doesn't make sense in all cases, as you might suddenly end up seeing symbols that are otherwise invisible, as in the example above.
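To make that concrete, here is a small sketch (standard library only; `char_starts` is a hypothetical helper) showing that even a cut at a valid `char` boundary changes what the reader sees:

```rust
/// Hypothetical helper: byte offsets at which each `char` of `s` starts.
fn char_starts(s: &str) -> Vec<usize> {
    s.char_indices().map(|(i, _)| i).collect()
}

fn main() {
    let m = "\u{1F46E}\u{200D}\u{2642}";
    println!("char starts: {:?}", char_starts(m));
    // Byte offset 2 lies inside the first code point, so `get` returns `None`.
    println!("{:?}", m.get(..2));
    // Byte offset 4 is a valid `char` boundary, but the slice is only the
    // plain police officer: the joiner and the male sign are cut off.
    println!("{:?}", m.get(..4));
}
```

The standard library stops at `char` granularity. Splitting on user-perceived characters (grapheme clusters) needs extra segmentation rules, which are typically pulled in from an external crate such as `unicode-segmentation`.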
