String implementation

Here is an example of how several codepoints may represent a single glyph:

fn main() {
    let m = "\u{1F46E}\u{200D}\u{2642}";
    let x = "\u{1F46E}";
    let f = "\u{1F46E}\u{200D}\u{2640}";
    println!("{m} ({} chars and {} bytes)", m.chars().count(), m.len());
    println!("{x} ({} chars and {} bytes)", x.chars().count(), x.len());
    println!("{f} ({} chars and {} bytes)", f.chars().count(), f.len());
}

(Playground)

Output:

👮‍♂ (3 chars and 10 bytes)
👮 (1 chars and 4 bytes)
👮‍♀ (3 chars and 10 bytes)

(Depending on your font, you might need to look closely to see the difference.)

👮‍♂ man police officer
👮 police officer
👮‍♀ woman police officer

If we cut one of these strings at the codepoint boundaries, we get this:

fn main() {
    // U+1F46E (police officer) + U+200D (zero-width joiner) + U+2642 (male sign)
    let m = "\u{1F46E}\u{200D}\u{2642}";
    println!("{m} ({} chars and {} bytes)", m.chars().count(), m.len());
    // Walk the string one `char` (Unicode scalar value) at a time.
    let mut chars = m.chars();
    println!("First `char`: '{}'", chars.next().unwrap());
    println!("Second `char`: '{}'", chars.next().unwrap());
    println!("Third `char`: '{}'", chars.next().unwrap());
}

(Playground)

Output:

👮‍♂ (3 chars and 10 bytes)
First `char`: '👮'
Second `char`: '‍'
Third `char`: '♂'

The second `char` is the zero-width joiner (U+200D), which is invisible on its own.

There are other unusual characters like that, such as soft (optional) hyphens, byte-order marks (BOM), combining accents, and so on.
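To illustrate the accent case, here is a minimal sketch comparing a precomposed "é" with "e" followed by a combining acute accent. The helper `char_and_byte_counts` is a name made up for this example, not something from the standard library:

```rust
/// Hypothetical helper: returns (number of `char`s, number of UTF-8 bytes).
fn char_and_byte_counts(s: &str) -> (usize, usize) {
    (s.chars().count(), s.len())
}

fn main() {
    // Precomposed: U+00E9 LATIN SMALL LETTER E WITH ACUTE.
    let precomposed = "\u{E9}";
    // Decomposed: 'e' followed by U+0301 COMBINING ACUTE ACCENT.
    let decomposed = "e\u{301}";

    for s in [precomposed, decomposed] {
        let (chars, bytes) = char_and_byte_counts(s);
        println!("{s} ({chars} chars and {bytes} bytes)");
    }

    // Both usually render as the same glyph, but `==` compares the
    // underlying bytes, so the strings are not equal.
    println!("equal: {}", precomposed == decomposed);
}
```

The precomposed form is 1 `char` and 2 bytes, the decomposed form is 2 `char`s and 3 bytes, and the comparison prints `false` because the standard library does no Unicode normalization.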

My point is: processing Unicode is always difficult and requires a lot of care. Indexing by "Unicode scalar values" (codepoints) doesn't make sense in all cases, as you might suddenly end up seeing symbols like ♂ that are otherwise not visible on their own (as in the example above).
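As a rough sketch of how this can go wrong in practice, consider a naive "keep the first n characters" function. The name `truncate_chars` is invented for this example. It only ever cuts at valid `char` boundaries, yet it still tears the glyph apart:

```rust
/// Hypothetical helper: keeps at most `n` `char`s (Unicode scalar values).
/// This is a naive truncation that knows nothing about grapheme clusters.
fn truncate_chars(s: &str, n: usize) -> String {
    s.chars().take(n).collect()
}

fn main() {
    // One glyph made of three `char`s: police officer + ZWJ + male sign.
    let m = "\u{1F46E}\u{200D}\u{2642}";

    for n in 1..=3 {
        let t = truncate_chars(m, n);
        println!("first {n} char(s): '{t}' ({} bytes)", t.len());
    }

    // Byte offsets where each `char` starts. All of them are valid places
    // to slice a `&str`, but only offset 0 keeps the glyph intact.
    for (offset, c) in m.char_indices() {
        println!("U+{:04X} starts at byte {offset}", c as u32);
    }
}
```

Truncating to one `char` silently turns the man police officer into the plain police officer. The standard library has no notion of grapheme clusters, so code that needs to split on user-perceived characters would typically rely on an external crate such as `unicode-segmentation`.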
