Here is an example of how several codepoints may represent a single glyph:
fn main() {
    // U+1F46E POLICE OFFICER joined (via U+200D ZERO WIDTH JOINER) with
    // U+2642 MALE SIGN; U+1F46E alone; and U+1F46E joined with U+2640 FEMALE SIGN.
    let m = "\u{1F46E}\u{200D}\u{2642}";
    let x = "\u{1F46E}";
    let f = "\u{1F46E}\u{200D}\u{2640}";
    println!("{m} ({} chars and {} bytes)", m.chars().count(), m.len());
    println!("{x} ({} chars and {} bytes)", x.chars().count(), x.len());
    println!("{f} ({} chars and {} bytes)", f.chars().count(), f.len());
}
Output:
👮♂ (3 chars and 10 bytes)
👮 (1 chars and 4 bytes)
👮♀ (3 chars and 10 bytes)
(Depending on your font, you might need to look closely to see the difference.)
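If you want to count what a reader perceives as one symbol, you need grapheme cluster segmentation, which is not in the standard library. Here is a minimal sketch assuming the third-party unicode-segmentation crate:
// Assumes the third-party `unicode-segmentation` crate (not part of std).
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let m = "\u{1F46E}\u{200D}\u{2642}";
    // graphemes(true) iterates over extended grapheme clusters, so the
    // whole ZWJ emoji sequence is treated as a single unit.
    println!("{m}: {} grapheme(s), {} chars, {} bytes",
        m.graphemes(true).count(),
        m.chars().count(),
        m.len());
}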
If we cut one of these strings at the codepoint boundaries, we get this:
fn main() {
    let mut buf = [0u8; 4];
    let m = "\u{1F46E}\u{200D}\u{2642}";
    println!("{m} ({} chars and {} bytes)", m.chars().count(), m.len());
    println!("First `char`: '{}'",
        m.chars().next().unwrap().encode_utf8(&mut buf)
    );
    println!("Second `char`: '{}'",
        m.chars().nth(1).unwrap().encode_utf8(&mut buf)
    );
    println!("Third `char`: '{}'",
        m.chars().nth(2).unwrap().encode_utf8(&mut buf)
    );
}
Output:
👮♂ (3 chars and 10 bytes)
First `char`: '👮'
Second `char`: ''
Third `char`: '♂'
Here the second `char` is the zero-width joiner (ZWJ, U+200D), which is invisible on its own.
There are other special characters like that, such as soft (optional) hyphens, byte-order marks (BOM), combining accents, and so on.
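For example, a combining accent produces the same glyph as a precomposed character while having a different codepoint and byte count. A quick sketch using only the standard library:
fn main() {
    // "é" as a single precomposed codepoint vs. "e" followed by
    // U+0301 COMBINING ACUTE ACCENT: same glyph, different encoding.
    let precomposed = "\u{00E9}";
    let combining = "e\u{0301}";
    println!("{precomposed}: {} chars, {} bytes",
        precomposed.chars().count(), precomposed.len());
    println!("{combining}: {} chars, {} bytes",
        combining.chars().count(), combining.len());
    // They render identically but do not compare equal as strings.
    println!("equal as &str? {}", precomposed == combining);
}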
My point is: processing Unicode is always difficult and requires a lot of care. Indexing by "Unicode Scalar Values" (codepoints) doesn't make sense in all cases, as you might suddenly end up with characters like the zero-width joiner from the example above, which are invisible on their own.
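Byte indexing needs the same care: slices must land on char boundaries. A small sketch, again using only the standard library:
fn main() {
    let m = "\u{1F46E}\u{200D}\u{2642}";
    // str::get returns None (instead of panicking) when the byte range
    // does not fall on char boundaries.
    assert!(m.get(0..4).is_some()); // U+1F46E occupies bytes 0..4
    assert!(m.get(0..5).is_none()); // byte 5 is in the middle of the ZWJ
    // char_indices reports the byte offset of each codepoint.
    for (i, c) in m.char_indices() {
        println!("byte offset {i}: U+{:04X}", c as u32);
    }
}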