Asking out of curiosity, why does the String interface use Characters instead of Code Points, and Bytes instead of Code Units?
My understanding is that a Character (also a Glyph) is the visual element, and Code Points are the indices to the visual elements. Since the code itself is not the visual part of the string, it seems that whatever renders the string is what is concerned with Characters. This also makes me wonder about composite Characters, where two code points are combined to form one Character (see: Combining character - Wikipedia). Would this not create confusion, since it is up to the renderer (terminal, GUI, etc.) to determine how to render specific combinations of Code Points?
Similar commentary around using Bytes, Shorts and Words. The term Code Unit is used to indicate the smallest unit a code point can be stored as: in UTF-8 that is a byte (u8), in UTF-16 a short (u16), and in UTF-32 a word (u32). Note, however, that code units have specific validation requirements. Since we are all about safety and security, does it not make more sense to present interfaces that implement a CodeUnit trait and define types such as type UTF8CodeUnit = u8;?
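A minimal sketch of what such an interface might look like. To be clear, none of these names exist in std; `CodeUnit`, `Utf8CodeUnit`, etc. are hypothetical illustrations of the suggestion above:

```rust
// Hypothetical trait marking the storage unit of each UTF encoding.
// This is a sketch of the poster's suggestion, not anything in std.
trait CodeUnit: Copy {
    /// Width of one code unit in bits.
    const BITS: u32;
}

impl CodeUnit for u8 {
    const BITS: u32 = 8;
}
impl CodeUnit for u16 {
    const BITS: u32 = 16;
}
impl CodeUnit for u32 {
    const BITS: u32 = 32;
}

// One alias per encoding, as suggested (names are illustrative).
type Utf8CodeUnit = u8;
type Utf16CodeUnit = u16;
type Utf32CodeUnit = u32;

fn main() {
    // A validation hook could live on the trait; here we only show
    // that the alias carries the expected width for each encoding.
    assert_eq!(<Utf8CodeUnit as CodeUnit>::BITS, 8);
    assert_eq!(<Utf16CodeUnit as CodeUnit>::BITS, 16);
    assert_eq!(<Utf32CodeUnit as CodeUnit>::BITS, 32);
    println!("ok");
}
```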
What are some examples of applications which would be simplified by having it in std (as unicode-segmentation already has the grapheme, word, and sentence iterators)?
The TLDR is that a "character" as defined by Unicode is much more associated (generally 1:1) with code points than glyphs or grapheme clusters: the distinction is basically just that character is logical and code point is the number representing it. The meaning has drifted pretty far from the days of lead type.
That said, depending on context a "character" may be more appropriately considered a grapheme cluster, e.g. when editing; but a glyph is only rarely what you intend unless you're dealing with rendering (e.g. a diacritic placed above a base character is often a separate glyph in the font even when a single character represents both).
Bytes is shorter than CodeUnit and happens to be accurate for UTF-8. I really don't think there's any more to it than that. Being generic wasn't a concern because, as far as the standard library is concerned, UTF-8 is the one true character encoding.
For char it is a bit of an unfortunate name, given the ambiguity (a C char is different from a Unicode Character, etc). But the correct name here would be, as the docs say, Unicode Scalar Value. Which is a bit of a mouthful.
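The "scalar value" distinction is observable in std: `char::from_u32` rejects exactly the code points that are not scalar values, namely the surrogate range and anything above U+10FFFF:

```rust
fn main() {
    // A `char` is a Unicode scalar value: any code point
    // except the surrogates U+D800..=U+DFFF.
    assert_eq!(char::from_u32(0x61), Some('a'));

    // Surrogate code points are not scalar values, so they
    // cannot be represented as `char`.
    assert_eq!(char::from_u32(0xD800), None);

    // Code points end at U+10FFFF; anything beyond is rejected too.
    assert_eq!(char::from_u32(0x110000), None);
}
```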
This is not accurate. There are many characters that are constructed from multiple code points (combining marks, decomposed characters, jamo, ligatures, emoji). There are also code points that represent more than one visual character (precomposed).
There are also plenty of code points that are control characters without any visual meaning themselves (e.g. change of writing direction).
The closest thing you can get to a character is "grapheme cluster", but it's a complex-to-construct variable-width string fragment.
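A family emoji makes the char/byte/grapheme split concrete. This sketch uses only std, so the grapheme count is stated in a comment rather than computed (std has no grapheme iterator; a crate like unicode-segmentation provides one):

```rust
fn main() {
    // Family emoji: three emoji joined by two zero-width joiners (U+200D).
    let family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}";

    // Five code points (chars)...
    assert_eq!(family.chars().count(), 5);

    // ...stored as 18 UTF-8 bytes (4 + 3 + 4 + 3 + 4)...
    assert_eq!(family.len(), 18);

    // ...but a grapheme-cluster iterator (e.g. unicode-segmentation's
    // `graphemes()`, not in std) would report a single "character".
}
```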
There is no Unicode encoding that supports random access indexing, because Unicode itself at its core is a stateful machine. It has all of the edge cases of all human writing systems, plus most of the edge cases of other encodings.
Processing on code point level can break text even in Latin-based languages due to NFD code points. That's just how Unicode works.
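A small std-only illustration of that NFD breakage: reversing a string code point by code point detaches a combining accent from its base letter:

```rust
fn main() {
    // "é" in NFD form: base letter 'e' followed by
    // the combining acute accent U+0301.
    let nfd = "e\u{301}";

    // Reversing by code points moves the combining mark
    // in front of the letter it was attached to.
    let reversed: String = nfd.chars().rev().collect();
    assert_eq!(reversed, "\u{301}e");

    // The result is still valid UTF-8, but the accent now
    // "combines" with whatever precedes it — the text is broken
    // even though every individual code point survived.
}
```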
If Rust supported indexing by code points it would give the false impression that this is a useful and correct operation, but it isn't. It's incorrect in Unicode; it just breaks a little bit later than indexing by bytes.
If you have an algorithm that doesn't need to care about semantics of characters, operating on bytes (code units) is simpler and more efficient. If you do need to process text with awareness of visual characters, then you can't rely on just code points.
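Rust's byte-oriented API reflects this: operations that don't interpret characters work directly on bytes, while std's `is_char_boundary` shows why naive byte indexing is unsafe for slicing:

```rust
fn main() {
    let s = "héllo"; // 'é' occupies two UTF-8 bytes

    // Semantics-free byte operations are simple and efficient.
    assert!(s.as_bytes().starts_with(b"h"));

    // But arbitrary byte offsets can land inside a code point:
    // byte index 2 is the second byte of 'é'.
    assert!(!s.is_char_boundary(2));

    // For that reason `&s[1..2]` would panic at runtime,
    // rather than silently producing invalid UTF-8.
}
```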
You'll notice that https://unicode.org/glossary/#character has 4 different definitions, so to me anything saying stuff like "is concerned with Characters" is being insufficiently precise.
My current favourite example why USVs are not the rendering unit (nor are codepoints):