Unicode strings could have been defined as the equivalent of an array of chars only if everyone was willing to have strings waste an awful lot of space, especially if they're mostly ASCII.
UTF-16 instead of UTF-8 would allow constant-time char indexing at the cost of making ASCII strings twice as large. But only if you're willing to ignore the non-BMP characters encoded as surrogate pairs, meaning it doesn't really solve the problem.
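For what it's worth, here's a rough Rust sketch of that size tradeoff (the sample strings are arbitrary): ASCII costs one byte per character in UTF-8, two in UTF-16, and four in a true "array of chars" (UTF-32) view.

    fn main() {
        let ascii = "hello world";      // pure ASCII
        let music = "clef: \u{1D11E}";  // contains a non-BMP code point

        for s in [ascii, music] {
            let utf8 = s.len();                        // bytes as UTF-8
            let utf16 = s.encode_utf16().count() * 2;  // bytes as UTF-16
            let utf32 = s.chars().count() * 4;         // bytes as an "array of chars" (UTF-32)
            println!("{s:?}: UTF-8 = {utf8} B, UTF-16 = {utf16} B, UTF-32 = {utf32} B");
        }
    }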
Rust's approach makes sense in a world larger than just U.S. English.
It wouldn't. To cover the full Unicode space (there are more than 65,535 characters defined; the codespace runs to about 1.1 million code points) you still need extension mechanisms as with UTF-8, you just need them less often. UTF-32 is the only constant-width encoding in the UTF encoding family.
That’s only true if you’re comfortable treating codepoints as separate units, but there are plenty of combining codepoints that only make semantic sense when paired with another one. In some sense, there’s no such thing as a constant-width Unicode encoding.
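A quick Rust illustration of that (just the classic combining-accent example): even iterating by code point, the "same" character can be one unit or two, so fixed width per codepoint still isn't fixed width per character.

    fn main() {
        let combining = "e\u{301}";  // 'e' followed by U+0301 COMBINING ACUTE ACCENT
        let precomposed = "\u{e9}";  // U+00E9 LATIN SMALL LETTER E WITH ACUTE

        // Both render as "é", but one is two code points and the other is one.
        assert_eq!(combining.chars().count(), 2);
        assert_eq!(precomposed.chars().count(), 1);

        // Indexing user-perceived characters needs grapheme cluster segmentation
        // (e.g. the unicode-segmentation crate), not just a fixed-width encoding.
        println!("{combining} vs {precomposed}: byte-equal? {}", combining == precomposed);
    }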
Surrogate pairs are used in UTF-16 to encode code points not in the BMP. If you don't deal with surrogate pairs, you are automatically restricted to BMP code points. That was my point.
I only mentioned the UTF-16 thing because that is how .NET handles strings: you get constant-time indexing of UTF-16 code units (each .NET char is one code unit), but at the cost of spotty non-BMP support, since any given index may in fact be half of a code point. Lots of existing .NET code handles this incorrectly, or just ignores surrogate pairs altogether.
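To make that concrete, a small Rust sketch of what per-code-unit indexing actually sees when a non-BMP character is involved (the string is arbitrary):

    fn main() {
        let s = "\u{1D11E} clef";  // U+1D11E MUSICAL SYMBOL G CLEF, outside the BMP
        let units: Vec<u16> = s.encode_utf16().collect();

        // Indexing by code unit, .NET-style: units[0] is only half the clef.
        let first = units[0];
        assert!((0xD800..0xDC00).contains(&first));  // a lone high surrogate
        println!("units[0] = {first:#06x}, which is half a code point");
    }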
It does, sort of. Surrogate code points are never assigned to characters; they're permanent holes in the codepoint space, reserved so that UTF-16 can encode the codepoints that need two code units.
It's just that a lot of systems were built under the assumption that UCS-2 was enough and treat UTF-16 code units as codepoints, so the reserved surrogate range ends up being handled sort of like real codepoints.
Proper treatment of UTF-16 transparently decodes surrogate pairs into the proper codepoints just like proper treatment of UTF-8 does.
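In Rust terms, that proper treatment is roughly what char::decode_utf16 in the standard library does; a small sketch (string chosen arbitrarily):

    fn main() {
        let units: Vec<u16> = "a\u{1D11E}b".encode_utf16().collect();
        assert_eq!(units.len(), 4);  // 'a', high surrogate, low surrogate, 'b'

        // Decoding pairs the surrogates back up into a single code point.
        let decoded: Vec<char> = char::decode_utf16(units)
            .map(|r| r.unwrap_or(char::REPLACEMENT_CHARACTER))
            .collect();
        assert_eq!(decoded, vec!['a', '\u{1D11E}', 'b']);
    }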
True. A lot of broken code though... cough .NET cough... treats a UTF-16 code unit as a character and allows constant-time string indexing, leaving higher-level code to deal with (or ignore) surrogate pairs. Usually ignore.
Code Unit. The minimal bit combination that can represent a unit of encoded text for processing or interchange. The Unicode Standard uses 8-bit code units in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form, and 32-bit code units in the UTF-32 encoding form. (See definition D77 in Section 3.9, Unicode Encoding Forms.)
Code Point. (1) Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF₁₆. (See definition D10 in Section 3.4, Characters and Encoding.) Not all code points are assigned to encoded characters. See code point type. (2) A value, or position, for a character, in any coded character set.
Designated Code Point. Any code point that has either been assigned to an abstract character (assigned characters) or that has otherwise been given a normative function by the standard (surrogate code points and noncharacters). This definition excludes reserved code points. Also known as assigned code point. (See Section 2.4, Code Points and Characters.)
Surrogate Character. A misnomer. It would be an encoded character having a surrogate code point, which is impossible. Do not use this term.
Surrogate Code Point. A Unicode code point in the range U+D800..U+DFFF. Reserved for use by UTF-16, where a pair of surrogate code units (a high surrogate followed by a low surrogate) “stand in” for a supplementary code point.
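Which is also why Rust's char is defined as a Unicode scalar value: a surrogate code point simply can't become a char. A tiny check:

    fn main() {
        // Surrogate code points are not scalar values, so they're rejected outright.
        assert_eq!(char::from_u32(0xD800), None);
        // The top of the codespace is still a perfectly valid char, though.
        assert_eq!(char::from_u32(0x10FFFF), Some('\u{10FFFF}'));
    }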