Why are str and [char] not equivalent?

The title basically asks the whole question. The only way to convert from str to [char] that I found is:

&s.chars().collect::<Vec<_>>()[..]

Why is this so hard, and why are str and [char] not equivalent in the first place?

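The short version is that str is UTF-8, where each codepoint is encoded in 1 to 4 bytes, so the characters in a str are variable-width. A char is a fixed 4-byte value holding a single Unicode codepoint (strictly, a scalar value). The two have different memory layouts, so a str can't be viewed as a [char] without decoding it into a new buffer.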
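To make the width difference concrete, here is a minimal sketch using only the standard library. The helper name `utf8_widths` is made up for illustration; `char::len_utf8` and `std::mem::size_of` are the real std APIs.

```rust
use std::mem::size_of;

/// UTF-8 width in bytes of each char in `s` (hypothetical helper).
fn utf8_widths(s: &str) -> Vec<usize> {
    s.chars().map(char::len_utf8).collect()
}

fn main() {
    // U+0061, U+00E9, U+20AC, U+1F600, written as escapes so the
    // code points are unambiguous.
    let s = "a\u{e9}\u{20ac}\u{1f600}";
    println!("UTF-8 widths per char: {:?}", utf8_widths(s));
    println!("str length in bytes:   {}", s.len());
    println!("size_of::<char>():     {}", size_of::<char>());

    // Going from &str to &[char] needs a decoding pass and a new allocation.
    let chars: Vec<char> = s.chars().collect();
    let slice: &[char] = &chars;
    println!("chars in the slice:    {}", slice.len());
}
```

The four chars take 1, 2, 3 and 4 bytes in the str, but 4 bytes each once collected into a Vec<char>.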

I DO miss the days of ASCII ascendancy. But we humans rush to Babel every chance we get.

Unicode strings could have been defined as the equivalent of an array of chars only if everyone was willing to have strings waste an awful lot of space, especially if they're mostly ASCII.
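A rough way to see the space cost, assuming a string stored as an array of 4-byte chars (the function name `utf8_vs_char_array` is just for this sketch):

```rust
use std::mem::size_of;

/// Bytes used by `s` as UTF-8 versus as a hypothetical array of `char`.
fn utf8_vs_char_array(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count() * size_of::<char>())
}

fn main() {
    let (utf8, wide) = utf8_vs_char_array("hello, world");
    println!("UTF-8: {utf8} bytes, array of char: {wide} bytes");
}
```

For pure ASCII text the char array is four times the size of the UTF-8 form.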

UTF-16 instead of UTF-8 would allow constant-time char indexing at the cost of making ASCII strings twice as large. But only if you are willing to ignore non-BMP characters, which are encoded as surrogate pairs, meaning it doesn't really solve the problem.

Rust's approach makes sense in a world larger than just U.S. English.

It wouldn't. To cover the full Unicode space (far more than 65,536 code points; the codespace holds about 1.1 million) you still need an extension mechanism, as with UTF-8, you just need it less often. UTF-32 is the only constant-width encoding in the UTF family.

Read the whole paragraph.

That’s only true if you’re comfortable treating codepoints as separate units, but there are plenty of combining codepoints that only make semantic sense when paired with another one. In some sense, there’s no such thing as a constant-width Unicode encoding.
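A small sketch of the combining-codepoint point, using only std. The helper `code_points` is a made-up name. Note that the standard library does not segment graphemes; as far as I know that needs an external crate such as unicode-segmentation.

```rust
/// Number of code points (`char`s) in `s` (hypothetical helper).
fn code_points(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    let precomposed = "\u{e9}"; // LATIN SMALL LETTER E WITH ACUTE
    let decomposed = "e\u{301}"; // 'e' followed by COMBINING ACUTE ACCENT

    // Both usually render as the same glyph, yet they differ in length
    // and do not compare equal without normalization.
    println!("precomposed: {} code point(s)", code_points(precomposed));
    println!("decomposed:  {} code point(s)", code_points(decomposed));
    println!("equal as strings: {}", precomposed == decomposed);
}
```

So even a [char] does not give you one element per user-visible character.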

Surrogate pairs are not the problem. U+7fff (or in that range) and above is the problem.

Talking about codepoints, not graphemes, UTF-32 is constant width.

But you are right, graphemes can't be encoded constant width.

Surrogate pairs are used in UTF-16 to encode code points not in the BMP. If you don't deal with surrogate pairs, you are automatically restricted to BMP code points. That was my point.

I only mentioned the UTF-16 thing because that is how .NET handles strings, giving you char indexing into code units, but at the cost of spotty non-BMP support since each index may in fact be half of a code point. Lots of existing .NET code deals with this incorrectly, or just ignores surrogate pairs altogether.
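For illustration, Rust can show the UTF-16 code units directly via `str::encode_utf16` (the wrapper `utf16_units` is a name invented for this sketch):

```rust
/// UTF-16 code units for `s` (hypothetical helper).
fn utf16_units(s: &str) -> Vec<u16> {
    s.encode_utf16().collect()
}

fn main() {
    // U+00E9 is in the BMP: a single code unit.
    println!("{:04X?}", utf16_units("\u{e9}"));
    // U+1F600 is outside the BMP: a high surrogate plus a low surrogate.
    let units = utf16_units("\u{1f600}");
    println!("{:04X?}", units);
    // Indexing by code unit lands on half of the code point.
    println!("units[0] alone = {:04X}", units[0]);
}
```

Indexing into that array by code unit is constant time, but index 0 of the second string is only the high surrogate, not a character.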

Oh, sorry, yes, you are right. I was misremembering things and assumed UTF-16 worked the same way as UTF-8.

No problem. I knew we were trying to make the same point. :grinning:

It does, sort of. Surrogate pair codepoints are not assigned codepoints. They're permanent holes in codepoint values to allow UTF-16 to encode two-code-unit-width codepoints.

It's just that a lot of systems were built under the assumption that UCS-2 was enough and treat UTF-16 code units as codepoints, so the reserved surrogate range is treated sort of like real codepoints.

Proper treatment of UTF-16 transparently decodes surrogate pairs into the proper codepoints just like proper treatment of UTF-8 does.
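As a sketch of what "proper treatment" looks like in Rust, `String::from_utf16` decodes surrogate pairs and rejects unpaired ones. The wrapper `decode` is a name made up for this example.

```rust
/// Decodes UTF-16 code units, returning None for ill-formed input
/// (hypothetical wrapper around String::from_utf16).
fn decode(units: &[u16]) -> Option<String> {
    String::from_utf16(units).ok()
}

fn main() {
    // A well-formed surrogate pair decodes to one code point, U+1F600.
    println!("{:?}", decode(&[0xD83D, 0xDE00]));
    // A lone high surrogate is not valid UTF-16 and is rejected.
    println!("{:?}", decode(&[0xD83D]));
}
```

`char::decode_utf16` does the same job as an iterator if you want per-unit error handling instead of all-or-nothing.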

The technical term is "scalar values". Surrogates are code points but not scalar values. It's hard naming things.
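Rust's char is defined as a scalar value, which you can check with `char::from_u32`. The helper `is_scalar_value` below is just an illustrative name.

```rust
/// True if `cp` is a Unicode scalar value, i.e. representable as a Rust char
/// (hypothetical helper).
fn is_scalar_value(cp: u32) -> bool {
    char::from_u32(cp).is_some()
}

fn main() {
    // Boundaries of the surrogate range and of the codespace.
    for cp in [0xD7FF_u32, 0xD800, 0xDFFF, 0xE000, 0x10FFFF, 0x110000] {
        println!("U+{:04X}: scalar value? {}", cp, is_scalar_value(cp));
    }
}
```

The surrogate code points U+D800..U+DFFF and anything above U+10FFFF are rejected, so a char can never hold half of a surrogate pair.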

True. A lot of broken code though...cough .NET cough... treats a UTF-16 code unit as a character and allows constant-time string indexing, letting higher level code deal with (or ignore) surrogate pairs. Usually ignore.

Unicode naming is a nightmare. I thought surrogates were code units, not code points?

Only when you're talking about UTF-16. Here's the Unicode glossary:

Code Unit. The minimal bit combination that can represent a unit of encoded text for processing or interchange. The Unicode Standard uses 8-bit code units in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form, and 32-bit code units in the UTF-32 encoding form. (See definition D77 in Section 3.9, Unicode Encoding Forms.)

Code Point. (1) Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF₁₆. (See definition D10 in Section 3.4, Characters and Encoding.) Not all code points are assigned to encoded characters. See code point type. (2) A value, or position, for a character, in any coded character set.

Designated Code Point. Any code point that has either been assigned to an abstract character (assigned characters) or that has otherwise been given a normative function by the standard (surrogate code points and noncharacters). This definition excludes reserved code points. Also known as assigned code point. (See Section 2.4 Code Points and Characters.)

Unicode Scalar Value. Any Unicode code point except high-surrogate and low-surrogate code points. In other words, the ranges of integers 0 to D7FF₁₆ and E000₁₆ to 10FFFF₁₆ inclusive. (See definition D76 in Section 3.9, Unicode Encoding Forms.)

Surrogate Character. A misnomer. It would be an encoded character having a surrogate code point, which is impossible. Do not use this term.

Surrogate Code Point. A Unicode code point in the range U+D800..U+DFFF. Reserved for use by UTF-16, where a pair of surrogate code units (a high surrogate followed by a low surrogate) “stand in” for a supplementary code point.

Surrogate Pair. A representation for a single abstract character that consists of a sequence of two 16-bit code units, where the first value of the pair is a high-surrogate code unit, and the second is a low-surrogate code unit. (See definition D75 in Section 3.8, Surrogates.)

Clear as mud, no matter how many times I've read it.
