Why is a char valid in JVM but invalid in Rust?

Hi everyone, I have a problem: why is this None in Rust?

char::from_u32(55296) // 55296 == 0xD800, i.e. U+D800

On the JVM, this value can be converted to a char without any errors.

Thank you in advance for your answer!

Perhaps because Kotlin is buggy. 55296 is 0xD800, which is not a valid Unicode character.


I notice your JVM link produces '?' when you hit the run button, probably its way of saying it does not know what 55296 is.

Please don't link to that Kotlin page any more. It disables the 'back' button.


U+D800 is a high-surrogate code point; it is not a real Unicode character. The Java Language Specification defines a char as any numeric value from '\u0000' to '\uffff', which is usually used to represent a UTF-16 code unit.

Meanwhile, a char in Rust stores a Unicode scalar value, which is any code point from U+0000 to U+10FFFF that is not a surrogate code point. Since U+D800 is a surrogate code point, it is not a Unicode scalar value, and it cannot be stored in a char.

Rust defines its char in this way to more closely match Unicode: each char encodes a single Unicode character. Since surrogate code points are an artifact of the UTF-16 encoding, they are invalid as characters, and their existence outside of UTF-16 can only be caused by an error. This is why you cannot convert 0xD800 into a char value.
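To see where that boundary sits, here's a quick sketch (nothing beyond std) probing char::from_u32 around the surrogate range:

    fn main() {
        assert_eq!(char::from_u32(0xD7FF), Some('\u{D7FF}')); // last code point before the surrogate block
        assert_eq!(char::from_u32(0xD800), None);             // high surrogate: not a scalar value
        assert_eq!(char::from_u32(0xDFFF), None);             // low surrogate: not a scalar value
        assert_eq!(char::from_u32(0xE000), Some('\u{E000}')); // first code point after the surrogate block
        assert_eq!(char::from_u32(0x110000), None);           // past U+10FFFF: out of range entirely
    }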


Thank you for your explanation. In fact, I'm parsing an Android Dex file, which contains 0xD800, so is there any way to convert it to a type in Rust? Or should I use a [u8; 2] to store it?

The problem is that "character" is an ambiguous term.

Rust's char is a Unicode scalar value (https://www.unicode.org/glossary/#unicode_scalar_value), whereas Java's char is a UTF-16 code unit (https://www.unicode.org/glossary/#code_unit).

Basically, Java -- like C# and Windows -- suffers from being created back when people thought 16 bits would be enough for Unicode, and designed in the now-obsolete UCS-2 encoding. UTF-16 is a horrible middle ground, which is why newer versions of C# have a Rune type, for example, which is a Unicode scalar value like Rust's char, and why they're talking about adding a Utf8String type (Introduce a Utf8String type · Issue #933 · dotnet/runtime · GitHub), like how Rust's String type is UTF-8.


You should read it into a u16 in Rust.

If you have multiple, you could then use String::from_utf16 to try to convert them into a Rust String. Note that Java tends not to actually enforce valid UTF-16 in its strings (like how Windows filenames don't have to be valid UTF-16), so it's absolutely possible for this to fail.
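For example, here's a rough sketch; the u16 values are made-up stand-ins for whatever your Dex parser actually reads:

    fn main() {
        // A well-formed sequence: a surrogate pair (encoding U+1F600) plus an ASCII code unit.
        let good: [u16; 3] = [0xD83D, 0xDE00, 0x0041];
        assert_eq!(String::from_utf16(&good).unwrap(), "\u{1F600}A");

        // A lone high surrogate, like the 0xD800 from the Dex file, is rejected.
        let bad: [u16; 1] = [0xD800];
        assert!(String::from_utf16(&bad).is_err());

        // If lossy output is acceptable, unpaired surrogates become U+FFFD instead of failing.
        assert_eq!(String::from_utf16_lossy(&bad), "\u{FFFD}");
    }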


My internet died while editing this, so I post it without having caught up to the thread...

It is because char in Java is a UTF-16 code unit, but in Rust char is a Unicode scalar value.

Java was born in the brief window of time just after most people finally realized that 8 bits is not enough to encode every symbol that appears even in ordinary English text, let alone multilingual text. So Java's char type is 16 bits, which was, in 1995, enough to encode any Unicode character.

Unfortunately, it was merely a year later that the Unicode consortium admitted that they had goofed, and, in fact, 16 bits is still not enough to encode every symbol that occurs in text. Consequently, they extended the Unicode code space to 21 bits (code points up to U+10FFFF). But, since there was a lot of industry momentum behind 16-bit characters, they didn't make the old encoding obsolete. Instead, they reserved the code points D800-DFFF as "surrogate" code points, and UTF-16 was born.

There are no characters corresponding to the surrogate code points. They are reserved so that UTF-16 can encode code points greater than 0xFFFF. In UTF-16, a scalar value (roughly, "a character") may be encoded either as one 16-bit (non-surrogate) code unit, or as two 16-bit code units that are both surrogates. So software like the JVM can still treat 16 bits (a UTF-16 code unit) as the fundamental atom of text, and it mostly still works with surrogate pairs, as long as you don't do anything too stupid. I'll let you infer how well "just don't do anything stupid" works on programmers.

Rust, on the other hand, was invented in a more enlightened time. The lesson of the past is that fixed-length encoding is a waste of space and awkward for compatibility reasons, so Rust's str type uses UTF-8. When you deal with code units in UTF-8, which are bytes, the type you use is simply u8. But if you want to deal with scalar values, which are independent of encoding, Rust provides char, which can represent any scalar value.
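To see both representations side by side, here's a minimal sketch (just std's char::encode_utf16 and char::encode_utf8) pushing one scalar value through each encoding:

    fn main() {
        let c = '\u{1F600}'; // one scalar value, code point U+1F600

        // In UTF-16 it becomes two 16-bit code units, both surrogates.
        let mut utf16 = [0u16; 2];
        c.encode_utf16(&mut utf16);
        assert_eq!(utf16, [0xD83D, 0xDE00]);

        // In UTF-8 it becomes four 8-bit code units; no surrogates anywhere.
        let mut utf8 = [0u8; 4];
        c.encode_utf8(&mut utf8);
        assert_eq!(utf8, [0xF0, 0x9F, 0x98, 0x80]);
    }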

To sum up:

  • Rust's char is a 32-bit number that represents a scalar value. It is encoding-independent, and can represent any valid code point except D800-DFFF, which are reserved. Java doesn't have a type like this.

  • Java's char is a 16-bit number that represents a UTF-16 code unit. (Not a "code point"!) As such, it can contain surrogates, but not code points greater than FFFF. Rust doesn't have a type like this (because Rust doesn't use UTF-16 pervasively, so there's no need for one).


Amazing!!!! I have gained a lot of new knowledge, thanks to several excellent answers! This helps me a lot, thanks again!


To clear up the confusion about what a "character" (which is not a well-defined thing), a code point, a code unit, a grapheme cluster, or Unicode in general actually is, read the UTF-8 Everywhere manifesto: https://utf8everywhere.org/
