Documenting that unicode-escaped characters in utf-8 literals use utf-32 representation

marcianx · November 6, 2017, 4:42am

There's a Rust syntactical fact that was non-obvious to me: Rust's UTF-8 string literals use UTF-32 escape sequences.

fn main() {
    // Sparkling Heart: http://www.fileformat.info/info/unicode/char/1f496/index.htm
    let sh1 = String::from_utf8(vec![0xF0, 0x9F, 0x92, 0x96]).unwrap();
    let sh2 = "\u{1F496}"; // utf-8 literal uses utf-32 escape sequences
    println!("{} == {} -> {}", sh1, sh2, &sh1 == sh2);
}


Should this perhaps be mentioned in some high-visibility location like the documentation of the &str primitive type? Or is this a well-known convention in the wild for languages that default to utf-8 due to the relative brevity of utf-32 in comparison to utf-8?

jimb · November 6, 2017, 5:03am

I wouldn't say that the \u escapes use UTF-32. They specify a Unicode character by its code point, a concept that is independent of any particular character encoding form. Then, all that needs to be said is that Rust String and str use UTF-8, and that is mentioned prominently in many places.
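To illustrate the distinction, here is a minimal sketch. The helper names `code_point` and `utf8_bytes` are made up for this example and are not part of the standard library. It shows that the escape names a number, and that the UTF-8 bytes only appear once that number is encoded:

```rust
/// Hypothetical helper: the code point of a char, as a plain number.
fn code_point(c: char) -> u32 {
    c as u32
}

/// Hypothetical helper: the UTF-8 encoding of a single char.
fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4]; // a char needs at most 4 bytes in UTF-8
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

fn main() {
    let heart = '\u{1F496}';

    // The escape specifies the code point U+1F496, independent of any encoding.
    assert_eq!(code_point(heart), 0x1F496);

    // Encoding that code point as UTF-8 yields the four bytes from the original post.
    assert_eq!(utf8_bytes(heart), vec![0xF0, 0x9F, 0x92, 0x96]);

    // A str stores those UTF-8 bytes, so the literal is 4 bytes long but 1 char.
    let s = "\u{1F496}";
    assert_eq!(s.len(), 4);
    assert_eq!(s.chars().count(), 1);

    println!("U+{:X} -> {:X?}", code_point(heart), utf8_bytes(heart));
}
```

The code point happens to have the same numeric value as its single UTF-32 code unit, which is probably why the escape looks like UTF-32 at first glance.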

marcianx · November 6, 2017, 5:16am

Ah I see, thanks for the clarification! I hadn't quite wrapped my mind around the distinction between a code point and its encoding as I don't deal with such things much.

I had just been surprised since, according to the fileformat.info page for Unicode Character 'SPARKLING HEART' (U+1F496), C/C++/Java apparently use the UTF-16 encoding for their escape sequences; so I expected to see the UTF-8 encoding in Rust's escape sequences.

jimb · November 6, 2017, 5:34am

I think fileformat.info is wrong there, for C and C++. According to cppreference.com, the four or eight hex digits following a \u or \U escape in a C++ string are also a Unicode code point, just like Rust. Rust just uses {curly brackets} instead of a fixed number of digits. So the correct way to write "Sparkling Heart" in a C++ string is "\U0001f496", not the given "\uD83D\uDC96". This C++ playground seems to agree.

In Java and JavaScript, the story is different. Those languages explicitly use UTF-16 strings, so a character like Sparkling Heart needs to be written as a surrogate pair. That's what fileformat.info is showing, I think.
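For reference, the surrogate pair can be derived from the code point with a little arithmetic. The following is a rough sketch in Rust; `surrogate_pair` is a hypothetical helper written for this example, and the standard library's `char::encode_utf16` does the same job:

```rust
/// Hypothetical helper: split a char outside the Basic Multilingual Plane
/// into its UTF-16 surrogate pair. Returns None for chars that fit in one unit.
fn surrogate_pair(c: char) -> Option<[u16; 2]> {
    let cp = c as u32;
    if cp < 0x10000 {
        return None; // encoded as a single UTF-16 code unit, no pair needed
    }
    let v = cp - 0x10000; // 20 bits remain after subtracting the offset
    let high = 0xD800 + (v >> 10) as u16; // top 10 bits
    let low = 0xDC00 + (v & 0x3FF) as u16; // bottom 10 bits
    Some([high, low])
}

fn main() {
    let heart = '\u{1F496}';

    // The pair that fileformat.info lists for Java-style escapes.
    let pair = surrogate_pair(heart).unwrap();
    assert_eq!(pair, [0xD83D, 0xDC96]);

    // The standard library's encoder agrees with the manual arithmetic.
    let mut buf = [0u16; 2];
    assert_eq!(heart.encode_utf16(&mut buf), &pair[..]);

    // Decoding the pair gives back the same string as the code point escape.
    let s = String::from_utf16(&pair).unwrap();
    assert_eq!(s, "\u{1F496}");

    println!("U+1F496 -> {:04X} {:04X}", pair[0], pair[1]);
}
```

Note that Rust itself rejects surrogate values in `\u{...}` escapes, since a surrogate on its own is not a valid `char`.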
