Documenting that Unicode-escaped characters in UTF-8 literals use UTF-32 representation


#1

There’s a fact about Rust syntax that was non-obvious to me: Rust’s UTF-8 string literals use UTF-32 escape sequences.

fn main() {
    // Sparkling Heart: http://www.fileformat.info/info/unicode/char/1f496/index.htm
    let sh1 = String::from_utf8(vec![0xF0, 0x9F, 0x92, 0x96]).unwrap();
    let sh2 = "\u{1F496}"; // utf-8 literal uses utf-32 escape sequences
    println!("{} == {} -> {}", sh1, sh2, &sh1 == sh2);
}

Should this perhaps be mentioned in some high-visibility location, like the documentation of the &str primitive type? Or is this a well-known convention for languages that default to UTF-8, given that the UTF-32 form of an escape is so much shorter than the UTF-8 form?


#2

I wouldn’t say that the \u escapes use UTF-32. They specify a Unicode character by its code point, a concept that is independent of any particular character encoding form. Then all that needs to be said is that Rust’s String and str use UTF-8, and that is mentioned prominently in many places.
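
To make the distinction concrete, here is a minimal sketch that looks at the same character both as a code point and as the UTF-8 bytes Rust stores for it. The helper name utf8_bytes is made up for this illustration; char::encode_utf8 is the standard library method it relies on.

```rust
/// Hypothetical helper: returns the UTF-8 bytes Rust stores for one character.
fn utf8_bytes(c: char) -> Vec<u8> {
    // A single char never needs more than 4 bytes in UTF-8.
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

fn main() {
    // The escape names a code point, not bytes in any particular encoding.
    let heart = '\u{1F496}';
    assert_eq!(heart as u32, 0x1F496);

    // In memory, the string literal is the UTF-8 encoding of that code point.
    assert_eq!(utf8_bytes(heart), vec![0xF0, 0x9F, 0x92, 0x96]);
    assert_eq!("\u{1F496}".as_bytes(), &[0xF0, 0x9F, 0x92, 0x96]);

    println!("U+{:X} is stored as {} UTF-8 bytes", heart as u32, heart.len_utf8());
}
```

So the number inside the braces is just the code point, and the four bytes only appear once the character is encoded.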


#3

Ah I see, thanks for the clarification! I hadn’t quite wrapped my mind around the distinction between a code point and its encoding as I don’t deal with such things much.

I had just been surprised because, according to http://www.fileformat.info/info/unicode/char/1f496/index.htm, C/C++/Java apparently use the UTF-16 encoding for their escape sequences, so I expected to see the UTF-8 encoding in Rust’s escape sequences.


#4

I think fileformat.info is wrong there, for C and C++. According to cppreference.com, the four or eight hex digits following a \u or \U escape in a C++ string are also a Unicode code point, just like Rust. Rust just uses {curly brackets} instead of a fixed number of digits. So the correct way to write “Sparkling Heart” in a C++ string is "\U0001f496", not the given "\uD83D\uDC96". This C++ playground seems to agree.

In Java and JavaScript, the story is different. Those languages explicitly use UTF-16 strings, so a character like Sparkling Heart needs to be written as a surrogate pair. That’s what fileformat.info is showing, I think.
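
For what it’s worth, the surrogate pair can be checked from Rust too. This is only a sketch: utf16_units is a helper name invented here, built on the standard char::encode_utf16 and String::from_utf16.

```rust
/// Hypothetical helper: returns the UTF-16 code units for one character.
fn utf16_units(c: char) -> Vec<u16> {
    // Characters outside the Basic Multilingual Plane need two code units.
    let mut buf = [0u16; 2];
    c.encode_utf16(&mut buf).to_vec()
}

fn main() {
    // Same code point as before, now encoded as UTF-16.
    let units = utf16_units('\u{1F496}');

    // These are the two values fileformat.info lists as \uD83D\uDC96.
    assert_eq!(units, vec![0xD83D, 0xDC96]);

    // Decoding the surrogate pair gives back the original character.
    let decoded = String::from_utf16(&units).unwrap();
    assert_eq!(decoded, "\u{1F496}");

    println!("UTF-16 code units: {:X?}", units);
}
```

The pair D83D DC96 is just the UTF-16 encoding of U+1F496, which is why it shows up in escapes for languages whose strings are sequences of UTF-16 code units.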