Why did the Rust team decide on an inconsistent approach to invalid UTF-8 encoded data?

philomathic_life · February 2, 2020, 9:08pm

First, my intuition tells me that this question must already have been asked somewhere; if so, and my Google-fu is deficient, I apologize. Also, while I am sure only individuals who worked on the language early on can answer this question definitively, I would appreciate conjectures about their line of thinking.

Anyway, why did the team decide to handle binary data that is invalid UTF-8 (e.g., a single surrogate code point) with a panic!, Option<T>, etc.—something I agree with—but allow invalid UTF-8 binary data (e.g., code point U+FFFF)—something I disagree with? I feel like char should have been even more restrictive than just Unicode scalar values.

Addendum

use std::char;

fn main() {
    match char::from_u32(0xdfff) {
        Some(x) => println!("{}", x.len_utf8()),
        None => println!("Invalid UTF-8."),
    };

    match char::from_u32(0xffff) {
        Some(x) => println!("{}", x.len_utf8()),
        None => println!("Invalid UTF-8."),
    }
}

To me the second match expression should panic! since U+FFFF is invalid UTF-8, and thus asking for its length in UTF-8 is nonsense. The first match expression makes sense since U+DFFF is not a Unicode scalar value and thus is not convertible to a char.

Edits above and in comments

Due to a poor choice of syntax by me, I edited the code points 0xFFFF and 0xDFFF to be U+FFFF and U+DFFF respectively in order to separate what I was trying to refer to as the code point value and not the encoding. I apologize for the confusion.

chrisd · February 2, 2020, 9:30pm

Are you talking about noncharacters? This Unicode FAQ might help. This answer seems particularly relevant:

Q: So how should libraries and tools handle noncharacters?

A: Library APIs, components, and tool applications (such as low-level text editors) which handle all Unicode strings should also handle noncharacters. Often this means simple pass-through, the same way such an API or tool would handle a reserved unassigned code point. Such APIs and tools would not normally be expected to interpret the semantics of noncharacters, precisely because the intended use of a noncharacter is internal. But an API or tool should also not arbitrarily filter out, convert, or otherwise discard the value of noncharacters, any more than they would do for private-use characters or reserved unassigned code points.

BurntSushi · February 2, 2020, 9:31pm

Can you express your question in terms of code? I'm not sure I understand your question otherwise. I suspect there is a misunderstanding somewhere, but I'm not quite sure where.

Yandros · February 2, 2020, 9:32pm

Could you provide an example of what you mean? chars are explicitly defined as representing Unicode Scalar Values, which include the code point 0xFFFF.

But 0xFFFF being a valid char does not imply that the byte sequence 0xff, 0xff is a valid UTF-8 sequence: Playground

philomathic_life · February 2, 2020, 9:46pm

I added code to my question and stated where I believe a panic! should occur. Namely, calling len_utf8 on an invalid UTF-8 char.

BurntSushi · February 2, 2020, 9:46pm

0xFFFF in your program is a codepoint, not UTF-8 encoded bytes. You literally cannot have an invalid char because it is always a Unicode scalar value. Internally, it is a 32-bit integer that corresponds to a codepoint value. char is not tied to any specific encoding.
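A minimal sketch of this point, using only the standard library (the helper name code_point is made up for illustration): a char stores the code point number itself, and a UTF-8 length is only computed when you ask for it.

```rust
use std::mem::size_of;

/// Hypothetical helper: returns the code point number stored in a `char`.
fn code_point(c: char) -> u32 {
    c as u32
}

fn main() {
    let c = '\u{FFFF}';

    // A `char` is always four bytes wide, no matter how many bytes
    // its UTF-8 encoding would need.
    println!("size_of::<char>() = {}", size_of::<char>()); // 4

    // The stored value is the code point, not any encoded form.
    println!("code point = U+{:04X}", code_point(c)); // U+FFFF

    // The UTF-8 length is derived from the code point on request.
    println!("UTF-8 length = {}", c.len_utf8()); // 3
}
```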

philomathic_life · February 2, 2020, 9:50pm

In the second match expression, I am asking for its length if it were encoded in UTF-8. To me the question is nonsense since it can not be encoded to UTF-8 to begin with; thus it should panic!.

BurntSushi · February 2, 2020, 9:53pm

Of course it can be encoded as UTF-8. U+FFFF is a Unicode scalar value. Therefore, it can be encoded as UTF-8. The only codepoints that cannot be encoded as UTF-8 are surrogate codepoints, and those correspond precisely to the range U+D800 to U+DFFF (inclusive).

(And even if it weren't a valid codepoint, then it wouldn't panic. It would just print "Invalid UTF-8.")
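One way to check that claim is to ask std::char::from_u32 about every value up to U+10FFFF and collect the ones it rejects. This is only a sketch; the function name rejected_code_points is an assumption introduced here, not something from the standard library.

```rust
/// Hypothetical helper: every value in 0..=0x10FFFF that
/// `std::char::from_u32` refuses to turn into a `char`.
fn rejected_code_points() -> Vec<u32> {
    (0..=0x10FFFFu32)
        .filter(|&cp| std::char::from_u32(cp).is_none())
        .collect()
}

fn main() {
    let rejected = rejected_code_points();

    // The rejected values are exactly the 2048 surrogates.
    println!(
        "{} values rejected, U+{:04X} through U+{:04X}",
        rejected.len(),
        rejected[0],
        rejected[rejected.len() - 1]
    );

    // The noncharacter U+FFFF is not among them.
    println!(
        "U+FFFF accepted: {}",
        std::char::from_u32(0xFFFF).is_some()
    );
}
```

Values above 0x10FFFF are rejected as well, but those are outside the Unicode code space altogether.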

philomathic_life · February 2, 2020, 9:56pm

The following code also runs without a panic!:

use std::char;

fn main() {
    let mut binary_data: [u8; 3] = [0; 3];

    match char::from_u32(0xffff) {
        Some(x) => println!("{}", x.encode_utf8(&mut binary_data)),
        None => println!("Invalid UTF-8."),
    }
}

BurntSushi · February 2, 2020, 9:57pm

As it should. Because the UTF-8 encoding of U+FFFF is \xEF\xBF\xBF.

Again, even if the given u32 wasn't a Unicode scalar value, then it wouldn't panic. It would just print "Invalid UTF-8." Every inhabitant of the char type corresponds to a Unicode scalar value, and thus, correspondingly has a UTF-8 encoding.

philomathic_life · February 2, 2020, 10:06pm

My question is about the underlying argument that led the team to decide to treat lone surrogate code points as invalid UTF-8 but not treat code points like U+FFFF as invalid too. According to FileFormat.Info the surrogate code point U+DFFF could be encoded in UTF-8 as \x3F; but since it is invalid UTF-8, it makes more sense to not treat it as UTF-8 at all. I don't see why U+FFFF is not treated the same as well.

According to Wikipedia, U+FFFF "must never appear in a valid UTF-8 sequence."

BurntSushi · February 2, 2020, 10:12pm

Because it's not a surrogate codepoint.

I kind of feel like we're going in circles here. U+DFFF is a surrogate codepoint. Therefore, by definition, it does not have a valid UTF-8 encoding. U+FFFF is not a surrogate codepoint. Therefore, by definition, it is a Unicode scalar value and has a UTF-8 encoding. Consider reviewing the Unicode FAQ about this: Glossary

No. You're making a category error. The input to char::from_u32 is a Unicode scalar value. This has nothing to do with encodings or UTF-8. Wikipedia, on the other hand, is talking specifically about UTF-8:

Which is correct. b"\xFF\xFF", for example, is not valid UTF-8:

fn main() {
    let bytes = b"\xFF\xFF";
    println!("{:?}", std::str::from_utf8(bytes))
}

Output:

Err(Utf8Error { valid_up_to: 0, error_len: Some(1) })

Playground: Rust Playground

Perhaps the important thing to stress here is that Unicode is not encoding. Unicode is "just" a mapping from characters to numbers. Separately, an encoding determines how those numbers translate back and forth between a byte-oriented representation.
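As a rough illustration of that separation, the sketch below takes one scalar value and produces two different encodings of it. The helper names utf8_bytes and utf16_units are made up for this example; encode_utf8 and encode_utf16 are the standard library methods on char.

```rust
/// Hypothetical helper: the UTF-8 bytes for one scalar value.
fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4]; // four bytes is the UTF-8 maximum
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

/// Hypothetical helper: the UTF-16 code units for the same scalar value.
fn utf16_units(c: char) -> Vec<u16> {
    let mut buf = [0u16; 2]; // two units is the UTF-16 maximum
    c.encode_utf16(&mut buf).to_vec()
}

fn main() {
    for &c in &['\u{FFFF}', '\u{1F600}'] {
        // One number (the code point), two byte-level representations.
        println!(
            "U+{:04X}: UTF-8 {:X?}, UTF-16 {:X?}",
            c as u32,
            utf8_bytes(c),
            utf16_units(c)
        );
    }
}
```

Note that the surrogate values D83D and DE00 show up only as UTF-16 code units for U+1F600; they never exist as char values.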

chrisd · February 2, 2020, 10:22pm

To clarify one point of confusion: the Unicode value U+FFFF is encoded in UTF-8 as EF BF BF. See this program, for example:

use std::char;

fn main() {
    let chr = char::from_u32(0xffff).unwrap();
    let utf8 = chr.to_string();
    println!("{:X?}", utf8.as_bytes());
}

Output:

[EF, BF, BF]

Playground

So 0xFFFF never appears in a Rust UTF-8 string because a char is a 32-bit Unicode scalar value, whereas a str is the UTF-8 encoding.

philomathic_life · February 2, 2020, 10:29pm

I know that Unicode is different than any encoding of Unicode, but I am confused as to why std::char::encode_utf8 does not panic! or why it does not return a Result<&mut str, Utf8Error> but std::str::from_utf8 behaves the way I think it should.

philomathic_life · February 2, 2020, 10:34pm

std::char::encode_utf8 successfully returns an instance of &mut str when you pass it the Unicode scalar value U+FFFF as a char and an instance of mut [u8; 3]. Isn't that a contradiction to your statement that "0xFFFF never appears in a Rust UTF-8 string"?

trentj · February 2, 2020, 10:38pm

Incidentally, this is incorrect: \x3F is just the character ? (QUESTION MARK). U+DFFF cannot be encoded into UTF-8 in a compliant program. I'm guessing that website has some process that automatically encodes characters and whatever it is is just replacing unencodables with ?.

If you were to encode U+DFFF as if it were not a surrogate, it would be encoded as a 3 byte sequence, like everything in the range U+0800-U+FFFF.
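To make that concrete, here is a sketch that applies the three-byte bit layout by hand, deliberately skipping the surrogate check a compliant encoder must perform. The function three_byte_pattern is a hypothetical helper written for this illustration, not a standard library API.

```rust
/// Hypothetical helper: applies the three-byte UTF-8 bit layout
/// (1110xxxx 10xxxxxx 10xxxxxx) to a value in 0x0800..=0xFFFF.
/// It does NOT reject surrogates, so its output is not always valid UTF-8.
fn three_byte_pattern(cp: u32) -> [u8; 3] {
    assert!((0x0800..=0xFFFF).contains(&cp));
    [
        0xE0 | (cp >> 12) as u8,         // top 4 bits
        0x80 | ((cp >> 6) & 0x3F) as u8, // middle 6 bits
        0x80 | (cp & 0x3F) as u8,        // low 6 bits
    ]
}

fn main() {
    // U+FFFF: the bytes match what char::encode_utf8 produces,
    // and the validator accepts them.
    let ok = three_byte_pattern(0xFFFF);
    println!("{:X?} -> {:?}", ok, std::str::from_utf8(&ok).is_ok());

    // U+DFFF: the bit layout still yields three bytes,
    // but the validator rejects them because they denote a surrogate.
    let bad = three_byte_pattern(0xDFFF);
    println!("{:X?} -> {:?}", bad, std::str::from_utf8(&bad).is_ok());
}
```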

philomathic_life · February 2, 2020, 10:40pm

That makes sense. Something appeared "off" with that claimed encoding.

chrisd · February 2, 2020, 10:41pm

Check the actual bytes that are returned (that's what the as_bytes method does). The char value 0xFFFF is encoded as the UTF-8 value 0xEFBFBF.

alice · February 2, 2020, 10:51pm

No, because there's a difference between a string containing the two-byte sequence FF FF and a string containing the three-byte sequence EF BF BF.

philomathic_life · February 2, 2020, 10:52pm

I am confused. I meant 0xFFFF as the 65535th (base-10) Unicode code point (indexed from 0), not the byte array \xFF\xFF. The fact that it gets encoded as \xEF\xBF\xBF is irrelevant.

To me the statement from Wikipedia that the 65535th code point "must never appear in a valid UTF-8 sequence" and the statement from Rust that "String slices are always valid UTF-8" are contradictions.
