Why did the Rust team decide on an inconsistent approach to invalid UTF-8 encoded data?

First, my intuition leads me to believe that this question had to have already been asked somewhere; so if it is correct and my Google-fu is deficient, I do apologize. Also, while I am sure only individuals that worked on the language early on can answer this question, I would appreciate conjectures to their line of thinking.

Anyway, why did the team decide to handle binary data that is invalid UTF-8 (e.g., a single surrogate code point) with a panic!, Option<T>, etc.—something I agree with—but allow invalid UTF-8 binary data (e.g., code point U+FFFF)—something I disagree with? I feel like char should have been even more restrictive than just Unicode scalar values.

Addendum

use std::char;

fn main() -> () {

    match char::from_u32(0xdfff) {
        Some(x) => println!("{}", x.len_utf8()),
        None => println!("Invalid UTF-8."),
    };

    match char::from_u32(0xffff) {
        Some(x) => println!("{}", x.len_utf8()),
        None => println!("Invalid UTF-8."),
    }
}

To me the second match expression should panic!, since U+FFFF is invalid UTF-8 and asking for its length in UTF-8 is therefore nonsense. The first match expression makes sense, since U+DFFF is not a Unicode scalar value and thus is not convertible to a char.

Edits above and in comments

Due to a poor choice of syntax by me, I edited the code points 0xFFFF and 0xDFFF to be U+FFFF and U+DFFF respectively in order to separate what I was trying to refer to as the code point value and not the encoding. I apologize for the confusion.

Are you talking about noncharacters? This Unicode FAQ might help. This answer seems particularly relevant:

Q: So how should libraries and tools handle noncharacters?

A: Library APIs, components, and tool applications (such as low-level text editors) which handle all Unicode strings should also handle noncharacters. Often this means simple pass-through, the same way such an API or tool would handle a reserved unassigned code point. Such APIs and tools would not normally be expected to interpret the semantics of noncharacters, precisely because the intended use of a noncharacter is internal. But an API or tool should also not arbitrarily filter out, convert, or otherwise discard the value of noncharacters, any more than they would do for private-use characters or reserved unassigned code points.
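
As a rough illustration of the pass-through behaviour the FAQ describes, here is a minimal sketch. The helper names `is_noncharacter` and `pass_through` are made up for this example and are not part of the standard library; the noncharacter set used is U+FDD0..=U+FDEF plus every code point whose low 16 bits are FFFE or FFFF.

```rust
/// Hypothetical helper: true for the 66 Unicode noncharacters
/// (U+FDD0..=U+FDEF, plus every code point ending in FFFE or FFFF).
fn is_noncharacter(c: char) -> bool {
    let cp = c as u32;
    (0xFDD0..=0xFDEF).contains(&cp) || (cp & 0xFFFE) == 0xFFFE
}

/// Hypothetical pass-through: copy the text without filtering anything out.
fn pass_through(input: &str) -> String {
    input.chars().collect()
}

fn main() {
    let text = "a\u{FFFF}b";
    let copy = pass_through(text);
    // The noncharacter survives the round trip unchanged.
    assert_eq!(copy, text);
    assert!(copy.chars().any(is_noncharacter));
    println!("{:X?}", copy.as_bytes()); // prints [61, EF, BF, BF, 62]
}
```

Rust's String and str behave this way already: they store noncharacters like any other scalar value and leave their interpretation to the application.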

Can you express your question in terms of code? I'm not sure I understand your question otherwise. I suspect there is a misunderstanding somewhere, but I'm not quite sure where.

Could you provide an example of what you mean? chars are explicitly defined as representing Unicode scalar values, which include the code point U+FFFF.

But U+FFFF being a valid char does not imply that the byte sequence 0xff, 0xff is valid UTF-8: Playground

I added code to my question and stated where I believe a panic! should occur. Namely, calling len_utf8 on an invalid UTF-8 char.

0xFFFF in your program is a codepoint, not UTF-8 encoded bytes. You literally cannot have an invalid char because it is always a Unicode scalar value. Internally, it is a 32-bit integer that corresponds to a codepoint value. char is not tied to any specific encoding.
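
A small sketch of that point, for what it's worth. The helper `code_point` is a name invented for this example; everything else is from the standard library.

```rust
use std::mem::size_of;

/// Hypothetical helper: the code point number stored in a char.
fn code_point(c: char) -> u32 {
    c as u32
}

fn main() {
    // A char always occupies four bytes, however long its encoded forms are.
    println!("size_of::<char>() = {}", size_of::<char>());

    let c = '\u{FFFF}';
    // The same char has different lengths depending on the encoding you ask about.
    println!(
        "U+{:04X}: {} UTF-8 bytes, {} UTF-16 code units",
        code_point(c),
        c.len_utf8(),
        c.len_utf16()
    );
}
```

The char itself carries no encoding; len_utf8 and len_utf16 only answer what the length would be if you encoded it.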

In the second match expression, I am asking for its length if it were encoded in UTF-8. To me the question is nonsense since it can not be encoded to UTF-8 to begin with; thus it should panic!.

Of course it can be encoded as UTF-8. U+FFFF is a Unicode scalar value. Therefore, it can be encoded as UTF-8. The only codepoints that cannot be encoded as UTF-8 are surrogate codepoints, and those correspond precisely to the range U+D800 to U+DFFF (inclusive).

(And even if it weren't a valid codepoint, then it wouldn't panic. It would just print "Invalid UTF-8.")
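
If it helps, the claim can be checked by brute force. This is only a sketch; `rejected_range` is a helper name invented here, and it scans every value up to U+10FFFF for ones that char::from_u32 refuses.

```rust
/// Hypothetical helper: find every u32 up to 0x10FFFF that `char::from_u32`
/// rejects, and return the first one, the last one, and how many there are.
fn rejected_range() -> (u32, u32, usize) {
    let rejected: Vec<u32> = (0..=0x10FFFFu32)
        .filter(|&cp| char::from_u32(cp).is_none())
        .collect();
    (rejected[0], rejected[rejected.len() - 1], rejected.len())
}

fn main() {
    let (first, last, count) = rejected_range();
    // prints U+D800..=U+DFFF (2048 values)
    println!("U+{:04X}..=U+{:04X} ({} values)", first, last, count);

    // U+FFFF lies outside that range, so it is a char with a 3-byte encoding.
    let c = char::from_u32(0xFFFF).expect("U+FFFF is a scalar value");
    println!("{}", c.len_utf8()); // prints 3
}
```

Values above 0x10FFFF are rejected as well, since they are not code points at all.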

The following code also runs without a panic!:

use std::char;

fn main() -> () {

    let mut binary_data: [u8; 3] = [0; 3];

    match char::from_u32(0xffff) {
        Some(x) => println!("{}", x.encode_utf8(&mut binary_data)),
        None => println!("Invalid UTF-8."),
    }
}

As it should. Because the UTF-8 encoding of U+FFFF is \xEF\xBF\xBF.

Again, even if the given u32 wasn't a Unicode scalar value, then it wouldn't panic. It would just print "Invalid UTF-8." Every inhabitant of the char type corresponds to a Unicode scalar value, and thus, correspondingly has a UTF-8 encoding.

My question is about the underlying argument that led the team to decide to treat lone surrogate code points as invalid UTF-8 but not treat code points like U+FFFF as invalid too. According to FileFormat.Info the surrogate code point U+DFFF could be encoded in UTF-8 as \x3F; but since it is invalid UTF-8, it makes more sense to not treat it as UTF-8 at all. I don't see why U+FFFF is not treated the same as well.

According to Wikipedia, U+FFFF "must never appear in a valid UTF-8 sequence."

Because it's not a surrogate codepoint.

I kind of feel like we're going in circles here. U+DFFF is a surrogate codepoint. Therefore, by definition, it does not have a valid UTF-8 encoding. U+FFFF is not a surrogate codepoint. Therefore, by definition, it is a Unicode scalar value and has a UTF-8 encoding. Consider reviewing the Unicode FAQ about this: Glossary

No. You're making a category error. The input to char::from_u32 is a Unicode scalar value. This has nothing to do with encodings or UTF-8. Wikipedia, on the other hand, is talking specifically about UTF-8:

Which is correct. b"\xFF\xFF", for example, is not valid UTF-8:

fn main() {
    let bytes = b"\xFF\xFF";
    println!("{:?}", std::str::from_utf8(bytes))
}

Output:

Err(Utf8Error { valid_up_to: 0, error_len: Some(1) })

Playground: Rust Playground

Perhaps the important thing to stress here is that Unicode is not encoding. Unicode is "just" a mapping from characters to numbers. Separately, an encoding determines how those numbers translate back and forth between a byte-oriented representation.
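
To make the distinction concrete, here is a minimal sketch that takes one scalar value and produces two different encodings of it. The function name `encodings` is invented for the example; encode_utf8 and encode_utf16 are the standard char methods.

```rust
/// Hypothetical helper: the UTF-8 bytes and UTF-16 code units of one scalar value.
fn encodings(c: char) -> (Vec<u8>, Vec<u16>) {
    let mut utf8_buf = [0u8; 4];
    let mut utf16_buf = [0u16; 2];
    let utf8 = c.encode_utf8(&mut utf8_buf).as_bytes().to_vec();
    let utf16 = c.encode_utf16(&mut utf16_buf).to_vec();
    (utf8, utf16)
}

fn main() {
    for &c in ['A', '\u{FFFF}', '\u{1F600}'].iter() {
        let (utf8, utf16) = encodings(c);
        // Same number (the scalar value), two unrelated-looking representations.
        println!("U+{:04X}: utf8={:X?} utf16={:X?}", c as u32, utf8, utf16);
    }
}
```

The number U+FFFF is the same in both cases; only the bytes or code units that represent it differ.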

To clarify one point of confusion: the Unicode scalar value U+FFFF is encoded in UTF-8 as EF BF BF. See this program, for example:

use std::char;

fn main ()
{
    let chr = char::from_u32(0xffff).unwrap();
    let utf8 = chr.to_string();
    println!("{:X?}", utf8.as_bytes());
}

Output:

[EF, BF, BF]

Playground

So 0xFFFF never appears in a Rust UTF-8 string because a char is a 32-bit Unicode scalar value, whereas a str is the UTF-8 encoding.

I know that Unicode is different than any encoding of Unicode, but I am confused as to why std::char::encode_utf8 does not panic! or why it does not return a Result<&mut str, Utf8Error> but std::str::from_utf8 behaves the way I think it should.

std::char::encode_utf8 successfully returns a &mut str when you pass it the Unicode scalar value U+FFFF as a char and a mutable reference to a [u8; 3] buffer. Isn't that a contradiction to your statement that "0xFFFF never appears in a Rust UTF-8 string"?

Incidentally, this is incorrect: \x3F is just the character ? (QUESTION MARK). U+DFFF cannot be encoded into UTF-8 in a compliant program. I'm guessing that website has some process that automatically encodes characters and whatever it is is just replacing unencodables with ?.

If you were to encode U+DFFF as if it were not a surrogate, it would be encoded as a 3 byte sequence, like everything in the range U+0800-U+FFFF.
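
Here is a sketch of what that would look like. The function `naive_three_byte` is made up for this example and deliberately skips the surrogate check that a compliant encoder must perform; it only applies the generic 3-byte bit layout.

```rust
/// Hypothetical, deliberately non-compliant encoder: applies the 3-byte UTF-8
/// bit layout to any value in 0x0800..=0xFFFF without rejecting surrogates.
fn naive_three_byte(cp: u32) -> [u8; 3] {
    assert!((0x0800..=0xFFFF).contains(&cp));
    [
        0xE0 | (cp >> 12) as u8,         // 1110xxxx: top 4 bits
        0x80 | ((cp >> 6) & 0x3F) as u8, // 10xxxxxx: middle 6 bits
        0x80 | (cp & 0x3F) as u8,        // 10xxxxxx: low 6 bits
    ]
}

fn main() {
    let surrogate = naive_three_byte(0xDFFF);
    let nonchar = naive_three_byte(0xFFFF);
    // ED BF BF is rejected by the standard library's validator...
    println!("{:X?} valid: {}", surrogate, std::str::from_utf8(&surrogate).is_ok());
    // ...while EF BF BF is accepted as the encoding of U+FFFF.
    println!("{:X?} valid: {}", nonchar, std::str::from_utf8(&nonchar).is_ok());
}
```

So the bit layout alone would produce ED BF BF for U+DFFF, but std::str::from_utf8 refuses those bytes, whereas the bytes for U+FFFF pass validation.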

That makes sense. Something appeared "off" with that claimed encoding.

Check the actual bytes that are returned (that's what the as_bytes method does). The char value 0xFFFF is encoded as the UTF-8 value 0xEFBFBF.

No, because there's a difference between a string containing the two-byte sequence FF FF and a string containing the three-byte sequence EF BF BF.

I am confused. I meant 0xFFFF as the 65535th (base-10) Unicode code point (indexed from 0), not the byte array \xFF\xFF. The fact that it gets encoded as \xEF\xBF\xBF is irrelevant.

To me the statement from Wikipedia that the 65535th code point "must never appear in a valid UTF-8 sequence" and the statement from Rust that "String slices are always valid UTF-8" are contradictions.