Why did the Rust team decide on an inconsistent approach to invalid UTF-8 encoded data?

philomathic_life · February 2, 2020, 10:54pm

I will edit my comments using the syntax used by @BurntSushi, namely U+FFFF instead of 0xFFFF.

alice · February 2, 2020, 10:55pm

Can you cite the exact passage on Wikipedia you are referring to? The phrase "must never appear in a valid UTF-8 sequence" only shows up once, and it does not appear relevant.

chrisd · February 2, 2020, 10:58pm

The table in the Wikipedia article is talking about "UTF-8 code units (individual bytes or octets)", not Unicode scalar values. So it means those bytes will never appear in UTF-8, not that those Unicode values will never appear.
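
To make the byte-versus-scalar-value distinction concrete, here is a minimal sketch using only the standard library. The helper names `utf8_bytes` and `never_in_utf8` are made up for illustration, and the byte ranges in `never_in_utf8` (0xC0, 0xC1, 0xF5 through 0xFF) are my reading of which code units can never occur in well-formed UTF-8.

```rust
/// Encode a single Unicode scalar value as UTF-8 and return its bytes.
/// (Hypothetical helper, not a standard-library function.)
fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

/// True if `b` is a byte value that can never occur in well-formed UTF-8.
/// (Hypothetical helper; the ranges are an assumption stated above.)
fn never_in_utf8(b: u8) -> bool {
    matches!(b, 0xC0 | 0xC1 | 0xF5..=0xFF)
}

fn main() {
    // U+00FF is a perfectly valid Unicode scalar value...
    let bytes = utf8_bytes('\u{FF}');
    // ...and its UTF-8 encoding is two bytes, neither of which is 0xFF.
    assert_eq!(bytes, [0xC3, 0xBF]);
    assert!(bytes.iter().all(|&b| !never_in_utf8(b)));

    // U+FFFF is also a valid scalar value; it encodes to three bytes.
    assert_eq!(utf8_bytes('\u{FFFF}'), [0xEF, 0xBF, 0xBF]);

    // The raw byte 0xFF, on the other hand, is never valid UTF-8.
    assert!(never_in_utf8(0xFF));
    assert!(std::str::from_utf8(&[0xFF]).is_err());

    // The two-byte encoding of U+00FF decodes fine.
    assert_eq!(std::str::from_utf8(&[0xC3, 0xBF]).unwrap(), "\u{FF}");
}
```

So U+00FF (a scalar value) and 0xFF (a code unit) are different things: the first is allowed in a `char` or `str`, the second is rejected by `std::str::from_utf8`.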

BurntSushi · February 2, 2020, 10:59pm

It doesn't say that. You are mixing up Unicode and its various encodings.

philomathic_life · February 2, 2020, 11:08pm

Wow! I am brain dead. You've been saying this the whole time, and I finally understand what you mean. I'm very sorry for wasting your time. It clearly states above the table that it's talking about "UTF-8 code units". Thank you for your patience.

philomathic_life · February 2, 2020, 11:11pm

Yes, it does indeed. I'm sorry about my brain fart.

