Why doesn't std::str::from_utf8_lossy() exist?

We have String::from_utf8_lossy(), which replaces any invalid UTF-8 sequence with �. I'm wondering why std::str::from_utf8_lossy() doesn't exist.

UTF-8 is a variable-width encoding, so there's no guarantee that the replacement can be made without changing the size of the buffer, which is something a str can't do.
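A quick demonstration of the size problem (the byte literal here is just an example): a single invalid byte gets replaced by U+FFFD, which encodes as three UTF-8 bytes, so the lossy result is longer than the input.

```rust
fn main() {
    let bytes = b"abc\xFF"; // 4 bytes; the last one is invalid UTF-8
    let lossy = String::from_utf8_lossy(bytes);
    assert_eq!(lossy, "abc\u{FFFD}");
    // 4 bytes in, 6 bytes out: the replacement grew the data, which a
    // `str` view into the original buffer could never express.
    assert_eq!(lossy.len(), 6);
}
```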


That makes sense. Thank you.

The width problem could be solved by sticking to ASCII (replacing any invalid byte with ? or some other ASCII character). But there's another problem: std::str::from_utf8() doesn't have a buffer to modify. It returns a reference to an existing region of memory (a slice), unlike String::from_utf8_lossy(), which returns a Cow<str> and allocates a new String whenever a replacement is needed.

There could in principle be a std::str::from_utf8_mut_lossy_overwriting(), though.
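A minimal sketch of what that hypothetical function could look like, built only on the existing std::str::from_utf8/from_utf8_mut APIs and the Utf8Error::valid_up_to()/error_len() information (the name and the ? policy are just the ones proposed above, nothing in std):

```rust
// Hypothetical: overwrite every byte of each invalid UTF-8 sequence with
// b'?', in place, then hand the now-valid buffer back as a &mut str.
fn from_utf8_mut_lossy_overwriting(buf: &mut [u8]) -> &mut str {
    let mut pos = 0;
    while pos < buf.len() {
        match std::str::from_utf8(&buf[pos..]) {
            // The rest of the buffer is already valid UTF-8.
            Ok(_) => break,
            Err(e) => {
                // Skip the valid prefix, then blank out the invalid
                // sequence. `error_len()` is `None` only for an incomplete
                // sequence at the end of the input, so in that case
                // overwrite everything that remains.
                pos += e.valid_up_to();
                let bad = e.error_len().unwrap_or(buf.len() - pos);
                for b in &mut buf[pos..pos + bad] {
                    *b = b'?';
                }
                pos += bad;
            }
        }
    }
    // Every invalid byte is now ASCII `?`, so this cannot fail.
    std::str::from_utf8_mut(buf).unwrap()
}

fn main() {
    let mut data = *b"a\xFFb";
    assert_eq!(from_utf8_mut_lossy_overwriting(&mut data), "a?b");
}
```

Because ? is a single byte, the buffer length never changes, which is exactly what makes the in-place version possible.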


Assuming you're OK with 1-4 ? characters per broken UTF-8 code point sequence. Or arbitrarily many if it's just a run of 0b10xxxxxx continuation bytes, I guess.

Not one that emits char::REPLACEMENT_CHARACTER, though, since � is three bytes of UTF-8. Thus something like b"a\xFFb" can't be converted lossily in place.
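Checking the byte counts on that example:

```rust
fn main() {
    // U+FFFD needs three bytes in UTF-8...
    assert_eq!(char::REPLACEMENT_CHARACTER.len_utf8(), 3);
    // ...so the 3-byte input would need a 5-byte result (1 + 3 + 1),
    // which can't fit back into the original buffer.
    assert_eq!(String::from_utf8_lossy(b"a\xFFb").len(), 5);
}
```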
