We have String::from_utf8_lossy(),
which replaces any invalid UTF-8 sequence with �. I'm wondering why std::str::from_utf8_lossy()
doesn't exist.
UTF-8 is a variable-width encoding, so there's no guarantee that the replacement can be made without changing the size of the buffer, which is something that str
can't do.
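To make the size issue concrete, here is a small sketch showing that the lossy conversion can produce a longer string than the input buffer, since U+FFFD is three bytes in UTF-8:

```rust
fn main() {
    // b"a\xFFb" is 3 bytes, but the lossy conversion yields "a\u{FFFD}b",
    // which is 5 bytes: the replacement character alone is 3 bytes in UTF-8.
    let bytes = b"a\xFFb";
    let lossy = String::from_utf8_lossy(bytes);
    assert_eq!(bytes.len(), 3);
    assert_eq!(lossy.len(), 5);
    assert_eq!(lossy, "a\u{FFFD}b");
}
```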
That makes sense. Thank you.
The width problem could be solved by sticking to ASCII (replacing any invalid byte with ?,
or any other ASCII character). But another problem is that std::str::from_utf8()
doesn't have a buffer to modify — it returns a reference to an existing region of memory (a slice), unlike String::from_utf8_lossy(),
which allocates a new String when it needs to replace anything.
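The borrowing-vs-allocating distinction can be checked directly: on valid input, std::str::from_utf8() returns a &str pointing at the same memory, while String::from_utf8_lossy() returns a Cow that is only Owned when it had to replace something.

```rust
use std::borrow::Cow;

fn main() {
    let bytes: &[u8] = b"hello";
    // from_utf8 only validates; on success the returned &str points at the
    // same memory as the input slice, with no allocation.
    let s: &str = std::str::from_utf8(bytes).unwrap();
    assert_eq!(s.as_ptr(), bytes.as_ptr());

    // from_utf8_lossy borrows when the input is already valid...
    assert!(matches!(String::from_utf8_lossy(b"hello"), Cow::Borrowed(_)));
    // ...and allocates a fresh String when it has to substitute bytes.
    assert!(matches!(String::from_utf8_lossy(b"a\xFFb"), Cow::Owned(_)));
}
```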
There could in principle be a std::str::from_utf8_mut_lossy_overwriting()
, though.
Assuming you're ok with 1-4 ?
characters per broken UTF-8 code point sequence. Or arbitrarily many if it's just a run of 10xxxxxx
continuation bytes, I guess.
Not one that emits char::REPLACEMENT_CHARACTER
, though, since � is three bytes in UTF-8. Thus something like b"a\xFFb"
can't be lossied in place.
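For illustration, an in-place ASCII variant along the lines discussed above could be sketched like this. The name fix_utf8_in_place is made up, not a std API; it leans on Utf8Error's valid_up_to() and error_len() to find each invalid sequence and overwrites it byte-for-byte with ?, so the buffer length never changes:

```rust
// Hypothetical sketch: replace every byte of each invalid UTF-8 sequence
// with an ASCII '?', in place. Not part of std.
fn fix_utf8_in_place(bytes: &mut [u8]) -> &str {
    let mut i = 0;
    while i < bytes.len() {
        match std::str::from_utf8(&bytes[i..]) {
            Ok(_) => break, // the rest is valid UTF-8
            Err(e) => {
                // Skip past the valid prefix.
                i += e.valid_up_to();
                // error_len() is None for a truncated sequence at the end
                // of the buffer; overwrite through the end in that case.
                let bad_len = e.error_len().unwrap_or(bytes.len() - i);
                for b in &mut bytes[i..i + bad_len] {
                    *b = b'?';
                }
                i += bad_len;
            }
        }
    }
    // Every invalid byte is now ASCII '?', so this cannot fail.
    std::str::from_utf8(bytes).unwrap()
}

fn main() {
    let mut buf = *b"a\xFFb";
    assert_eq!(fix_utf8_in_place(&mut buf), "a?b");
}
```

Since the replacement is a single ASCII byte per invalid byte, the output occupies exactly the same buffer, which is what makes the &mut [u8] → &str signature workable.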