We have String::from_utf8_lossy(),
which replaces any invalid UTF-8 sequence with �. I'm wondering why std::str::from_utf8_lossy()
doesn't exist.
UTF-8 is a variable-width encoding, so there's no guarantee that the replacement can be made without changing the size of the buffer, which is something that str
can't do.
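To make the size issue concrete, here is a small sketch showing that the lossy conversion can produce a longer string than the input buffer, since U+FFFD is three bytes in UTF-8:

```rust
fn main() {
    // b"a\xFFb" is 3 bytes, but the lossy conversion yields "a\u{FFFD}b",
    // which is 5 bytes: the replacement character alone is 3 bytes in UTF-8.
    let bytes = b"a\xFFb";
    let lossy = String::from_utf8_lossy(bytes);
    assert_eq!(bytes.len(), 3);
    assert_eq!(lossy.len(), 5);
    assert_eq!(lossy, "a\u{FFFD}b");
}
```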
That makes sense. Thank you.
The width problem could be solved by sticking to ASCII (replacing any invalid byte with ?,
or any other ASCII character). But another problem is that std::str::from_utf8()
doesn't have a buffer to modify — it returns a reference to an existing region of memory (a slice), unlike String::from_utf8_lossy(),
which allocates a new String when it needs to replace anything.
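The borrowing-vs-allocating distinction can be checked directly: on valid input, std::str::from_utf8() returns a &str pointing at the same memory, while String::from_utf8_lossy() returns a Cow that is only Owned when it had to replace something.

```rust
use std::borrow::Cow;

fn main() {
    let bytes: &[u8] = b"hello";
    // from_utf8 only validates; on success the returned &str points at the
    // same memory as the input slice, with no allocation.
    let s: &str = std::str::from_utf8(bytes).unwrap();
    assert_eq!(s.as_ptr(), bytes.as_ptr());

    // from_utf8_lossy borrows when the input is already valid...
    assert!(matches!(String::from_utf8_lossy(b"hello"), Cow::Borrowed(_)));
    // ...and allocates a fresh String when it has to substitute bytes.
    assert!(matches!(String::from_utf8_lossy(b"a\xFFb"), Cow::Owned(_)));
}
```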
There could in principle be a std::str::from_utf8_mut_lossy_overwriting()
, though.
Assuming you're ok with 1-4 ?
characters per broken UTF-8 code point sequence. Or arbitrarily many if it's just a run of 10xxxxxx
continuation bytes, I guess.
Not one that emits char::REPLACEMENT_CHARACTER
, though, since � is three bytes in UTF-8. Thus something like b"a\xFFb"
can't be lossied in place.
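For illustration, an in-place ASCII variant along the lines discussed above could be sketched like this. The name fix_utf8_in_place is made up, not a std API; it leans on Utf8Error's valid_up_to() and error_len() to find each invalid sequence and overwrites it byte-for-byte with ?, so the buffer length never changes:

```rust
// Hypothetical sketch: replace every byte of each invalid UTF-8 sequence
// with an ASCII '?', in place. Not part of std.
fn fix_utf8_in_place(bytes: &mut [u8]) -> &str {
    let mut i = 0;
    while i < bytes.len() {
        match std::str::from_utf8(&bytes[i..]) {
            Ok(_) => break, // the rest is valid UTF-8
            Err(e) => {
                // Skip past the valid prefix.
                i += e.valid_up_to();
                // error_len() is None for a truncated sequence at the end
                // of the buffer; overwrite through the end in that case.
                let bad_len = e.error_len().unwrap_or(bytes.len() - i);
                for b in &mut bytes[i..i + bad_len] {
                    *b = b'?';
                }
                i += bad_len;
            }
        }
    }
    // Every invalid byte is now ASCII '?', so this cannot fail.
    std::str::from_utf8(bytes).unwrap()
}

fn main() {
    let mut buf = *b"a\xFFb";
    assert_eq!(fix_utf8_in_place(&mut buf), "a?b");
}
```

Since the replacement is a single ASCII byte per invalid byte, the output occupies exactly the same buffer, which is what makes the &mut [u8] → &str signature workable.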