I want to call String::truncate, but it may panic, which is unacceptable because it introduces a DoS attack vector. So in my ad-hoc std lib, ystd, I implemented a lossy version of this function, which I am already using:
/// Like [String::truncate] but doesn't panic.
///
/// Somebody please optimize this implementation
fn truncated_lossy(mut self, new_len: usize) -> String {
    // SAFETY: We then copy basically the whole string, confirming it's all UTF-8
    unsafe { self.as_mut_vec() }.truncate(new_len);
    String::from_utf8_lossy(self.as_bytes()).into_owned()
}
I'm sure some better Rustaceans would love to spend a few minutes (or hours) thinking up the optimal solution, so I'm posting it here and will copy+paste+cargo release a new version of ystd when that happens.
the idea of "a lossy string truncation" is inherently problematic for UTF-8.
by convention, the term "truncate" implies the output is no longer than the input, but a "lossy" conversion of a Unicode string substitutes invalid byte sequences with U+FFFD, which takes 3 bytes in UTF-8 and so can actually increase the string length.
I don't know what your intention for this API is, but with your implementation you may get very surprising results (playground):
let s = String::from("£");
assert!(s.len() == 2);
let s = truncated_lossy(s, 1);
assert!(s.len() == 3);
I think the more practically useful operation is truncation to a length that has been rounded down to the nearest code point boundary, for which I would suggest a name like truncate_floor(), truncate_upper_bound(), or something similar.
for this, the implementation is trivial:
/// output is not longer than `new_len`
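fn truncate_floor(s: String, mut new_len: usize) -> String {
    if new_len >= s.len() {
        return s;
    }
    let mut bytes = s.into_bytes();
    // Walk back until `new_len` points at the first byte of a code point.
    // Index 0 always qualifies, so the subtraction cannot underflow.
    loop {
        let b = bytes[new_len];
        // Continuation bytes are 0b10xx_xxxx, i.e. in 128..=191.
        if b < 128 || b >= 192 {
            break;
        }
        new_len -= 1;
    }
    bytes.truncate(new_len);
    // Cutting at a code point boundary keeps the bytes valid UTF-8.
    String::from_utf8(bytes).unwrap()
}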
EDIT:
it's even simpler using str::is_char_boundary() as suggested by @Bruecki
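For reference, a minimal sketch of what the is_char_boundary() variant might look like. It keeps the truncate_floor name and signature from the version above; the exact code posted in the thread is not shown here, so treat this as one possible shape rather than the suggested implementation itself:

```rust
/// Truncates `s` to at most `new_len` bytes, rounding `new_len` down to the
/// nearest char boundary instead of panicking.
fn truncate_floor(mut s: String, mut new_len: usize) -> String {
    if new_len >= s.len() {
        return s;
    }
    // Index 0 is always a char boundary, so this loop terminates
    // before `new_len` can underflow.
    while !s.is_char_boundary(new_len) {
        new_len -= 1;
    }
    // `new_len` is now on a char boundary, so `truncate` cannot panic.
    s.truncate(new_len);
    s
}
```

With this, truncating "£" to 1 byte gives an empty string rather than a panic or a longer string, and no unsafe code or re-validation of the bytes is needed.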
Yes, this seems the best implementation, because it mirrors String::truncate in how it deals with the panic condition. I'm very glad I asked this question. There were so many interesting solutions, but this is the one I'll stick with, in both a &mut self and a self version.
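A rough sketch of how the two versions could sit side by side as an extension trait. The trait name TruncateFloor and the method names are hypothetical, chosen for illustration only; they are not taken from ystd:

```rust
/// Hypothetical extension trait; names are illustrative, not from ystd.
trait TruncateFloor {
    /// In-place version: shortens `self` to at most `new_len` bytes.
    fn truncate_floor(&mut self, new_len: usize);
    /// Consuming version: returns the shortened string.
    fn truncated_floor(self, new_len: usize) -> Self;
}

impl TruncateFloor for String {
    fn truncate_floor(&mut self, mut new_len: usize) {
        // Like String::truncate, do nothing if `new_len` is not shorter.
        if new_len >= self.len() {
            return;
        }
        // Round down to a char boundary instead of panicking.
        // Index 0 is always a boundary, so this cannot underflow.
        while !self.is_char_boundary(new_len) {
            new_len -= 1;
        }
        self.truncate(new_len);
    }

    fn truncated_floor(mut self, new_len: usize) -> Self {
        self.truncate_floor(new_len);
        self
    }
}
```

Implementing the consuming version on top of the in-place one keeps the boundary logic in a single place.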