More efficient implementation of `String.truncated_lossy`

I want to call String::truncate, but it panics when new_len does not lie on a char boundary, and that is unacceptable because it introduces a DoS attack vector. So in my ad-hoc std lib, ystd, I implemented a lossy version of this function, which I am already using:

	/// Like [String::truncate] but doesn't panic.
	/// 
	/// Somebody please optimize this implementation
	fn truncated_lossy(mut self, new_len: usize) -> String {
		// SAFETY: We then copy basically the whole string, confirming it's all UTF-8
		unsafe { self.as_mut_vec() }.truncate(new_len);
		String::from_utf8_lossy(self.as_bytes()).into_owned()
	}

Source here: YMap/ystd/src/string.rs at 6b8261119e918b2f63dade10b65889a28715912a · ActuallyHappening/YMap · GitHub

I'm sure some better Rustaceans would love to spend a few minutes (or hours) thinking up the optimal solution, so I'm posting it here and will copy+paste+cargo release a new version of ystd when that happens :slight_smile:

Why not just decrease new_len until str::is_char_boundary returns true?

Why use unsafe code?

#![feature(string_from_utf8_lossy_owned)]

pub trait StringExt {
    fn truncated_lossy(self, new_len: usize) -> Self;
}

impl StringExt for String {
    fn truncated_lossy(self, new_len: usize) -> Self {
        let mut bytes = self.into_bytes();
        bytes.truncate(new_len);
        Self::from_utf8_lossy_owned(bytes)
    }
}

fn main() {
    let s: String = "Hello ¥↑↑ World!".into();
    let truncated_lossy = s.truncated_lossy(7);
    println!("{truncated_lossy}");
}
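
On stable Rust, roughly the same behaviour seems achievable without the nightly feature. The following is only a sketch: the name `truncated_lossy_stable` is made up for illustration and is not part of ystd. It keeps the original allocation when the cut happens to land on a char boundary and only falls back to the copying lossy conversion otherwise.

```rust
/// Truncates to at most `new_len` bytes, then repairs a broken trailing
/// sequence with U+FFFD. Hypothetical helper name; stable Rust only.
fn truncated_lossy_stable(s: String, new_len: usize) -> String {
    let mut bytes = s.into_bytes();
    // `Vec::truncate` never panics; it is a no-op if `new_len >= len`.
    bytes.truncate(new_len);
    match String::from_utf8(bytes) {
        // The cut landed on a char boundary: reuse the allocation.
        Ok(valid) => valid,
        // The cut split a character: fall back to the lossy conversion,
        // which copies the bytes and substitutes U+FFFD.
        Err(err) => String::from_utf8_lossy(err.as_bytes()).into_owned(),
    }
}
```

Note that with the input from `main` above and `new_len = 7`, the cut falls inside the two-byte `¥`, so the result is `"Hello \u{FFFD}"`, which is 9 bytes long and therefore longer than the requested 7.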

the idea of "a lossy string truncation" is inherently problematic for utf8.

by convention, the term "truncate" implies the output is shorter (or equal) to the input, but a "lossy" conversion for unicode string involves substituting invalid code units with U+FFFD, which, in utf8 encoding, could actually increase the string length.

I don't know what your intention for this API is, but with your implementation, you may get very surprising results (playground):

let s = String::from("£");
assert!(s.len() == 2);
let s = truncated_lossy(s, 1);
assert!(s.len() == 3);

I think the more useful operation in practice is truncation to a length that has been rounded down to the nearest code point boundary, for which I would suggest a name like truncate_floor(), truncate_upper_bound(), or something similar.

for this, the implementation is trivial:

/// output is not longer than `new_len`
fn truncate_floor(s: String, mut new_len: usize) -> String {
    if new_len >= s.len() {
        return s;
    }
    let mut bytes = s.into_bytes();
    loop {
        let b = bytes[new_len];
        if b < 128 || b >= 192 {
            break;
        }
        new_len -= 1;
    }
    bytes.truncate(new_len);
    String::from_utf8(bytes).unwrap()
}

EDIT:

it's even simpler using str::is_char_boundary() as suggested by @Bruecki

fn truncate_floor(mut s: String, mut new_len: usize) -> String {
    new_len = usize::min(new_len, s.len());
    while !s.is_char_boundary(new_len) {
        new_len -= 1;
    }
    s.truncate(new_len);
    s
}

END of EDIT

fn truncate_floor(mut s: String, mut new_len: usize) -> String {
    new_len = usize::min(new_len, s.len());
    while !s.is_char_boundary(new_len) {
        new_len -= 1;
    }
    s.truncate(new_len);
    s
}

Yes, this seems like the best implementation, because it mirrors String::truncate in how it deals with the panic condition. I'm very glad I asked this question; there were so many interesting solutions, but this is the one I'll stick with, in both a &mut self and a self version.
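
For reference, the two variants mentioned here could look roughly like the sketch below. The trait name `TruncateFloor` and the method names are assumptions chosen for illustration, not what ystd actually ships.

```rust
/// Hypothetical extension trait; the names are illustrative, not ystd's API.
pub trait TruncateFloor {
    /// Shortens `self` in place to at most `new_len` bytes, rounding the
    /// cut down to the nearest char boundary. Never panics.
    fn truncate_floor(&mut self, new_len: usize);
    /// By-value variant, convenient in method chains.
    fn truncated_floor(self, new_len: usize) -> Self;
}

impl TruncateFloor for String {
    fn truncate_floor(&mut self, new_len: usize) {
        // Clamp first so a too-large `new_len` is a no-op, as with `String::truncate`.
        let mut end = new_len.min(self.len());
        // Index 0 is always a char boundary, so this loop cannot underflow.
        while !self.is_char_boundary(end) {
            end -= 1;
        }
        self.truncate(end);
    }

    fn truncated_floor(mut self, new_len: usize) -> Self {
        self.truncate_floor(new_len);
        self
    }
}
```

With this, `String::from("£").truncated_floor(1)` yields an empty string rather than panicking or growing to three bytes, and the result is never longer than `new_len`.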
