Unlocked thread to add important note: the version using core::ptr::copy + truncate has a bug and will sometimes panic on non-ASCII text.
The bug occurs because self.truncate(len) checks whether you're cutting on a UTF-8 character boundary. It does this by checking the byte at [len], which is outside the range you just copied, and so could fall in the middle of a multi-byte UTF-8 sequence. (The self.drain(..trim_start) solution doesn't have this issue; drain handles this stuff.)
For example, try running it on the string " aé" (playground), which is the byte sequence [20, 61, c3, a9]. The [c3, a9] is the 2-byte encoding of the code point U+00E9 (é). The trimmed length is 3 bytes, and core::ptr::copy turns the buffer into [61, c3, a9, a9]. The second 0xa9 at [3] is not a valid UTF-8 character on its own; it's a continuation byte. So truncate(3) will see it and panic.
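The byte shuffling above can be reproduced safely on a plain byte buffer. This is only a sketch: `copy_step` and `is_utf8_continuation` are hypothetical helper names, and `Vec::copy_within` stands in for core::ptr::copy so that no invalid String is ever created.

```rust
/// True if `b` is a UTF-8 continuation byte (bit pattern 10xxxxxx).
fn is_utf8_continuation(b: u8) -> bool {
    (b & 0xC0) == 0x80
}

/// Simulates the copy step of the buggy version on a Vec<u8>.
/// Returns the buffer after the copy, plus the trimmed length that
/// would then be passed to truncate. (Hypothetical helper for illustration.)
fn copy_step(input: &str) -> (Vec<u8>, usize) {
    let mut bytes = input.as_bytes().to_vec();
    let trimmed_len = input.trim_start().len();
    let trim_start = bytes.len() - trimmed_len;
    // Move the trimmed text to the front; the tail keeps its stale bytes.
    bytes.copy_within(trim_start.., 0);
    (bytes, trimmed_len)
}
```

For " aé", copy_step returns the buffer [0x61, 0xc3, 0xa9, 0xa9] and a trimmed length of 3, and is_utf8_continuation reports true for the stale byte at index 3. That stale byte is the one the boundary check ends up looking at.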
Using std::string::String
Firstly, the drain(..trim_start) solution is excellent, and does not have this problem. And it has no unsafe. It will be slower by a few nanoseconds, and if you are trimming many, many tiny strings, that may add up, since 2ns is 20% of 10ns. For big strings it will be negligible, because drain uses core::ptr::copy (AKA libc::memmove) and the bulk of either approach will be in that highly optimised routine.
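A minimal sketch of the drain approach, for reference. The function name `trim_start_drain` is made up here; the body uses only str::trim_start and String::drain.

```rust
/// Removes leading whitespace in place, with no unsafe code.
/// (Hypothetical name; a sketch of the drain(..trim_start) approach.)
fn trim_start_drain(s: &mut String) {
    // Number of leading bytes that trim_start() would skip.
    // This offset always lands on a char boundary.
    let trim_start = s.len() - s.trim_start().len();
    // Dropping the returned Drain iterator removes the range and
    // shifts the remaining bytes to the front.
    s.drain(..trim_start);
}
```

Because trim_start is always the start of a character, the boundary check inside drain cannot fail here, including on text like " aé".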
The absolute fastest way is to avoid truncate's UTF-8 boundary check entirely, using String::as_mut_vec + Vec::set_len. That's what the trim-in-place crate does, so you can just use that.
Here (playground) is a tiny version if you don't like dependencies. It is slightly different in that it returns orig_len - trimmed_len instead of self.as_str(). This enables easy checking of whether the trimming actually did anything (string.trim_start_in_place() != 0), and is a better choice than the original IMO because you already have very easy access to the trimmed string slice if you want it.
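In case the link is unavailable, here is a rough sketch of what such a tiny version might look like. It is not the linked code: the free function name `trim_start_set_len` is an assumption, and `Vec::copy_within` is used in place of a raw pointer copy.

```rust
/// Trims leading whitespace in place and returns the number of bytes
/// removed (orig_len - trimmed_len). Hypothetical name, sketch only.
fn trim_start_set_len(s: &mut String) -> usize {
    let orig_len = s.len();
    let trimmed_len = s.trim_start().len();
    let trim_start = orig_len - trimmed_len;
    if trim_start != 0 {
        // SAFETY: trim_start is a char boundary, so the bytes in
        // [trim_start..orig_len] are valid UTF-8. After the copy they
        // occupy [0..trimmed_len], and set_len shrinks the Vec to exactly
        // that range, so the String is valid UTF-8 again before the
        // borrow of the inner Vec ends. u8 has no destructor to run.
        unsafe {
            let v = s.as_mut_vec();
            v.copy_within(trim_start.., 0);
            v.set_len(trimmed_len);
        }
    }
    trim_start
}
```

Since the length is set on the Vec rather than through String::truncate, no char-boundary check runs and the stale trailing bytes are never inspected.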
What if I'm not using std::string::String?
If you are working with a possibly-inline string, like the smartstring crate or any of the similar ones out there, then you may not have String::as_mut_vec available. The reason std's String has as_mut_vec is that String is just a newtype wrapper for Vec<u8> that checks UTF-8 in some places. Some inline or sometimes-inline strings cannot give you such an API, and a short survey (smartstring no, inlinable_string no, smallstr yes, tinyvec_string yes and no, ascii no, bstr no) reveals most don't. All of them have a truncate API.
So the goal is to trick truncate into not panicking. In a technical sense, there could be up to 3 bytes of invalid UTF-8 adjacent to the final trimmed area. One way to correct the UTF-8 for certain is to blast zeroes from there up to the end of the string, but that's a bit of a waste. A better way is to write a single valid one-byte UTF-8 character (any ASCII byte) after the copied bytes, exactly at the spot truncate() is about to check. This is a bit of a hack, but truncate does assume the whole string is valid UTF-8, so it only ever needs to check whether you're on a char boundary, not whether the whole string is UTF-8. One byte is therefore sufficient.
The result looks like this (playground).
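As a rough sketch of the idea (not the linked code), here it is written against std's String so it can be tried without extra crates. The name `trim_start_truncate_hack` is made up, and str::as_bytes_mut stands in for whatever mutable byte access your string type offers. It relies on the assumption stated above, that truncate only inspects the byte at the new length.

```rust
/// Trims leading whitespace in place using only mutable byte access
/// plus truncate. Returns the number of bytes removed.
/// Hypothetical name; a sketch of the "write one ASCII byte" hack.
fn trim_start_truncate_hack(s: &mut String) -> usize {
    let orig_len = s.len();
    let trimmed_len = s.trim_start().len();
    let trim_start = orig_len - trimmed_len;
    if trim_start != 0 {
        // SAFETY (hack): after the copy, [0..trimmed_len] is valid UTF-8,
        // but the stale tail may not be. We overwrite the byte at
        // trimmed_len with an ASCII NUL so the char-boundary check in
        // truncate passes, then cut the tail off immediately.
        // trimmed_len < orig_len holds because trim_start != 0.
        unsafe {
            let bytes = s.as_bytes_mut();
            bytes.copy_within(trim_start.., 0);
            bytes[trimmed_len] = 0;
        }
        s.truncate(trimmed_len);
    }
    trim_start
}
```

Bytes after the one that was overwritten can still be stale continuation bytes for a moment, which is why this remains a hack: it is fine only as long as nothing reads the string between the copy and the truncate.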