Is there any way to get a &mut str Split from str::split?

Hi all,

I'm trying leetcode #5571: reverse the words in a string. The basic idea is s.split(' ').map(...).collect().join(" "), and that works well.

But I'm wondering why &mut str lacks a split_mut() method that simply returns an Iterator<Item=&mut str>.

I noticed there is str::split_at_mut(), as well as slice::split_mut(). So why is there no str::split_mut()?

In the end, is there any way to get an iterator of mutable string slices when splitting?

&mut str is pretty rare because it's a very hard type to do anything useful with -- UTF-8 (and unicode segmentations in general) being variable-width means it's usually not possible to replace one substring with another, for example.
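For the narrow case of in-place edits that keep every byte length the same (ASCII case changes, for example), a split_mut-like traversal can be sketched in safe code on top of str::split_at_mut. This is only a sketch: for_each_word_mut is a hypothetical helper, not a standard library function, and it assumes words are separated by single ASCII spaces.

```rust
/// Hypothetical helper (not part of std): calls `f` on each space-separated
/// piece of `rest` as a `&mut str`, using only safe code.
fn for_each_word_mut(mut rest: &mut str, mut f: impl FnMut(&mut str)) {
    loop {
        match rest.find(' ') {
            Some(i) => {
                // Move the reference out so both halves keep the full lifetime.
                let current = rest;
                let (word, tail) = current.split_at_mut(i);
                f(word);
                // Skip the one-byte ASCII space that separated the words.
                rest = &mut tail[1..];
            }
            None => {
                // No separator left: the remainder is the last word.
                f(rest);
                break;
            }
        }
    }
}
```

Because the closure only gets a &mut str, it is still limited to same-length edits such as make_ascii_uppercase, which is the limitation described above.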

Often for leetcode-style problems -- that tend to expect you to do things "wrong" from a full-unicode perspective -- you want to just use the bytes and assume they're ASCII.

P.S. I think you're missing a .rev() in the basic idea.


Thanks for the reply!

The .rev() call is inside map(...); I just didn't write it out in full.

Thanks for the hint! I have finished the problem by casting the String to [u8]:

impl Solution {
    pub fn reverse_words(mut s: String) -> String {
        unsafe {s.as_bytes_mut()}
        .split_mut(|&ch| ch == 32)
        .for_each(|s| s.reverse());
        s
    }
}

Instead of using unsafe { s.as_bytes_mut() }, I would recommend using into_bytes to convert the String into a Vec<u8> altogether, so you don't have to worry about invoking undefined behavior by putting invalid characters in there. In the end, from_utf8 can be used to convert the Vec<u8> back to a String, which also gives you a chance to handle invalid UTF8 if necessary.
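A minimal sketch of that suggestion, assuming words are separated by single ASCII spaces. The function name reverse_words_checked is made up for illustration; into_bytes, split_mut, and String::from_utf8 are the standard library calls.

```rust
use std::string::FromUtf8Error;

/// Hypothetical name: reverses the bytes of each space-separated word and
/// reports an error instead of producing an invalid String.
fn reverse_words_checked(s: String) -> Result<String, FromUtf8Error> {
    // Take ownership of the underlying buffer; no unsafe needed.
    let mut bytes = s.into_bytes();
    bytes
        .split_mut(|&b| b == b' ')
        .for_each(|word| word.reverse());
    // Validation happens here, so non-ASCII input yields Err rather than UB.
    String::from_utf8(bytes)
}
```

With ASCII-only input this always returns Ok; with multi-byte characters the caller gets an Err to handle.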

No no no, it's UB. Please don't do that.

Let's start with WHY it's bad. Try running your code with some real-world input.

It crashes with the message below. What happened?

Execution operation failed: Output was not valid UTF-8: invalid utf-8 sequence of 1 bytes from index 0

String slices are always valid UTF-8. UTF-8 is a variable-width character encoding, meaning each code point is represented by one or more bytes in the encoded text. For example, the string "안" is represented by the three bytes [236, 149, 136], and reversing those bytes produces an invalid UTF-8 sequence.
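This can be checked in safe code. The sketch below copies the bytes, reverses them, and asks std::str::from_utf8 whether the result is still valid; the helper name is made up for illustration.

```rust
/// Hypothetical helper: true if the byte-reversed form of `s` is still valid UTF-8.
fn reversed_bytes_are_valid_utf8(s: &str) -> bool {
    // Work on a copy so the original string is never put in an invalid state.
    let mut bytes = s.as_bytes().to_vec();
    bytes.reverse();
    std::str::from_utf8(&bytes).is_ok()
}
```

For "안" the reversed sequence starts with 136, a continuation byte, so validation fails; for pure ASCII input it succeeds.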

Remember the guarantee that string slices are always valid UTF-8? In safe Rust, all such guarantees are enforced by the compiler. But inside an unsafe {} block, it is your responsibility to uphold every guarantee defined by the language and the libraries. Otherwise it's UB, which means that at best you observe a crash, and at worst your memory gets silently corrupted so that totally unrelated parts of your code behave incorrectly.

In conclusion, try your absolute best to avoid writing any unsafe {} block by hand. Its main purpose is to build safe abstractions out of low-level building blocks in heavily audited codebases like the stdlib, so everyone can safely use types like Vec<T> and HashMap<K, V>. Sometimes you may need to write some yourself, for example when interacting with C through FFI. In that case, try your best to write your logic in totally safe Rust and minimize the amount of unsafe code.

Bonus: this is a totally safe and correct version of your function. Note that this code only reverses code points between whitespace, so multi-codepoint characters like this emoji 👨‍👩‍👧‍👧 produce some weird results. But that's a problem with the leetcode question itself. Blame leetcode for serving pre-Unicode-era questions!
impl Solution {
    pub fn reverse_words(s: String) -> String {
        // Split on single spaces so join(" ") can restore the separators.
        s.split(' ')
            .map(|word| word.chars().rev().collect::<String>())
            .collect::<Vec<_>>()
            .join(" ")
    }
}

Ah, I thought this was "reverse the order of the words", which is a better question as it has more interesting implementations -- the canonical solution being to reverse the words then reverse the whole string, which as a bonus keeps the code units inside the words in the correct order, avoiding the problems that Hyeonu mentioned.
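A rough sketch of that approach for the "reverse the order of the words" variant, assuming single ASCII spaces as separators. The function name is made up for illustration. Every multi-byte sequence is reversed twice, so it ends up in its original byte order and the final from_utf8 check is expected to succeed.

```rust
/// Hypothetical name: reverses the order of space-separated words in place.
fn reverse_word_order(s: String) -> String {
    let mut bytes = s.into_bytes();
    // Step 1: reverse the bytes inside each word.
    bytes
        .split_mut(|&b| b == b' ')
        .for_each(|word| word.reverse());
    // Step 2: reverse the whole buffer, which restores each word's byte order.
    bytes.reverse();
    // The byte 0x20 never appears inside a multi-byte sequence, so words are
    // whole code points and the result should be valid UTF-8.
    String::from_utf8(bytes).expect("double reversal keeps each word's bytes in order")
}
```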
