The nomicon tells me this is UB, but what's wrong with it?

I'm doing a leetcode problem that asks me to reverse each word in a string, while keeping the order of the words the same. For example, turn "abc def" into "cba fed". It is guaranteed that the input will be ascii, so I don't need to worry about multi-byte characters. They want constant space and linear time, so doing the reversal using character iterators that I then collect into a new string is out of the question. Here's the first solution I came up with:

impl Solution {
    pub fn reverse_words(mut s: String) -> String {
        let bytes = unsafe { s.as_mut_vec() };
        let mut i=0;
        while i < bytes.len() {
            let mut j = i;
            while j < bytes.len() && bytes[j] != b' ' {
                j += 1;
            }
            let mut a = i;
            let mut b = j - 1;
            while a < b {
                bytes.swap(a, b);
                a += 1;
                b -= 1;
            }
            i = j + 1;
        }
        s
    }
}

I'm quite confident that I used unsafe responsibly there, but I wanted to see if I could do it differently, using more library functions. Here's my second attempt, which also passes the tests, even though I know it's an abomination that should never have been run. I'm also aware that it passing the tests does not necessarily indicate that it didn't trigger UB, because I may have just gotten lucky.

use std::mem::transmute;
#[allow(mutable_transmutes)]
impl Solution {
    pub fn reverse_words(mut s: String) -> String {
        unsafe {
            s
                .split(' ')
                .map(|x| transmute::<&str, &mut str>(x))
                .for_each(|x| x.as_bytes_mut().reverse());
        }
        s
    }
}

I have read the appropriate section of the nomicon telling me that doing something like this is always undefined behaviour, and that no I can't do it, and that no I'm not special. The thing that confuses me is why, even in a situation like this, it's UB. The way split is implemented makes sure that it'll never try to look at each slice after it yields it, and it wouldn't be surprising if the remaining slice held by the Split iterator doesn't even overlap the &str I'm transmuting. What kind of compiler assumptions could possibly be made here that would make this do anything other than what I want it to?

While some &T is active you can't mutate size_of::<T>() bytes since the address unless mutated bytes are wrapped by UnsafeCell at some layer. Since the unsafe block takes &str from s and doesn't touch any other environment, and the str doesn't contains any UnsafeCell, the compiler can assume that the entire unsafe block, so as the whole function, is no-op. Nothing faster than doing nothing so this optimization would be practically desirable.

4 Likes

If I'm understanding correctly, you're saying the compiler could decide to optimize away the whole function because I never take a mutable reference to s through legal means, and the function can't affect anything else. That makes sense, but I can't find a way to fix it with UnsafeCell because the constructor expects owned data, not a shared reference. Is there a way to solve this problem while still using String::split?

No it's not possible. But you can use s.as_mut_vec().split_mut(|c| c == b' ') instead if you can guarantee it's ascii text.

2 Likes

Thanks! I feel a little silly for only looking at the String and str docs and not noticing that split_mut exists for slices :sweat_smile:

You shouldn't need to use unsafe for such a simple problem. I don't understand why you want to keep around the String even though you are explicitly trying to violate its guarantees (i.e. that it's valid UTF-8). Just convert it to its inner vector-of-bytes by value and convert it back at the end if you are convinced it's still valid UTF-8 after this operation.

There is also no need to code up the reversing part yourself, there is a std method on slices for that purpose: Playground.

impl Solution {
    pub fn reverse_words(s: String) -> String {
        let mut bytes = s.into_bytes();
        let mut next_word = 0;

        while next_word < bytes.len() {
            let word_end = bytes[next_word..]
                .iter()
                .position(|&b| b == b' ')
                .map_or_else(|| bytes.len(), |len| next_word + len);

            bytes[next_word..word_end].reverse();
            next_word = word_end + 1;
        }

        String::from_utf8(bytes).expect("valid UTF-8")
    }
}

Edit: thanks to @Hyeonu for reminding us that there is split_mut(): Playground

pub fn reverse_words(s: String) -> String {
    let mut bytes = s.into_bytes();
    bytes.split_mut(|&b| b == b' ').for_each(<[_]>::reverse);
    String::from_utf8(bytes).expect("valid UTF-8")
}
6 Likes

Yeah, I completely forgot that converting a Vec<u8> to a String is an O(n) operation that doesn't involve a reallocation. I remember now that there's also the O(1) unsafe variant, although that doesn't matter here because the algorithm is already O(n). As for the manual reversing in the first snippet, I did it for practice - I doubt an interviewer would be particularly thrilled if my solution to a reversing algorithm question involved any calls to a method called reverse.

Well, personally, I would regard this as an exercise in being able to compose adequate features from the standard library. I would be wary of any interviewer who wants me to code up something such trivial manually, in anything but an intern/junior context.

It depends what they're looking for.

The standard library's reverse is far more optimized than something you're going to put on a whiteboard. Even using unsafe, the "obvious" solution isn't that great, while the standard library's version uses SIMD to be substantially faster: https://rust.godbolt.org/z/13cn4xjjj

2 Likes

Yeah, standard library code is always faster than whatever I would come up with. I think the focus is more meant to be on my problem solving ability, whether I can communicate my strategy to the interviewer, and whether I can implement it without bugs. For example, they don't ask linked list questions because you'll use it on the job, they ask them because they want to see if you'll get tripped up when there are pointers everywhere. It wouldn't surprise me if I get a question that could literally be solved by just calling dedup, but of course they ask it to make sure you understand how to do it manually in O(n), rather than the naive O(n²).

If I were an interviewer here, I would consider an answer using "reverse" better than an answer doing "manual byte fiddling". There isn't a std way to solve the full problem, and I would think beeing able to break a problem down into solved pieces and getting the "glue" right is a more important skill than solving the whole puzzle by hand.

For the same reason, I would probably prefer a solution using u8::is_ascii_whitespace over a solution using b == b' ', while beeing more impressed still over a candidate actually "researching the problem boundaries" by asking me what to about any numbers in or between words (i.e. "should I use !b.is_ascii_alphabetic() or !b.is_ascii_alphanumeric()?").

6 Likes

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.