How to join string slices

I have a program that tokenizes an input. In the first step, the input is split into many small pieces. In the second step, some of these pieces have to be reassembled again, if they're adjacent. So I need a function like this:

fn try_join<'a>(s1: &'a str, s2: &'a str) -> Option<&'a str> {
    let s1 = s1.as_bytes().as_ptr_range();
    let s2 = s2.as_bytes().as_ptr_range();

    if s1.end == s2.start {
        let len = s2.end as usize - s1.start as usize;

        Some(unsafe {
            std::mem::transmute(std::slice::from_raw_parts(s1.start, len))
        })
    } else {
        None
    }
}

Is this correct? If not, is there a better alternative (that ideally doesn't require unstable features)?

The logic looks correct. You can also avoid using transmute by using a couple of other functions.

However, according to the unsafe code guidelines discussions, as far as I understand them, this can't be a safe function, because joining &strs together like this:

  • Is safe if they were originally joined, i.e. they come from the same string
  • Is UB if the two string parts only happen to be next to each other in memory (this could happen with two adjacent stack-allocated variables).

Previous discussion: "MendSlice *might* be UB", Issue #25 in bluss/odds on GitHub (second time I reference crate odds today for some reason).

It might be possible to use a wrapper type with branding/generativity to ensure the string parts come from the same original string without any runtime cost, but it's convoluted.

This is undefined behavior due to this kind of thing:

fn main() {
    let b1 = [b'a'];
    let b2 = [b'b'];
    
    let slice1 = std::str::from_utf8(&b1).unwrap();
    let slice2 = std::str::from_utf8(&b2).unwrap();
    println!("{}", try_join(slice1, slice2).unwrap());
}
Output: ab

This violates the pointer provenance rules.

Thanks, that's what I assumed. The strings are guaranteed to come from the same string in my use case, so it is sound, but I'll make it an unsafe function just in case.

However, I'm not sure if str is guaranteed to have the same representation as [u8].

Don't assume that; you don't need transmute.

Which functions are you referring to?

The from_utf8_unchecked method can be used.
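
A minimal sketch of what this could look like, assuming (as stated above) that both slices always come from the same original string. The name try_join_unchecked is made up for illustration. It swaps mem::transmute for std::str::from_utf8_unchecked, so nothing relies on str and [u8] sharing a layout, and it is marked unsafe because it cannot check the same-string requirement itself. As far as I understand it, stricter aliasing models such as Stacked Borrows may still object to reading past s1 through a pointer derived from s1, so treat this as illustrative only.

```rust
use std::{slice, str};

/// Joins two string slices if `s2` starts exactly where `s1` ends.
/// (`try_join_unchecked` is a hypothetical name, not from the thread.)
///
/// # Safety
/// `s1` and `s2` must both be sub-slices of the same original string.
/// Being adjacent in memory by coincidence is not enough.
pub unsafe fn try_join_unchecked<'a>(s1: &'a str, s2: &'a str) -> Option<&'a str> {
    let r1 = s1.as_bytes().as_ptr_range();
    let r2 = s2.as_bytes().as_ptr_range();

    if r1.end == r2.start {
        let len = s1.len() + s2.len();
        // SAFETY: the caller promises both slices come from one string, so the
        // `len` bytes starting at `r1.start` form a single valid UTF-8 region.
        Some(unsafe { str::from_utf8_unchecked(slice::from_raw_parts(r1.start, len)) })
    } else {
        None
    }
}
```

The caller then has to write unsafe { try_join_unchecked(a, b) } and take responsibility for the same-string requirement.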

Right, how did I not see this! Thanks a lot!

Keep the unsplit string slice around to begin with; then there is no need for any unsafe.

Your function cannot guarantee that by itself, so the Rust convention is to mark it as unsafe.

You’ll probably want to also pass a reference to the whole input slice that contains both parts. Something like:

#[allow(clippy::suspicious_operation_groupings)]
pub fn try_join_in<'a>(universe: &'a str, s1: &'a str, s2: &'a str) -> Option<&'a str> {
    let full_range = universe.as_bytes().as_ptr_range();
    let s1 = s1.as_bytes().as_ptr_range();
    let s2 = s2.as_bytes().as_ptr_range();
    (full_range.start <= s1.start && s1.end == s2.start && s2.end <= full_range.end).then(|| {
        let start = s1.start as usize - full_range.start as usize;
        let end = s2.end as usize - full_range.start as usize;
        &universe[start..end]
    })
}

#[test]
fn test() {
    let input = "Hello, World!";
    let x = &input[7..9];
    let y = &input[9..12];
    assert_eq!(x, "Wo");
    assert_eq!(y, "rld");
    assert_eq!(try_join_in(input, x, y), Some("World"));
}

It should hopefully be fairly trivial to keep a &str for the whole input around in a tokenizer and pass it down to the try_join_in call.

Thanks, that's a good idea. However, I probably won't use it; I'm currently using my own string slice type, which looks like this:

pub struct StrSlice {
    start: usize,
    end: usize,
}

It's essentially a Range<usize>, but with some additional methods. Its advantage is that it doesn't borrow the string, so no ownership and lifetime problems. Its disadvantage is that it doesn't borrow the string, so I have to pass the original string to any method that accesses the string slice. I wanted to see if I can get rid of this "hack" and use a normal &str everywhere, but that turned out to be really cumbersome. I got hundreds of errors because Rust methods can't partially borrow, so I'd have to completely rewrite my parser to make it work.
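
For illustration, joining such offset-based slices needs no unsafe at all. The sketch below is only a guess at the shape of the type: the methods new, as_str and try_join are hypothetical names, not the actual methods of my parser.

```rust
/// A byte-offset range into some external string; it stores no reference.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct StrSlice {
    start: usize,
    end: usize,
}

impl StrSlice {
    /// Hypothetical constructor; panics if `start` is greater than `end`.
    pub fn new(start: usize, end: usize) -> Self {
        assert!(start <= end, "start must not exceed end");
        StrSlice { start, end }
    }

    /// Resolves the range against `source`. Panics if the range is out of
    /// bounds or does not fall on char boundaries of `source`.
    pub fn as_str<'a>(&self, source: &'a str) -> &'a str {
        &source[self.start..self.end]
    }

    /// Joins two ranges if `other` starts exactly where `self` ends.
    pub fn try_join(self, other: StrSlice) -> Option<StrSlice> {
        (self.end == other.start).then(|| StrSlice::new(self.start, other.end))
    }
}
```

The trade-off is the one described above: the original string has to be passed to as_str every time the text is needed, and nothing stops a caller from resolving a range against the wrong string.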