Splitting &str in place in Rust

How can I split a &str once, into exactly two &strs, by a predicate in Rust, without allocating, invoking UTF-8 checks, consuming the separator, or using unsafe code like get_unchecked()?

"abc".split_once(|c| c == 'b').unwrap()

I want

("a", "bc")

But this returns

("a", "c")

let index = string.find('b').unwrap_or(string.len());
let (head, tail) = string.split_at(index);

Playground
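
For a predicate rather than a fixed char, the same find + split_at idea can be wrapped in a small helper. This is only a sketch: the name split_before is made up for illustration, and it relies on str::find accepting a closure over char as its pattern.

```rust
/// Splits `s` just before the first char matching `pred`, keeping that char
/// at the start of the second half. If nothing matches, the second half is
/// empty. `split_before` is a hypothetical name, not a std API.
fn split_before<F>(s: &str, pred: F) -> (&str, &str)
where
    F: FnMut(char) -> bool,
{
    // `find` returns the byte offset of the first match, which is always
    // a char boundary; `s.len()` is a valid split point too.
    let index = s.find(pred).unwrap_or(s.len());
    // `split_at` checks that `index` is a char boundary and panics otherwise,
    // which cannot happen here.
    s.split_at(index)
}
```

Calling split_before("abc", |c| c == 'b') should give ("a", "bc"), with no allocation and no unsafe.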

5 Likes

I'm afraid there's no stable safe API to do that. find + split_at is usually good enough. Yes, it invokes one UTF-8 check, but I really doubt that would be the bottleneck.

3 Likes

Good catch with split_at(), edited my snippet above.

1 Like

UTF-8 is self-synchronizing, so only a single byte needs checking.
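
To illustrate what that single-byte check amounts to, here is a rough sketch. The helper names starts_char and is_boundary are made up, and the standard library's own is_char_boundary may be implemented differently; the point is only that a continuation byte always has the bit pattern 10xxxxxx, so one byte is enough to decide.

```rust
/// True if `b` can begin a UTF-8 encoded char, i.e. it is not a
/// continuation byte of the form 10xxxxxx. (Hypothetical helper.)
fn starts_char(b: u8) -> bool {
    (b & 0b1100_0000) != 0b1000_0000
}

/// Sketch of a boundary check on a `&str`: looks at exactly one byte,
/// or compares against the length when `index` is at or past the end.
fn is_boundary(s: &str, index: usize) -> bool {
    match s.as_bytes().get(index) {
        Some(&b) => starts_char(b),
        // The end of the string is a boundary; anything past it is not.
        None => index == s.len(),
    }
}
```

For any valid &str this should agree with s.is_char_boundary(index).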

6 Likes

After fumbling around on Compiler Explorer a bit, I don't think any of these optimize into straight up removing the UTF-8 check. But like @tbfleming said, it's a relatively cheap operation on split_at(), so I'll go with that for now.

The compiler doesn't really understand the UTF-8 invariant, so any safe code will necessarily involve a validation check somewhere. In this case, you can prove that the split point is valid through logical argument, so it would be sound to use unsafe to bypass the check (see below). In the vast majority of cases, though, the runtime cost is so low as to be the better tradeoff vs. the risk of a bug in your unsafe logic.

fn split<'h,'n>(haystack: &'h str, needle: &'n str) -> (&'h str, &'h str) {
    let index = haystack.find(needle).unwrap_or(haystack.len());
    let (h,t) = haystack.as_bytes().split_at(index);
    // SAFETY: `needle` represents a valid UTF-8 sequence, and so any index
    //         where it is found represents a valid split point
    unsafe { (
        std::str::from_utf8_unchecked(h),
        std::str::from_utf8_unchecked(t)
    ) }
}

1 Like

Given that you're already linearly scanning the str to find the split point, the overhead of checking whether a single byte is a UTF-8 boundary should be irrelevant. Hardly a reason to introduce unsafe, I would say.

6 Likes

Notably, that byte is definitely in L1 cache already -- since it needed to be loaded to know whether it's a 'b'! -- so the check isn't going to cause memory transfers, which are the things that really cause wall-clock costs.

3 Likes