Split &str in two on a pattern

Hi folks. I'm learning Rust and would appreciate some feedback on this code.

Very often when I am dissecting text, I just want to divide it into two parts: the part before some pattern, and what follows the pattern. For example, with a URL I might want to split on the first colon to get the protocol. (Of course there are already URL parsers out there, but just as an example....)

I find that repeatedly splitting a slice in two like this makes code readable, because you can use destructuring, like so:

if let Some((protocol, rest)) = url.split2(':') {

Or at least you could, if split2() existed. Nesting such if-lets seems much cleaner to me than repeatedly getting Iterators and rummaging through them.

So as an exercise I wrote some code for split2() and rsplit2(), but I'm still a bit new to Rust, so I thought I'd ask for feedback. If you cared about such things, is this the way you would do it? Are there things I've done naively in the code?

And would you use Some for this, or would you create your own enum to get rid of that extra pair of parens:
if let Parts(protocol, rest) = url.split2(':') {
Of course then you'd either have to qualify Parts or use it. Not sure it's worth it.

Thanks for any constructive criticism.

I would definitely use the standard Option type, because it has so many helpers available.

That said, for any non-trivial parser, I'd suggest using one of the zero-copy parsing libraries available. They'd have the same property of returning &str (or &[u8]) for each component without copying the original, but they'd substantially simplify error handling and alternative cases.

I can't tell if you're asking for a code review of what's just a learning exercise, or if you're asking whether we should actually add split2 et al to the Rust standard library (if the latter, what about @KrishnaSannasi's suggestion to use splitn()?)

I noticed you're writing your library around Pattern, which is a comparatively abstract and opaque interface, but all of your examples use strings or characters as needles to split on. If you only need strings (and equivalents), rather than more complicated pattern matching, it might be neater to take advantage of the simpler interface:

fn split_by_first<'a>(s: &'a str, needle: &str) -> Option<(&'a str, &'a str)>
{
    let l = needle.len();
    let break_point = s.find(needle); // Or .rfind
    break_point.map(|b| (&s[..b], &s[b+l..]))
}

Obviously this doesn't scale so nicely if you want anything other than 2 parts, and it isn't at all efficient if the matched string doesn't have a fixed length, since you can no longer skip past the match with a single needle.len().
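
For the variable-length case, one possible workaround is to let match_indices report the matched text, so that its length is read off the match itself. This is only a sketch, not something taken from the playground code, and the helper name split_on_first_whitespace is made up for illustration:

```rust
/// Hypothetical helper: split `s` around its first whitespace character,
/// whatever that character's encoded length happens to be.
fn split_on_first_whitespace(s: &str) -> Option<(&str, &str)> {
    // match_indices yields (byte offset, matched slice) pairs, so the
    // length of the match is known even when it varies between matches.
    let (start, matched) = s.match_indices(char::is_whitespace).next()?;
    Some((&s[..start], &s[start + matched.len()..]))
}
```

An ASCII space is one byte and an ideographic space (U+3000) is three, yet both split cleanly because the skip distance comes from matched.len().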

Also, if I were reading this in a library, I would be disconcerted by it being a trait unto itself. Is the method syntax the only reason for that?

@lxrec AIUI, the OP is asking for a code review, although their playground code already uses splitn()

Yes, I initially wrote this to take a u8 instead of a Pattern so that I could know the length was 1, but it wasn't that hard to make it take a Pattern. I didn't actually have to use the Pattern and Searcher APIs -- I just had to get the function signature right to accept a Pattern argument.

So you would actually prefer that this just be a plain function, rather than defining a trait and an implementation for it? That caught me by surprise -- can you tell me what conditions would cause you to prefer making a trait, and without which you would just offer a plain function? How do you choose? I figured that since invoking every other split variant used method syntax, it should work that way for this one as well.

@lxrec I wouldn't presume to suggest that this be added to the standard library. I did something similar for Scala long ago and liked what it did for my code, so I thought I would try it here, even though I don't have an immediate need for it in Rust. There's nothing wrong with this code:

let mut split2 = url.splitn(2, ':');
if let (Some(protocol), Some(rest)) = (split2.next(), split2.next()) {

But why not DRY out the Iterator parts of it? I find

if let Some((protocol, rest)) = url.split2(':') {

much more readable. What I want to do is about text, not about Iterators.

Also, looking at the Iterator code, one might wonder such things as

We're calling next() twice on an Iterator; what if the first returns None -- is the second UB?

Turns out it's fine, but it's a bit distracting. And

What if the text is "https:" or ":foo"? Is the condition true or false?

Writing split2 hides the former in the implementation and gives me a place to document the latter.
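
A side note for anyone reading this later: as far as I know, Rust 1.52 and newer ship str::split_once and str::rsplit_once in the standard library, and they return this same Option<(&str, &str)> shape. A minimal sketch, where the helper names protocol_of and extension_of are made up for illustration:

```rust
/// Hypothetical helper: the text before the first ':' in `url`, if any.
fn protocol_of(url: &str) -> Option<&str> {
    // split_once returns None when the delimiter is absent.
    let (protocol, _rest) = url.split_once(':')?;
    Some(protocol)
}

/// Hypothetical helper: the text after the last '.' in `file_name`, if any.
fn extension_of(file_name: &str) -> Option<&str> {
    // rsplit_once searches from the end, like the rsplit2() idea above.
    file_name.rsplit_once('.').map(|(_stem, extension)| extension)
}
```

With these, protocol_of("https://example.com") gives Some("https") and extension_of("archive.tar.gz") gives Some("gz").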

You can do so with itertools.

UB is a precisely defined term, and it's more of an end-of-the-world kind of error. In safe Rust it's not possible to trigger UB in any way. Calling .next() again after it has returned None should still be considered a logic error, though.

In Rust, zero-length string slices are valid and common. To me, "https:" is a concatenation of "https" and "" with ":" as the delimiter.
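
To make the edge cases concrete, here is a small sketch of how a splitn-based approach behaves. The name split2 follows the thread, but this body is only a guess at an implementation (written as a plain function), not the OP's playground code:

```rust
/// Sketch of a splitn-based split2; assumed, not the OP's actual code.
fn split2(s: &str, delimiter: char) -> Option<(&str, &str)> {
    let mut parts = s.splitn(2, delimiter);
    // `?` returns early on None, so next() is never called again
    // after the iterator has reported that it is exhausted.
    let before = parts.next()?;
    let after = parts.next()?;
    Some((before, after))
}
```

Under this sketch "https:" gives Some(("https", "")), ":foo" gives Some(("", "foo")), and a string with no colon gives None, so the if-let condition is true whenever the delimiter is present at all.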

Just in case you don't know, &str and char both implement Pattern IIRC so your function can accept either of those and pass them to splitn with no conversion.
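
A quick sketch of what that looks like in practice: the same splitn call takes a char, a &str, or a closure over char. The helper name first_fields is made up for illustration. (As far as I know, naming the Pattern trait in your own function signature still needs a nightly compiler, so this sketch sticks to concrete pattern types.)

```rust
/// Hypothetical helper: the first piece of `line` under three pattern kinds.
fn first_fields(line: &str) -> (Option<&str>, Option<&str>, Option<&str>) {
    // A char pattern.
    let by_char = line.splitn(2, ':').next();
    // A &str pattern.
    let by_str = line.splitn(2, "::").next();
    // A closure pattern: split at the first ASCII digit.
    let by_closure = line.splitn(2, |c: char| c.is_ascii_digit()).next();
    (by_char, by_str, by_closure)
}
```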

I would personally use a trait when I could foresee the method(s) having more than one implementation.
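
As an illustration of the trade-off, here's a rough sketch of both shapes side by side. The names split2_at and Split2 are made up, and the string-needle body is just one possible implementation:

```rust
/// Plain-function form: nothing to import at the call site.
fn split2_at<'a>(s: &'a str, delimiter: &str) -> Option<(&'a str, &'a str)> {
    let start = s.find(delimiter)?;
    Some((&s[..start], &s[start + delimiter.len()..]))
}

/// Extension-trait form: buys method syntax, but callers must bring
/// the trait into scope before `.split2(..)` resolves.
trait Split2 {
    fn split2(&self, delimiter: &str) -> Option<(&str, &str)>;
}

impl Split2 for str {
    fn split2(&self, delimiter: &str) -> Option<(&str, &str)> {
        // Single implementation: simply forwards to the plain function.
        split2_at(self, delimiter)
    }
}
```

Both forms do the same work here; the trait only earns its keep if method syntax matters to you or a second implementation (say, for byte slices) is on the horizon.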
