Why doesn't the string split account for the separator appearing more than once?

So this code is behaving a bit unexpectedly to me:

fn main() {
    let s = "1-2-3---4";
    for ss in s.split('-') {
        println!("Split: {}", ss);
    }
}

Producing

Split: 1
Split: 2
Split: 3
Split: 
Split: 
Split: 4

I know this is exactly what is documented, but isn't this something nobody really uses? I do not see any use case for the splitting behaviour the docs describe:

If a string contains multiple contiguous separators, you will end up with empty strings in the output:

let x = "||||a||b|c".to_string();
let d: Vec<_> = x.split('|').collect();

assert_eq!(d, &["", "", "", "", "a", "", "b", "c"]);

Why would this ever be useful to somebody? Perhaps split should treat a run of contiguous separators as one and so avoid producing empty strings?

For example, to parse a CSV file. A CSV file may contain empty fields between commas.
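To make the CSV point concrete, here is a minimal sketch. The `fields` helper and the sample row are made up for illustration, and it ignores quoting and escaping, which real CSV needs:

```rust
// Hypothetical helper: naive comma splitting, no quoting or escaping handled.
fn fields(line: &str) -> Vec<&str> {
    line.split(',').collect()
}

fn main() {
    // The second and fourth columns are empty, but they still occupy a slot.
    let row = fields("alice,,42,");
    assert_eq!(row, ["alice", "", "42", ""]);

    // Column positions stay stable, so index 2 is always the third column.
    assert_eq!(row[2], "42");
}
```

If split collapsed the repeated commas, "42" would slide into the second column and the row would no longer line up with its header.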


Great example, thanks. But shouldn't there also be a way to split without producing empty strings? I think it would be nice.

Just filter the iterator before collecting: .split('|').filter(|s| !s.is_empty()). It has the same complexity.
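Spelled out in full, that could look something like the sketch below. The name `split_non_empty` is only an illustrative wrapper, not something from the standard library:

```rust
// Hypothetical helper: split on `sep` and drop the empty pieces.
fn split_non_empty(s: &str, sep: char) -> Vec<&str> {
    s.split(sep).filter(|part| !part.is_empty()).collect()
}

fn main() {
    // The string from the original question.
    assert_eq!(split_non_empty("1-2-3---4", '-'), ["1", "2", "3", "4"]);

    // The string from the documentation example.
    assert_eq!(split_non_empty("||||a||b|c", '|'), ["a", "b", "c"]);
}
```

The filter runs lazily inside the same iterator chain, so no intermediate collection is built before the final collect.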


It's consistent. The library shouldn't try to second-guess the user; that's a recipe for disaster.

Anyway, you can always filter out empty matches, so it's not a problem.


Wouldn't it be a bit nicer to have a shortcut like split_X instead of doing that, or something else that is perhaps more optimised? After all, split_once could also have been built on top of split, yet it exists.
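As a rough illustration of that claim, split_once can be approximated with splitn. The function name below is hypothetical, and this is not how the standard library implements it:

```rust
// Hypothetical re-implementation, only to show that split_once is
// expressible in terms of splitn.
fn split_once_via_splitn(s: &str, sep: char) -> Option<(&str, &str)> {
    let mut parts = s.splitn(2, sep);
    let head = parts.next()?;
    // If the separator never occurs there is no second piece, so return None.
    let tail = parts.next()?;
    Some((head, tail))
}

fn main() {
    assert_eq!(split_once_via_splitn("key=a=b", '='), Some(("key", "a=b")));
    assert_eq!(split_once_via_splitn("key=a=b", '='), "key=a=b".split_once('='));
    assert_eq!(split_once_via_splitn("no separator", '='), None);
}
```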

We can't put every combination of possible data processing operations in the library. That would be unmanageable. Rather, APIs should do exactly one thing well and be compositional, so that you can mix and match them. Redundant special cases are only warranted for very frequent or non-trivial combinations, and filtering out empty strings is trivial.

I don't know, there's already a lot of split functions on str in the standard library...
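For reference, here is a small sketch of how a few of the existing ones differ. The sample strings are made up; the methods themselves are all on str in the standard library:

```rust
fn main() {
    // Plain split keeps every empty piece between contiguous separators.
    let raw: Vec<&str> = "1--2".split('-').collect();
    assert_eq!(raw, ["1", "", "2"]);

    // split_whitespace collapses runs of whitespace and never yields
    // empty strings, but it only works for whitespace.
    let words: Vec<&str> = "a  b \t c".split_whitespace().collect();
    assert_eq!(words, ["a", "b", "c"]);

    // split_terminator drops the empty string after a trailing separator.
    let items: Vec<&str> = "x;y;".split_terminator(';').collect();
    assert_eq!(items, ["x", "y"]);

    // splitn caps the number of pieces; the last piece keeps the rest.
    let head_tail: Vec<&str> = "1-2-3".splitn(2, '-').collect();
    assert_eq!(head_tail, ["1", "2-3"]);
}
```

So the collapsing behaviour asked about here does exist, but only in the whitespace-specific variant.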

The number of "split" functions there says otherwise. I know what you mean, and I also like the UNIX philosophy, KISS, etc., but there are plenty of others that could also have been just a little bit of code on top of split. Except, perhaps, for rsplit. Honestly, when I looked them over before asking this question, I had a feeling that what I was about to ask for would already be implemented there under some name, but it wasn't.

The number of split functions is arguably a design smell and a sign that there exists a simpler, more orthogonal API underneath. It’s unfortunate we don’t have such an API, but at least we shouldn’t be adding ever more special-case functions.


The other option would be to use Regex::split() from the regex crate; YMMV as to whether that is more or less efficient/convenient than filtering out empty strings (I honestly don't know).
