[Solved] How to split string into multiple sub-strings with given length?


#1
string = "12345678";
sub_len = 2;
sub_string = ["12", "34", "56", "78"]; 

I have tried split_at(&self, mid: usize) -> (&str, &str), but it splits the string into ("12", "345678"), and I am stuck on how to split the second element of the tuple recursively. Could you give me a hint?


#2

Maybe there’s a way to use slice::chunks here, but otherwise you can do something similar manually, e.g.:

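use std::cmp;

let mut v = vec![];
let mut cur = string;
while !cur.is_empty() {
    // Take at most sub_len bytes; the last chunk may be shorter.
    // Note: split_at panics if the index is not on a char boundary.
    let (chunk, rest) = cur.split_at(cmp::min(sub_len, cur.len()));
    v.push(chunk);
    cur = rest;
}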

#3

Nice, but wouldn't Vec::with_capacity(string.len() / sub_len) be better here?

Also, I made a version that uses chunks:

use std::str;
let subs = string.as_bytes()
    .chunks(sub_len)
    .map(str::from_utf8)
    .collect::<Result<Vec<&str>, _>>()
    .unwrap();

#4

Yeah, I thought about that too, but left it out so as not to clutter the example too much.

juggle-tux: "also i made a version with uses chunks"

:thumbsup:


#5

str::from_utf8_unchecked() is better here since we started off with a string, so there’s no need to validate the buffer again:

use std::str;
let subs = string.as_bytes()
    .chunks(sub_len)
    .map(|buf| unsafe { str::from_utf8_unchecked(buf) })
    .collect::<Vec<&str>>();

#6

It may still be invalid if your byte chunks split a multibyte UTF-8 character.
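To make that concrete, here is a minimal sketch of the checked variant from post #3 wrapped in a function, so the failure shows up as an Err instead of undefined behavior. The name byte_chunks_checked is made up for this illustration.

```rust
use std::str;

/// Splits `s` into chunks of `sub_len` bytes and validates each chunk as UTF-8.
/// Returns an error if a chunk boundary falls inside a multibyte character.
/// (Hypothetical helper name, not part of std.)
fn byte_chunks_checked(s: &str, sub_len: usize) -> Result<Vec<&str>, str::Utf8Error> {
    s.as_bytes()
        .chunks(sub_len) // note: `chunks` panics if sub_len == 0
        .map(str::from_utf8)
        .collect()
}
```

With "12345678" and a length of 2 this returns the four expected slices, but with "héllo" the first chunk ends in the middle of the two-byte 'é', so the result is an Err. The unchecked version would hand back an invalid &str in that case.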


#7

Which is why chunks itself might not be a good idea for a general string.
Either the OP somehow knows which kind of characters he is working with and can use chunks/from_utf8_unchecked, or he doesn't and must use the string.chars() iterator (probably with a loop and Iterator::by_ref).


#8

I used from_utf8 and unwrap since split_at will also panic if you try to split inside a multibyte character.
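If a panic is not wanted, one option is to test the index with str::is_char_boundary before splitting. This is only a sketch; try_split_at is a name invented for the example.

```rust
/// Like `str::split_at`, but returns `None` instead of panicking when `mid`
/// is past the end or falls inside a multibyte character.
/// (Hypothetical helper name, not part of std.)
fn try_split_at(s: &str, mid: usize) -> Option<(&str, &str)> {
    // is_char_boundary is true at 0, at s.len(), and at the start of each char;
    // it is false for any index greater than s.len().
    if s.is_char_boundary(mid) {
        Some(s.split_at(mid))
    } else {
        None
    }
}
```

For "héllo", index 1 and index 3 are boundaries, while index 2 lands inside 'é' and gives None.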


#9

And finally a version that is UTF-8 proof:

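fn sub_strings(string: &str, sub_len: usize) -> Vec<&str> {
    // A zero sub_len would divide by zero below and never advance the loop.
    assert!(sub_len > 0, "sub_len must be greater than zero");
    // Capacity hint only: string.len() counts bytes, sub_len counts chars.
    let mut subs = Vec::with_capacity(string.len() / sub_len);
    let mut iter = string.chars();
    let mut pos = 0;

    while pos < string.len() {
        // Sum the byte lengths of the next (up to) sub_len chars.
        let mut len = 0;
        for ch in iter.by_ref().take(sub_len) {
            len += ch.len_utf8();
        }
        // pos and pos + len are both char boundaries, so slicing cannot panic.
        subs.push(&string[pos..pos + len]);
        pos += len;
    }
    subs
}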

edit: using iter.by_ref() is nicer
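For comparison, the same char-based slicing can probably also be written with char_indices, collecting the byte offset of every sub_len-th char and slicing between neighbouring offsets. This is a sketch under the same assumptions as above; the function name is made up to keep it apart from sub_strings.

```rust
/// Splits `s` into slices of `sub_len` chars each (the last may be shorter).
/// (Hypothetical alternative to `sub_strings`; not from any library.)
fn sub_strings_by_index(s: &str, sub_len: usize) -> Vec<&str> {
    // step_by(0) would panic, so reject it up front with a clearer message.
    assert!(sub_len > 0, "sub_len must be greater than zero");
    // Byte offsets at which each chunk starts.
    let mut bounds: Vec<usize> = s
        .char_indices()
        .step_by(sub_len)
        .map(|(i, _)| i)
        .collect();
    // The end of the string closes the last chunk.
    bounds.push(s.len());
    // Each adjacent pair of offsets delimits one chunk.
    bounds.windows(2).map(|w| &s[w[0]..w[1]]).collect()
}
```

It needs a temporary Vec<usize> for the offsets, so it is not obviously better than the loop, just a different shape.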


#10

@vitalyd
@juggle-tux
@FaultyRAM
@cuviper
@tafia
Thanks for your help, I have learned more than one thing from your explanations.


#11

Here’s one more version using the itertools crate (playpen):
Note: this method needs to allocate new Strings for each sub_string, whereas the other examples here return slices of the source.

extern crate itertools;
use itertools::Itertools;

fn sub_strings(source: &str, sub_size: usize) -> Vec<String> {
    source.chars()
        .chunks(sub_size).into_iter()
        .map(|chunk| chunk.collect::<String>())
        .collect::<Vec<_>>()
}

#12

I believe you are not using itertools here.
You also do not need into_iter.
To recap, with a more functional safe version (not necessarily better)

fn main() {
    let string = "12345678";
    let sub_len = 2;
    
    // Case 1: you don't know the data you're playing with
    //
    // Characters may be encoded as one or more bytes (by definition of UTF-8),
    // so you cannot just chunk the bytes and MUST rely on the `chars()` iterator.
    //
    // It also means you cannot return fixed-size slices; here each chunk is an owned String.
    let mut chars = string.chars();
    let sub_string = (0..)
        .map(|_| chars.by_ref().take(sub_len).collect::<String>())
        .take_while(|s| !s.is_empty())
        .collect::<Vec<_>>();
    
    println!("Safe: {:?}", sub_string);
    
    // Case 2: you work with some 'simple' data where you know in advance that
    // all characters will be single byte encoded.
    //
    // In particular, this is true for all US-ASCII characters
    // see https://en.wikipedia.org/wiki/UTF-8
    //
    // Then, and only then, you can be wild and unsafe and crazy fast
    let sub_string = string.as_bytes()
        .chunks(sub_len)
        .map(|s| unsafe { ::std::str::from_utf8_unchecked(s) }) // unsafe is OK only if we are certain every chunk is valid UTF-8
        .collect::<Vec<_>>();
    
    println!("Unsafe: {:?}", sub_string);
    
}

#13

I am indeed using itertools to use the chunks iterator adaptor (which requires another into_iter to iterate over). You can also use std::slice::chunks which you’re using for string.as_bytes().chunks(n)

It won’t be as fast as operating on raw bytes or returning &strs, but it’s another option that reads easy and works as expected (chunking chars).

Yet another version using std::slice::chunks instead of itertools’ chunks adapter (this has to pull the chars into a temp Vec):

let chars: Vec<char> = s.chars().collect();
let split = &chars.chunks(2)
    .map(|chunk| chunk.iter().collect::<String>())
    .collect::<Vec<_>>();
println!("{:?}", split);

#14

One could of course implement str-specific versions of chunking as well.

https://docs.rs/odds/0.2.25/odds/string/trait.StrChunksWindows.html

(usual character discussion).


#15

By which I expect you mean the confusion over what’s even a “character” in Unicode – char being a single code point vs. a visual grapheme that may consist of many chars. So even proper char chunks may end up splitting a combining character from the one it’s modifying.
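A small std-only sketch of that effect: "e" followed by U+0301 (combining acute accent) displays as one accented letter but is two chars, so chunking by one char separates them. The helper name chunk_by_chars is invented for the example; proper grapheme handling would need something outside std.

```rust
/// Chunks `s` by `n` chars (code points), allocating a String per chunk.
/// (Hypothetical helper name, used only for this illustration.)
fn chunk_by_chars(s: &str, n: usize) -> Vec<String> {
    let chars: Vec<char> = s.chars().collect();
    chars
        .chunks(n) // note: `chunks` panics if n == 0
        .map(|chunk| chunk.iter().collect::<String>())
        .collect()
}
```

Calling chunk_by_chars("e\u{301}a", 1) yields three strings: "e", the bare combining accent, and "a", even though a reader sees only two letters.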


#16

Yes! (That’s why it’s named char_chunks, which makes it very upfront about that.) Grapheme chunks comes next, I guess?

It looks like a neat &str into &strs splitter would be needed, something that keeps a take-while-like state.
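One way such a splitter might look, as a rough sketch: an iterator that keeps the unconsumed rest of the string as its state and hands out &str slices lazily. CharChunks and lazy_char_chunks are names invented here and are not the odds API.

```rust
/// Lazy iterator over `size`-char slices of a string.
/// (Hypothetical type, sketched for illustration only.)
struct CharChunks<'a> {
    rest: &'a str, // the part of the string not yet yielded
    size: usize,   // chunk length in chars
}

fn lazy_char_chunks(s: &str, size: usize) -> CharChunks<'_> {
    assert!(size > 0, "chunk size must be greater than zero");
    CharChunks { rest: s, size }
}

impl<'a> Iterator for CharChunks<'a> {
    type Item = &'a str;

    fn next(&mut self) -> Option<&'a str> {
        if self.rest.is_empty() {
            return None;
        }
        // Byte offset of the char that starts the following chunk,
        // or the end of the string if fewer than `size` chars remain.
        let end = self
            .rest
            .char_indices()
            .nth(self.size)
            .map_or(self.rest.len(), |(i, _)| i);
        let (chunk, rest) = self.rest.split_at(end);
        self.rest = rest;
        Some(chunk)
    }
}
```

Usage would be something like lazy_char_chunks("12345678", 2).collect::<Vec<_>>(). It still chunks by char, so the grapheme caveat above applies unchanged.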