Iterator over windows of chars


#1

I want to iterate over substrings like this

let whole_str = "some string";

// I want to do this
for substr in whole_str.substrs(2) {
    //...
}

// and get this
for substr in ["so", "om", "me", "e ", ...] {
    // ....
}

Is there a way to do this in the std lib, or is there a crate that does it? I couldn’t find one.

If not, is this something that could be added to the std lib?

EDIT: Just to make it clear, it would work like slice::windows, so the question could be phrased as: is there an equivalent of slice::windows for str that takes into account the fact that chars may be more than 1 byte?
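For comparison, here is a rough std-only sketch of the behavior being asked for: collect the chars into a Vec<char> and let slice::windows do the work. The name char_windows is made up for illustration, and it allocates a String per window, so treat it as a baseline rather than a recommended implementation.

```rust
/// Collects every window of `n` consecutive chars as an owned `String`.
/// `char_windows` is a hypothetical name, not a std function.
fn char_windows(s: &str, n: usize) -> Vec<String> {
    // `slice::windows` panics on a window size of 0, so return early instead.
    if n == 0 {
        return Vec::new();
    }
    // Working on chars (not bytes) keeps multi-byte chars intact.
    let chars: Vec<char> = s.chars().collect();
    chars
        .windows(n)
        .map(|w| w.iter().collect::<String>())
        .collect()
}
```

Calling char_windows("some string", 2) gives a list starting with "so", "om", "me", "e ", and a string with fewer than n chars gives an empty list, as slice::windows does.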


#2

I published a crate for it.

Not the most efficient implementation but does the job.


#3

Nice! I had an answer partially finished which I’m going to go ahead and post. You may or may not want to use parts of it.

Slightly adapted from an answer to this Stack Overflow question:

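// Yields every window of `n` consecutive chars in `line` as a borrowed slice.
// Note: a non-empty `line` with fewer than `n` chars yields one short window
// (unlike slice::windows), and `n == 0` yields empty strings.
fn str_windows(line: &str, n: usize) -> impl Iterator<Item = &str> {
    line.char_indices()
        // Pair each start offset with the offset `n` chars later; the chained
        // `line.len()` closes the final window.
        .zip(line.char_indices().skip(n).chain(Some((line.len(), ' '))))
        .map(move |((i, _), (j, _))| &line[i..j])
}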

Especially since you’ve made a crate of it, I feel it’s more usual to make behavior like this (which extends the capabilities of a type) part of a trait, and implement it for str.
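A minimal sketch of what that extension trait could look like. The names StrWindows and char_windows are invented here, and the boxed iterator is just one way to return an iterator from a trait method; it keeps the same zip-based approach and the same edge-case behavior as the free function.

```rust
/// Hypothetical extension trait; `StrWindows` and `char_windows` are
/// illustrative names, not part of std or of any published crate.
pub trait StrWindows {
    /// Returns every window of `n` consecutive chars as a borrowed slice.
    fn char_windows<'a>(&'a self, n: usize) -> Box<dyn Iterator<Item = &'a str> + 'a>;
}

impl StrWindows for str {
    fn char_windows<'a>(&'a self, n: usize) -> Box<dyn Iterator<Item = &'a str> + 'a> {
        // Byte offset at which each window starts.
        let starts = self.char_indices().map(|(i, _)| i);
        // Byte offset at which each window ends: the start of the char `n`
        // places later, or the end of the string for the final window.
        let ends = self
            .char_indices()
            .map(|(i, _)| i)
            .skip(n)
            .chain(std::iter::once(self.len()));
        Box::new(starts.zip(ends).map(move |(i, j)| &self[i..j]))
    }
}
```

With the trait in scope, the call reads as a method: "some string".char_windows(2).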

But I always feel the need to point out the caveats when dealing with chars, because if you’re not careful you might accidentally be invaded by Bulgaria.

let whole_str = "🇬🇧🇬🇧🇬🇧";

for substr in str_windows(whole_str, 2) {
    print!("{} ", substr);
}

The above prints 🇬🇧 🇧🇬 🇬🇧 🇧🇬 🇬🇧. (For those with limited font rendering capability, these are flag emoji for the UK and Bulgaria. Input looks like UK-UK-UK and output is UK-Bulgaria-UK-Bulgaria-UK.)
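The reason is that each flag is two chars (a pair of regional indicator symbols), so a 2-char window that starts in the middle of one flag pairs its second half with the first half of the next. A small sketch to make that visible; the helper names flag and char_and_byte_len are made up for illustration.

```rust
// A flag emoji is a pair of regional indicator symbols.
const G: char = '\u{1F1EC}'; // REGIONAL INDICATOR SYMBOL LETTER G
const B: char = '\u{1F1E7}'; // REGIONAL INDICATOR SYMBOL LETTER B

/// Builds a flag string from two regional indicator chars.
fn flag(first: char, second: char) -> String {
    [first, second].iter().collect()
}

/// Returns (number of chars, number of UTF-8 bytes) for `s`.
fn char_and_byte_len(s: &str) -> (usize, usize) {
    (s.chars().count(), s.len())
}
```

Here flag(G, B) is the UK flag and flag(B, G) is the Bulgarian one, and char_and_byte_len(&flag(G, B)) returns (2, 8): one visible symbol, two chars, eight bytes.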

Iteration by grapheme clusters is usually the way to go for general purpose text wrangling, but that requires an external dependency:

extern crate unicode_segmentation;

use unicode_segmentation::UnicodeSegmentation;

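// Same approach as the char version, but window boundaries fall on
// extended grapheme clusters (the `true` argument) instead of chars.
fn str_windows(line: &str, n: usize) -> impl Iterator<Item = &str> {
    line.grapheme_indices(true)
        .zip(line.grapheme_indices(true).skip(n).chain(Some((line.len(), ""))))
        .map(move |((i, _), (j, _))| &line[i..j])
}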

(playground)

There may be a way to make unicode-segmentation an optional dependency of your crate, and have e.g. str_windows_graphemes when it’s available. I’ve never published a crate, so I don’t know.
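For what it’s worth, Cargo does support optional dependencies gated behind features, so something along these lines should be possible. This is only a sketch: the feature name graphemes is an assumption, the Cargo.toml part is shown as a comment, and the dep: syntax needs Cargo 1.60 or newer.

```rust
// Cargo.toml for this sketch (the feature name `graphemes` is an assumption):
//
//   [dependencies]
//   unicode-segmentation = { version = "1", optional = true }
//
//   [features]
//   graphemes = ["dep:unicode-segmentation"]

/// Char-based windows; always compiled.
pub fn str_windows(line: &str, n: usize) -> impl Iterator<Item = &str> {
    line.char_indices()
        .zip(line.char_indices().skip(n).chain(Some((line.len(), ' '))))
        .map(move |((i, _), (j, _))| &line[i..j])
}

/// Grapheme-based windows; compiled only when the `graphemes` feature is on,
/// so users who don't enable it never pull in unicode-segmentation.
#[cfg(feature = "graphemes")]
pub fn str_windows_graphemes(line: &str, n: usize) -> impl Iterator<Item = &str> {
    use unicode_segmentation::UnicodeSegmentation;

    line.grapheme_indices(true)
        .zip(line.grapheme_indices(true).skip(n).chain(Some((line.len(), ""))))
        .map(move |((i, _), (j, _))| &line[i..j])
}
```

Users who want the grapheme version would then build with --features graphemes; everyone else gets only str_windows.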


#4

Thanks for the (very useful) help!

I agree about the ol’ grapheme clusters. My use case is looking for various specific ASCII strings; I just pass other strings through.
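For an ASCII-only use case like that, a sketch along these lines might be enough: check is_ascii() up front and then take plain byte windows, since every byte is a whole char. The name ascii_windows and the choice to return None for non-ASCII input (or a window size of 0) are assumptions made for this example.

```rust
/// Returns windows of `n` bytes as `&str`, or `None` if `s` is not pure
/// ASCII or `n` is 0. `ascii_windows` is a hypothetical helper name.
fn ascii_windows(s: &str, n: usize) -> Option<impl Iterator<Item = &str>> {
    // `slice::windows` panics on 0, and byte windows are only safe to turn
    // back into `&str` when every byte is a complete char.
    if n == 0 || !s.is_ascii() {
        return None;
    }
    Some(
        s.as_bytes()
            .windows(n)
            .map(|w| std::str::from_utf8(w).expect("ASCII bytes are valid UTF-8")),
    )
}
```

The caller decides what "pass through" means when None comes back, for example by skipping the search for that string.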