Iterator over windows of chars

derekdreery · June 2, 2018, 9:48am

I want to iterate over substrings like this:

let whole_str = "some string";

// I want to do this
for substr in whole_str.substrs(2) {
    //...
}

// and get this
for substr in ["so", "om", "me", "e ", ...] {
    // ....
}

Is there a way to do this in the standard library, or are there any crates that do this? I couldn't find any.

If not, is this something that could be added to the std lib?

EDIT: To be clear, it would work like slice::windows, so the question could be phrased as: is there an equivalent of slice::windows for str that takes into account the fact that chars may be more than 1 byte?
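For context, here is a rough sketch of why slice::windows does not directly answer the question. The helper names `char_vec_windows` and `byte_windows_valid` are made up for illustration. Windows over a `Vec<char>` give `&[char]` rather than `&str`, and windows over the raw bytes can cut a multi-byte char in half.

```rust
/// Windows over a `Vec<char>`. Each window is a `&[char]`, so turning it
/// back into text allocates a new `String` per window.
fn char_vec_windows(s: &str, n: usize) -> Vec<String> {
    let chars: Vec<char> = s.chars().collect();
    // `slice::windows` panics if `n` is 0.
    chars.windows(n).map(|w| w.iter().collect()).collect()
}

/// Windows over the raw bytes. Reports, for each window, whether it is
/// still valid UTF-8 on its own.
fn byte_windows_valid(s: &str, n: usize) -> Vec<bool> {
    s.as_bytes()
        .windows(n)
        .map(|w| std::str::from_utf8(w).is_ok())
        .collect()
}
```

With the input "éa" and a window size of 2, the second byte window starts in the middle of 'é' and is not valid UTF-8, which is the problem a str-aware version has to avoid.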

derekdreery · June 2, 2018, 1:07pm

I published a crate for it.

Not the most efficient implementation but does the job.

trentj · June 2, 2018, 1:27pm

Nice! I had an answer partially finished which I'm going to go ahead and post. You may or may not want to use parts of it.

Slightly adapted from an answer to this Stack Overflow question:

fn str_windows(line: &str, n: usize) -> impl Iterator<Item = &str> {
    line.char_indices()
        .zip(line.char_indices().skip(n).chain(Some((line.len(), ' '))))
        .map(move |((i, _), (j, _))| &line[i..j])
}
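One edge case worth knowing: if line is non-empty but has fewer than n chars, the function above yields a single window shorter than n, whereas slice::windows yields nothing. Below is a minimal sketch of a stricter variant, assuming slice::windows semantics are what you want. The name `str_windows_strict` is hypothetical, and the panic on n == 0 is my choice to mirror slice::windows.

```rust
fn str_windows_strict(line: &str, n: usize) -> impl Iterator<Item = &str> {
    // Mirror `slice::windows`, which panics on a zero window size.
    assert!(n > 0, "window size must be non-zero");
    // Byte offset just past each char. Skipping the first n - 1 of them
    // leaves the end offsets of the first full window, the second, etc.
    let ends = line
        .char_indices()
        .map(|(i, c)| i + c.len_utf8())
        .skip(n - 1);
    // `zip` stops when `ends` runs out, so only full windows are produced.
    line.char_indices()
        .zip(ends)
        .map(move |((i, _), j)| &line[i..j])
}
```

Like the version above, this walks char boundaries only, so it never slices in the middle of a multi-byte char.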

Especially since you've made a crate of it, I feel it's more usual to make behavior like this (which extends the capabilities of a type) part of a trait, and implement it for str.
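A rough sketch of what that could look like. The trait name `StrWindows` and the method name `char_windows` are placeholders I picked, not anything from the published crate, and the boxed return type is just one simple way to return an iterator from a trait method.

```rust
/// Hypothetical extension trait adding char windows to `str`.
pub trait StrWindows {
    /// Iterates over substrings of `n` consecutive chars.
    fn char_windows(&self, n: usize) -> Box<dyn Iterator<Item = &str> + '_>;
}

impl StrWindows for str {
    fn char_windows(&self, n: usize) -> Box<dyn Iterator<Item = &str> + '_> {
        // End offsets: the start of the char n positions ahead, plus the
        // end of the string for the last window.
        let ends = self
            .char_indices()
            .skip(n)
            .map(|(j, _)| j)
            .chain(Some(self.len()));
        Box::new(
            self.char_indices()
                .zip(ends)
                .map(move |((i, _), j)| &self[i..j]),
        )
    }
}
```

With the trait in scope, the call site reads like the method in the original question, e.g. `"abc".char_windows(2)` yields "ab" and then "bc".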

But I always feel the need to point out the caveats when dealing with chars, because if you're not careful you might accidentally be invaded by Bulgaria.

let whole_str = "🇬🇧🇬🇧🇬🇧";

for substr in str_windows(whole_str, 2) {
    print!("{} ", substr);
}

The above prints 🇬🇧 🇧🇬 🇬🇧 🇧🇬 🇬🇧. (For those with limited font rendering capability, these are flag emoji for the UK and Bulgaria. Input looks like UK-UK-UK and output is UK-Bulgaria-UK-Bulgaria-UK)
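To see why this happens: each flag is two chars (regional indicator symbols), so a two-char window that starts on the second half of one flag pairs B with the following G. A small sketch, with the flags written as escapes so it doesn't depend on font rendering; the constants and the `flag_windows` helper are only for illustration.

```rust
fn str_windows(line: &str, n: usize) -> impl Iterator<Item = &str> {
    line.char_indices()
        .zip(line.char_indices().skip(n).chain(Some((line.len(), ' '))))
        .map(move |((i, _), (j, _))| &line[i..j])
}

// Regional indicator symbols: U+1F1EC is "G", U+1F1E7 is "B".
const UK: &str = "\u{1F1EC}\u{1F1E7}"; // G then B
const BULGARIA: &str = "\u{1F1E7}\u{1F1EC}"; // B then G

/// Char windows of size 2 over three UK flags in a row.
fn flag_windows() -> Vec<&'static str> {
    let whole_str: &'static str =
        "\u{1F1EC}\u{1F1E7}\u{1F1EC}\u{1F1E7}\u{1F1EC}\u{1F1E7}";
    str_windows(whole_str, 2).collect()
}
```

Here `UK.chars().count()` is 2, and `flag_windows()` alternates between the UK and Bulgaria sequences, five windows in total.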

Iteration by grapheme clusters is usually the way to go for general purpose text wrangling, but that requires an external dependency:

extern crate unicode_segmentation;

use unicode_segmentation::UnicodeSegmentation;

fn str_windows(line: &str, n: usize) -> impl Iterator<Item = &str> {
    line.grapheme_indices(true)
        .zip(line.grapheme_indices(true).skip(n).chain(Some((line.len(), ""))))
        .map(move |((i, _), (j, _))| &line[i..j])
}


There may be a way to make unicode-segmentation an optional dependency of your crate, and have e.g. str_windows_graphemes when it's available. I've never published a crate, so I don't know.
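For what it's worth, Cargo does support optional dependencies, and an optional dependency implicitly defines a feature of the same name that code can test with `cfg`. A sketch of how the crate might be laid out; the Cargo.toml lines, the version number, and the function name `str_windows_graphemes` are assumptions for illustration, not the published crate's actual setup.

```rust
// Cargo.toml (assumed layout, version number illustrative):
//
//   [dependencies]
//   unicode-segmentation = { version = "1", optional = true }
//
// Users opt in with `features = ["unicode-segmentation"]`.

/// Always available: windows of `n` chars.
pub fn str_windows(line: &str, n: usize) -> impl Iterator<Item = &str> {
    line.char_indices()
        .zip(line.char_indices().skip(n).chain(Some((line.len(), ' '))))
        .map(move |((i, _), (j, _))| &line[i..j])
}

/// Only compiled when the optional dependency is enabled:
/// windows of `n` extended grapheme clusters.
#[cfg(feature = "unicode-segmentation")]
pub fn str_windows_graphemes(line: &str, n: usize) -> impl Iterator<Item = &str> {
    use unicode_segmentation::UnicodeSegmentation;
    line.grapheme_indices(true)
        .zip(line.grapheme_indices(true).skip(n).chain(Some((line.len(), ""))))
        .map(move |((i, _), (j, _))| &line[i..j])
}
```

Without the feature enabled, the grapheme function is simply not compiled, so users who only need the char version don't pay for the extra dependency.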

derekdreery · June 2, 2018, 2:25pm

Thanks for the (very useful) help!

I agree about the ol' grapheme clusters. My use case is searching for specific ASCII strings; I just pass other strings through.
