Best Way to Slice a Unicode String while Iterating

A friend and I are writing a project in which we need to slice strings that may contain non-ASCII characters. We want to slice them while iterating over them, without recomputing the same byte indices on every iteration. We came up with this solution:

use std::str::CharIndices;

pub struct StringSlicer<'a> {
    string: &'a String,
    char_iter: CharIndices<'a>,
    last: usize,
    curr: usize,
}

impl<'a> StringSlicer<'a> {
    pub fn new(string: &String) -> StringSlicer {
        StringSlicer {
            string,
            char_iter: string.char_indices(),
            last: 0,
            curr: 0,
        }
    }

    pub fn take_part(&mut self) -> &str {
        // `self.last` and `self.curr` are always in bounds and on char boundaries
        // (they are 0 or come from `char_indices`), so no check is needed
        let s = unsafe { self.string.get_unchecked(self.last..self.curr) };
        self.last = self.curr;

        s
    }
}

impl<'a> Iterator for StringSlicer<'a> {
    type Item = char;

    fn next(&mut self) -> Option<Self::Item> {
        if let Some((i, j)) = self.char_iter.next() {
            self.curr = i;
            Some(j)
        } else {
            None
        }
    }
}

fn main() {
    let s = "merhaba dünya".to_string();
    let mut w = StringSlicer::new(&s);
    // `for` cannot be used here because both `take_part` and `next` borrow `w` mutably
    while let Some(i) = w.next() {
        if i == 'a' {
            println!("{}", w.take_part());
        }
    }
}

Of course, we could write code like this:

fn main() {
    let word = "merhaba dünya".to_string();
    let mut last = 0;
    for (i, c) in word.char_indices() {
        if c == 'a' {
            println!("{}", unsafe { word.get_unchecked(last..i) });
            last = i;
        }
    }
}

However, we may pass StringSlicer instances to functions as arguments, so we want to store the state in a StringSlicer. We are still not sure whether this is the best approach, so we would like to know your thoughts.

Doesn't this code basically just get the previous character? Couldn't you use .chars().peekable() instead?

No, it doesn't. Even this particular example doesn't produce that output.

By the way, comparing c with the character 'a' in the loop body was just an example; we will do other things there. The main idea is to iterate over a Unicode string and slice it at the same time, without recalculating char indices on every iteration. That is why we don't want to use something like s.chars().skip(start).take(len).

Oh, that’s not what you would be doing if you’re working with the indices you get from .char_indices(). The .char_indices() iterator produces byte offsets for each char. These indices can be used, e.g., for slicing the string like s[start..start+len] in constant time, whereas your .chars().skip... needs char offsets rather than byte offsets (i.e. counting the number of characters/code points, which is what you’d get from .chars().enumerate()), and .chars().skip... also does not run in constant time.
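To make the byte-offset versus char-offset distinction concrete, here is a minimal sketch. The helper names (`byte_offsets`, `char_positions`, `slice_by_bytes`, `slice_by_chars`) are made up for illustration and are not part of the code under discussion.

```rust
/// Byte offset of every char, as yielded by `char_indices`.
fn byte_offsets(s: &str) -> Vec<(usize, char)> {
    s.char_indices().collect()
}

/// Char (code point) position of every char, as yielded by `chars().enumerate()`.
fn char_positions(s: &str) -> Vec<(usize, char)> {
    s.chars().enumerate().collect()
}

/// Slices by byte offsets. No scan over the string is needed, but this
/// panics if an offset is out of bounds or not on a char boundary.
fn slice_by_bytes(s: &str, start: usize, len: usize) -> &str {
    &s[start..start + len]
}

/// Slices by char positions. This has to walk the string from the front,
/// so the cost grows with `start + len`.
fn slice_by_chars(s: &str, start: usize, len: usize) -> String {
    s.chars().skip(start).take(len).collect()
}
```

For "dünya", the `ü` occupies two bytes, so `n` sits at byte offset 3 but char position 2. Both `slice_by_bytes("dünya", 3, 3)` and `slice_by_chars("dünya", 2, 3)` give "nya", but only the second one has to count characters to get there.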

I meant that we are not considering s.chars().skip(start).take(len) as an alternative to our approach, because it recomputes char indices on every iteration. I wasn't talking about using them together. I think you misunderstood my explanation; maybe I was not very clear.

It's more flexible to store a &str than a &String.
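As a rough sketch of that suggestion, the slicer could borrow a `&str` like this. The name `StrSlicer` is hypothetical, and two further changes are my own assumptions rather than something stated above: safe indexing replaces `get_unchecked`, and `curr` is moved to the end of the string once iteration finishes.

```rust
use std::str::CharIndices;

/// Variant of the slicer that borrows a `&str` instead of a `&String`.
pub struct StrSlicer<'a> {
    string: &'a str,
    char_iter: CharIndices<'a>,
    last: usize,
    curr: usize,
}

impl<'a> StrSlicer<'a> {
    pub fn new(string: &'a str) -> Self {
        StrSlicer {
            string,
            char_iter: string.char_indices(),
            last: 0,
            curr: 0,
        }
    }

    /// Returns the text between the previous cut and the current char.
    /// The result borrows from the original string (`'a`), not from `self`.
    pub fn take_part(&mut self) -> &'a str {
        let s: &'a str = self.string;
        // Both offsets are 0, come from `char_indices`, or equal `s.len()`,
        // so they lie on char boundaries and `last <= curr`.
        let part = &s[self.last..self.curr];
        self.last = self.curr;
        part
    }
}

impl<'a> Iterator for StrSlicer<'a> {
    type Item = char;

    fn next(&mut self) -> Option<char> {
        match self.char_iter.next() {
            Some((i, c)) => {
                self.curr = i;
                Some(c)
            }
            None => {
                // At the end, point `curr` past the last char so that a
                // final `take_part` returns the remaining tail.
                self.curr = self.string.len();
                None
            }
        }
    }
}
```

With a `&str` field, `StrSlicer::new` accepts a string literal directly and a `&String` through deref coercion. Because `take_part` returns `&'a str`, the returned pieces can be collected into a `Vec` while the slicer keeps being advanced.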

Sometimes just creating a new iterator (from a substring slice) is nicer than carrying one around, but that may not be the case here.
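One way to read the "new iterator from a substring" idea is sketched below. The helper `cut_before` is a hypothetical name, and skipping the leading char is an assumption made so that a remainder starting with a matching char still makes progress; it also means a match at the very start of the whole string is not reported, which differs slightly from the original loop.

```rust
/// Splits `rest` just before the first char, other than the leading one,
/// for which `pred` returns true. Returns `None` if there is no such char.
fn cut_before(rest: &str, mut pred: impl FnMut(char) -> bool) -> Option<(&str, &str)> {
    // A fresh `char_indices` is created on each call, but it only scans
    // the remaining substring, never the part that was already cut off.
    let (cut, _) = rest.char_indices().skip(1).find(|&(_, c)| pred(c))?;
    Some(rest.split_at(cut))
}

/// Collects every piece produced by repeatedly cutting the remainder.
/// The second element of the result is whatever is left at the end.
fn cut_all(text: &str, mut pred: impl FnMut(char) -> bool) -> (Vec<&str>, &str) {
    let mut parts = Vec::new();
    let mut rest = text;
    while let Some((head, tail)) = cut_before(rest, &mut pred) {
        parts.push(head);
        rest = tail;
    }
    (parts, rest)
}
```

Here the only state is the remaining `&str`, which is cheap to pass to other functions. For "merhaba dünya" with `|c| c == 'a'`, `cut_all` yields the pieces "merh", "ab", "a düny" and leaves "a" as the remainder.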

You are right! We had better change that.

One thing I’m noticing is that you never return the last character of the string. E.g.:

fn main() {
    let s = "merhaba dünya".to_string();
    let mut w = StringSlicer::new(&s);
    for _ in &mut w {}
    println!("{}", w.take_part()); // missing the final `a` in the output
}

If you want to avoid this, you might want to insert something like this:

impl<'a> Iterator for StringSlicer<'a> {
    type Item = char;

    fn next(&mut self) -> Option<Self::Item> {
        if let Some((i, j)) = self.char_iter.next() {
            self.curr = i;
            Some(j)
        } else {
            // vvvv  this line  vvvvvvvvvv
            self.curr = self.string.len();
            None
        }
    }
}

Thank you for the correction!