Loops with accumulated state confuse me

I sometimes get confused writing a loop[1], when there is state that doesn't "line up" easily with the iteration elements. This is a general programming question, but it may be Rust-specific as well, since there may be specific iteration functions that help with this pattern. Let me show an example.

In this code, I loop through a sequence of strings and add their characters to a buffer, character by character. Each buffer should have one language, so when the language (a property of a character for this example's sake) changes, I need to process the buffer.

let mut buffer = Buffer::new();
let mut prev_lang: Option<Language> = None;
for s in &strs {
   for c in s.chars() {
      let lang = Some(char_language(c));
      if lang != prev_lang {
          process_buffer(&mut buffer);
          prev_lang = lang;
      }
      buffer.push_char(c)
   }
}
if !buffer.is_empty() {
    process_buffer(&mut buffer);
}

The main point of ugliness is that code from the loop body is then repeated after the loop. With one line (process_buffer(...)), it's not bad, but imagine more code, more conditions. Maybe I could add checks for the "last" state, like if is_last_str && is_last_char .... But that... smells to me.

Is there a pattern here that I can name? What is a way to think about this and clean it up?

I should mention this is performance sensitive, so extra allocation worries me. For example I don't want to preprocess it to create another Vec<String> by language and then loop over that.


  1. Maybe I shouldn't admit that... ↩︎

The easiest way to do this is to use Itertools::chunk_by:

use itertools::Itertools;

fn lang(&c: &char) -> u32 {
    c as u32
}

fn main() {
    let strs = ["aab", "bbc", "dda"];
    
    let chunks = strs
        .iter()
        .flat_map(|str| str.chars())
        .chunk_by(lang);
    for (lang, seq) in &chunks  {
        dbg!(lang, seq.count());
    }
}

outputs

[src/main.rs:12:9] lang = 97
[src/main.rs:12:9] seq.count() = 2
[src/main.rs:12:9] lang = 98
[src/main.rs:12:9] seq.count() = 3
[src/main.rs:12:9] lang = 99
[src/main.rs:12:9] seq.count() = 1
[src/main.rs:12:9] lang = 100
[src/main.rs:12:9] seq.count() = 2
[src/main.rs:12:9] lang = 97
[src/main.rs:12:9] seq.count() = 1

I also hate when this happens to me with any language, not only Rust. But the solution presented by @jdahlstrom is very elegant. Thanks!

Here is a way to do this with a single call site to process_buffer:

    let mut buffer = Buffer::new();
    let mut prev_lang = None;
    let mut chars = strs.iter().flat_map(|s| s.chars());
    loop {
        let next_char = chars.next();
        let lang = next_char.map(char_language);
        if lang != prev_lang {
            process_buffer(&mut buffer);
            prev_lang = lang;
        }
        let Some(c) = next_char else { break };
        buffer.push_char(c);
    }

While @jdahlstrom’s solution is a really elegant one, I would like to point out a more general perspective.

This is the first half of the problem, and the simpler one.

Basically, this is a nested sequence: a sequence of sequences, an array of arrays, two nested loops, or a generator function returning generators. There are many manifestations of this concept.

They all share the idea of flattening: somewhere in the program, you only see the inner item. In your example, that is the body of the nested for loop. That is one of many possible solutions, and it is perfectly fine.

This is the second half of the problem.

“When it changes” means that you are effectively processing the sequence in pairs. There is some function or logic that takes two elements from the (abstract) sequence as parameters.

And at the end, your implicit requirement is that this function or logic should still be called with the last element of the sequence as the first parameter, even though there is obviously no element available for the second parameter. That missing thing is the root cause of the awkward repeated if statement in your example.

A very old technique, with its origins, I think, in list processing, is the introduction of sentinel values at the end (and sometimes the beginning) of a sequence. These values can be concrete instances of the original element type, or special values.

An extremely widely adopted sentinel value, for example, is the terminating NULL byte in strings in many programming languages. It is so common that virtually nobody realizes that the NULL is a sentinel. :wink:

In a language like Rust, you can define better sentinel values by using enums, or simply use a pre-made enum like Option. So, given a (flattened) sequence, you add sentinels and then process it.

A more general example:

fn main() {
    let sequence = [1, 2, 3];
    let iter = sequence.iter().map(Some).chain([None]);
    let mut last = None;
    for item in iter {
        match (last, item) {
            (None, Some(x)) => println!("first: {x}"),
            (Some(a), Some(b)) => println!("pair: ({a}, {b})"),
            (Some(x), None) => println!("last: {x}"),
            _ => unreachable!(),
        }
        last = item;
    }
}

And more concretely for your example (note: your original code processed an empty buffer at the start; this one does not):

fn process_buffer(buf: &mut Vec<char>) {
    println!("process_buffer: {buf:?}");
    buf.clear();
}

fn char_language(c: char) -> u32 {
    c as u32
}

fn main() {
    let strs = ["aab", "bbc", "dda"];
    let chars = strs.iter().flat_map(|s| s.chars());
    let mut buf = Vec::new();
    let mut prev_lang = None;
    for c in chars.map(Some).chain([None]) {
        let lang = c.map(char_language);
        if lang != prev_lang && prev_lang.is_some() {
            process_buffer(&mut buf);
        }
        if let Some(c) = c {
            buf.push(c);
        }
        prev_lang = lang;
    }
    assert!(buf.is_empty());
}

The concept of sentinels can be generalized even further. Different sentinel-like values can be inserted into a sequence at different points to carry information forward to subsequent operations on that sequence.

I'm going to use this in a few places. Thanks.

One issue I have is that my chunk_by key is more expensive to compute, but the there is a faster way to prove it didn't change. Something like:

fn get_key_for_chunking(elt: &X) -> Color {
   let id = get_id(elt); 
   // If `id` is the same as the previous elt, we know the color hasn't changed.
   // But calling compute_styles to get the color is more expensive
   let styles = compute_styles(id);
   styles.color
}

At first, rather than trying to create a custom variation of chunk_by that takes two functions, I will think about ways to memoize that more expensive part (compute_styles).

Just chunk by ids, then.
Later, in loop over chunks, compute your key (color) from id.

I can't; they aren't equivalent. In other words, if an ID hasn't changed, then I know a color hasn't changed. But if an ID changes, the color could still be the same.

One of such ways is a mutable closure, capturing Option<(Id, Color)> to store the last request it has done. @tczajka's solution looks somewhat shorter but less clear on the first sight (type annotations change that).

That makes things more complicated. But you still can chunk by id, and then chunk by color. You will end up with chunks of chunks, you will need to flatten inner chunks to get what you need.

Doesn't itertools::chunk_by allocate in order to create the chunks?

From the itertools::chunk_by docs

If the groups are consumed in order, or if each group’s iterator is dropped without keeping it around, then ChunkBy uses no allocations. It needs allocations only if several group iterators are alive at the same time.

Or does it iterate through the containers multiple times, once for each chunk?

according to the quote you just shared, it does not allocation

here the groups are clearly consumed in order

Yeah, sorry. I missread and missunderstood what chunk_by does.
I thought it would do something like a group_by operation in pandas. That is probably why they renamed the funtion from group_by to chunk_by. I still failed to understand it​:joy: