How to turn this callback-based code into an iterator? (Without giving up performance!)

I've been working on optimizing the performance my Advent of Code 2023 solutions. Currently I'm working on Day 17, at the center of which is an A* implementation. I'm interested in trying some different shortest path algorithms (e.g. bi-directional A*), so I want to separate the specifics of the Day 17 code from the shortest path code. This brings us to the situation mentioned in the title:

I have a neighbors function that should in some way return the neighbors of a given state. Initially I implemented it to return an iterator:

But this was slower than my original implementation (7.3ms vs 5.7ms). So I tried a more imperative approach, by passing a callback to the neighbors() function instead of returning an iterator:

This was equally fast as my original solution (both are 5.7ms).

I can't figure out why the iterator based approach is slower than the imperative callback-based approach. The code is also too complex to paste into godbolt to figure out that way. I did use cargo-show-asm to inspect and compare the assembly of both solutions, and noticed that the result for the iterator approach was much larger (~1.6x as many lines of output).

I was really just hoping I could get some pointers on how this could be implemented with iterators, without handing in on performance.

You should test if it makes a difference if you rewrite the use-site

into using .for_each method instead of a for loop. These tend to be implemented more efficiently for some iterators (probably including the kinds you use, i.e. IIRC for example flat_map would also benefit; chain is the most canonical example - in general it’s iterators where the next call would need to trace back “into” some more complex context of where exactly the iterator currently is), where repeated calls to next can’t quite reach the efficiency of the callback style of for_each.

(Remember that for loops desugar into a while loop calling next.)

In case you needed early exit, try_for_each is an option, too (doesn’t look like you use any early exit from the loop, though, and then for_each can be even slightly more efficient & simple for some iterators than try_for_each).

Edit: Upon reinspecting the code, I’m also noticing that ..= range iterator is involved. That range also is a type infamous for having an inefficient next method on its iterator. The problem stems from the handling of the corner case when the right-hand-side bound happens to land on the maximal value of the integer type in question, and for that purpose, it includes an extra flag whether the end is reached. That flag would need to be checked on every single .next call then to see if it’s (past) the last iteration.

The two possible approaches to lift this issue are

  • on one hand, usage of methods like for_each or try_for_each or the respective fold or try_fold methods; those can handle the edge-case only once throughout the whole iteration, which is a lot cheaper than doing it for every single element
  • on the other hand, usage of .. ranges. If you know that you won’t have super large integers (i.e. it’s never literally the maximal integer value possible), then you can rewrite your x..=y into x..1+y and the iterator might become significantly more efficient.

So, assuming neither max_steps nor steps_to_edge can ever be u32::MAX, feel free to benchmark if it makes a difference to just do that re-write from (self.min_steps..=self.max_steps.min(steps_to_edge)) into (self.min_steps..1+self.max_steps.min(steps_to_edge)) here

1 Like

Thanks so much! I had actually found both those suggestions while searching for a solution on this forum, and had tried them individually to no avail. But now I've applied both at the same time, my benchmark is showing equivalent times to before again (perhaps even slightly faster).

Edit: for completeness' sake, here is the commit with the changes I ended up making: Day 17: Extract A* into separate module by Mesoptier · Pull Request #11 · Mesoptier/aoc-2023 · GitHub