Optimizing recursion with manually managed stack

See https://github.com/ImmutableBug/recursion_optimization for the code, and use cargo bench for results.

If I have some recursive problem (e.g., common in dynamic programming), I believed that manually managing the stack would be an optimization. However, that doesn't seem to be the case - when benchmarked, the trivial recursive solution is faster. Is there any better way I could be doing this?

I've tested with a few different problems, including a much more complicated one (where I first noticed this). In general, I pretty consistently see a slowdown; in this case, cargo bench shows a 10% difference. I have a real situation where the recursion depth is a barrier as well. Ideally, there would be some generator magic to avoid recursion and avoid needing to build the state machine by hand, and those generators could just be stored in a vec (as a stack), since they're all the same size. However, I was having lots of issues getting it to work. Just for fun, I also included an implementation using futures; while this is by far the prettiest, it wastes CPU time heap allocating each generator individually, rather than (for example) putting them all in the same vec.

In src/lib.rs:

  • foo1 is the naive implementation, and also the fastest
  • foo2 is the manually managed implementation, slightly slower
  • foo3 is the implementation using futures - very pretty, but slowest

Note that your futures-based version is also recursive. The call to poll on the future, will internally call poll on each .awaited future, recursively.

Hmm. Didn't know that. Which is a shame, since I care about that one working more than the hand-crafted one.

Intuitively, I know what I want to happen - a vec of generators (fixed size, to avoid allocating, since each generator must be pinned), and at each step, the top generator is inspected, transitions to the next state, removed if necessary, children spawned if necessary, repeat. While I'm fairly sure this is sound, I really don't know how to begin to code this up - I don't know how to communicate that the vec won't resize, so the generator is safe to pin, I don't know how to write the vector type since the generator is anonymous, etc.. I might fight with it more and try to post a non-working example.

The way to communicate it is most likely to use Pin::new_unchecked and just promise it to the compiler, with the consequence of undefined behaviour if you break your promise. That said, it may be possible to do something with Box<[T]>.

I don’t know from futures, but I took a look at the code for the iterative version.

I’m not sure exactly why it’s slower (maybe something to do with the extra match on every loop), but usually with a dynamic programming approach we would want to build up from the base case.

So you might choose an array of size x*y as your cache, start at 0,0 and iterate somehow like for row in 0..y { for col in 0..x { ... } } so that by the time you get to any row,col position the left, up-left, and up neighbors are already initialized. Then there’s no need for a hashmap and no need to check whether a previously computed solution is available (the cache holds the value for rowi,colj at index rowi*x+colj).

Yep! Even mentioned that in the code. I'm trying to find a pattern when that isn't the case, because it isn't always possible, and in particular isn't possible on my much more complicated real-world case.

Oh my bad. I think if you can’t pose your problem in a bottom-uppy way then dynamic programming isn’t a great fit since it’s based on dividing and conquering subproblems. It might even be worth doing extra “unnecessary” work if you can approximate the set of subproblems somehow.

I see how the futures idea is promising since it sounds like you want a task queue, and I see that’s what you have in your iterative version. One way to generalize might be to define two functions: one that tells you what recursive calls you need for a particular input, and then one to call once those inputs are ready. Then you can iteratively call that first function until you know all the arguments you need (possibly with a dependency relationship but hopefully not, or maybe topologically sorted), and then evaluate those in order.

I managed to get foo2 to (and slightly below within standard derivation) foo1:

foo1: 120ms ± 2ms
foo2: 115ms ± 6ms

StackState::InitialRun got replaced with an unfolded recursive function and is called directly instead of being placed on the stack.

Cache is now a Vec mapping (x, y) to (x * N + y) and 0 to None

The main problem with foo2 was that the match was translated into a jump table with InitialRun being 50% of the cases. Catching that case in an if let before the match improved performance quite a bit. Inlining and removing it allowed to match foo1.

Ahh! That's so clever! Good catch :smiley:

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.