Optimizing recursion with manually managed stack

See https://github.com/ImmutableBug/recursion_optimization for the code; run cargo bench to reproduce the results.

I have a recursive problem (of the kind that's common in dynamic programming), and I believed that manually managing the stack would be an optimization. However, that doesn't seem to be the case: when benchmarked, the trivial recursive solution is faster. Is there a better way I could be doing this?

I've tested with a few different problems, including a much more complicated one (where I first noticed this). I pretty consistently see a slowdown; in this case, cargo bench shows a 10% difference. I also have a real situation where the recursion depth is a barrier. Ideally, there would be some generator magic to avoid recursion without needing to build the state machine by hand, and those generators could just be stored in a vec (as a stack), since they're all the same size. However, I was having lots of issues getting it to work. Just for fun, I also included an implementation using futures; while this is by far the prettiest, it wastes CPU time heap-allocating each generator individually, rather than (for example) putting them all in the same vec.
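For reference, a futures-based recursion has roughly this shape. This is a toy Fibonacci using the futures crate, not the actual foo3, but it shows where the per-call Box comes from:

    use futures::future::{BoxFuture, FutureExt};

    // Each recursive call must be boxed so the async block's anonymous
    // future has a known size; that Box is the per-call heap allocation.
    fn fib(n: u64) -> BoxFuture<'static, u64> {
        async move {
            if n < 2 {
                n
            } else {
                fib(n - 1).await + fib(n - 2).await
            }
        }
        .boxed()
    }

    fn main() {
        // block_on drives the future to completion on the current thread.
        assert_eq!(futures::executor::block_on(fib(20)), 6765);
    }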

In src/lib.rs:

  • foo1 is the naive implementation, and also the fastest
  • foo2 is the manually managed implementation, slightly slower
  • foo3 is the implementation using futures - very pretty, but slowest

Note that your futures-based version is also recursive: the call to poll on the outer future will internally call poll on each .awaited future, recursively.

Hmm. Didn't know that. Which is a shame, since I care about that one working more than the hand-crafted one.

Intuitively, I know what I want to happen: a vec of generators (fixed size, to avoid reallocating, since each generator must be pinned), where at each step the top generator is inspected, transitions to its next state, is removed if finished, and spawns children if needed; repeat. While I'm fairly sure this is sound, I really don't know how to begin to code it up: I don't know how to communicate that the vec won't resize, so the generators are safe to pin; I don't know how to write the vector's element type, since the generator type is anonymous; etc. I might fight with it more and try to post a non-working example.

The way to communicate it is most likely to use Pin::new_unchecked and just promise it to the compiler, with the consequence of undefined behaviour if you break your promise. That said, it may be possible to do something with Box<[T]>.
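As a sketch, the promise might look like this; PinnedStack and its methods are names I just made up, not anything standard:

    use std::pin::Pin;

    // A stack whose elements can be soundly pinned: the Vec is allocated
    // once and never grows, so elements never move until dropped in place.
    struct PinnedStack<T> {
        items: Vec<T>,
    }

    impl<T> PinnedStack<T> {
        fn with_capacity(cap: usize) -> Self {
            PinnedStack { items: Vec::with_capacity(cap) }
        }

        // Refuse to push rather than reallocate; reallocating would move
        // (and therefore un-pin) every element already on the stack.
        fn push(&mut self, item: T) -> Result<(), T> {
            if self.items.len() == self.items.capacity() {
                return Err(item);
            }
            self.items.push(item);
            Ok(())
        }

        // The promise to the compiler lives here. SAFETY: elements never
        // move, because the Vec never reallocates and pop() drops the top
        // element in place.
        fn top(&mut self) -> Option<Pin<&mut T>> {
            self.items
                .last_mut()
                .map(|item| unsafe { Pin::new_unchecked(item) })
        }

        // Drop the finished top element without moving it. Vec::pop would
        // move the value out, which the pinning contract forbids, so use
        // truncate, which drops in place.
        fn pop(&mut self) {
            let len = self.items.len();
            self.items.truncate(len.saturating_sub(1));
        }
    }

As for the anonymous generator type, it would still have to come in through a single generic parameter (inferred from whatever function produces the generators), since you can't write the type out by name.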

I don’t know from futures, but I took a look at the code for the iterative version.

I’m not sure exactly why it’s slower (maybe something to do with the extra match on every loop iteration), but usually with a dynamic programming approach we would want to build up from the base case.

So you might choose an array of size x*y as your cache, start at (0, 0), and iterate somehow like for row in 0..y { for col in 0..x { ... } } so that by the time you get to any (row, col) position, the left, up-left, and up neighbors are already initialized. Then there’s no need for a hashmap and no need to check whether a previously computed solution is available (the cache holds the value for (rowi, colj) at index rowi*x + colj).
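A generic sketch of that layout, where combine stands in for whatever your real recurrence does with the three neighbors, and the base row/column are simply left at zero:

    // Bottom-up fill of a flat x*y cache; (row, col) lives at row * x + col.
    fn bottom_up(x: usize, y: usize, combine: impl Fn(u64, u64, u64) -> u64) -> u64 {
        assert!(x >= 1 && y >= 1);
        let mut cache = vec![0u64; x * y];
        for row in 1..y {
            for col in 1..x {
                let left = cache[row * x + (col - 1)];
                let up_left = cache[(row - 1) * x + (col - 1)];
                let up = cache[(row - 1) * x + col];
                cache[row * x + col] = combine(left, up_left, up);
            }
        }
        // The answer for the bottom-right cell.
        cache[y * x - 1]
    }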

Yep! I even mentioned that in the code. I'm trying to find a pattern for when that isn't the case, because bottom-up isn't always possible, and in particular isn't possible in my much more complicated real-world case.

Oh, my bad. I think if you can’t pose your problem in a bottom-uppy way, then dynamic programming isn’t a great fit, since it’s based on dividing and conquering subproblems. It might even be worth doing extra “unnecessary” work if you can approximate the set of subproblems somehow.

I see why the futures idea is promising, since it sounds like you want a task queue, and I see that’s what you have in your iterative version. One way to generalize might be to define two functions: one that tells you what recursive calls you need for a particular input, and one to call once those inputs are ready. Then you can iteratively call the first function until you know all the arguments you need (possibly with a dependency relationship among them, but hopefully not, or at least one that can be topologically sorted), and then evaluate them in order.
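A sketch of that split; solve, deps, and eval are names I invented, and the dependency graph is assumed to be acyclic:

    use std::collections::HashMap;
    use std::hash::Hash;

    // deps(input) lists the recursive calls an input needs; eval(input, cache)
    // combines their results once they are all available.
    fn solve<I, O>(
        root: I,
        deps: impl Fn(&I) -> Vec<I>,
        eval: impl Fn(&I, &HashMap<I, O>) -> O,
    ) -> O
    where
        I: Hash + Eq + Clone,
        O: Clone,
    {
        let mut cache: HashMap<I, O> = HashMap::new();
        // Each entry is (input, whether its dependencies were already pushed).
        let mut stack = vec![(root.clone(), false)];
        while let Some((input, expanded)) = stack.pop() {
            if cache.contains_key(&input) {
                continue;
            }
            if expanded {
                // Every dependency has been computed by now; combine them.
                let value = eval(&input, &cache);
                cache.insert(input, value);
            } else {
                // Revisit this input after its dependencies are evaluated.
                stack.push((input.clone(), true));
                for dep in deps(&input) {
                    if !cache.contains_key(&dep) {
                        stack.push((dep, false));
                    }
                }
            }
        }
        cache[&root].clone()
    }

For Fibonacci, say, deps would return [n - 1, n - 2] and eval would add the two cached values.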

I managed to get foo2 to match foo1 (and even come in slightly below it, within one standard deviation):

foo1: 120ms ± 2ms
foo2: 115ms ± 6ms

StackState::InitialRun got replaced with an unfolded recursive function that is called directly instead of being placed on the stack.

The cache is now a Vec mapping (x, y) to index x * N + y, with 0 standing in for None.

The main problem with foo2 was that the match was translated into a jump table, with InitialRun being 50% of the cases. Catching that case in an if let before the match improved performance quite a bit; inlining it and removing it from the stack entirely allowed foo2 to match foo1.
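A simplified sketch of the trick (the real StackState has more variants and different payloads - two variants alone wouldn't produce a jump table - but this shows the if let peel; the toy just sums 1..=n):

    enum StackState {
        InitialRun(u32),
        Combine(u32),
    }

    fn step(stack: &mut Vec<StackState>, acc: &mut u64) {
        let state = stack.pop().expect("step called on an empty stack");
        // Hot path first: InitialRun is ~50% of the states, so peeling it
        // off with an if let avoids going through the jump table for the
        // common case.
        if let StackState::InitialRun(n) = state {
            if n > 0 {
                stack.push(StackState::Combine(n));
                stack.push(StackState::InitialRun(n - 1));
            }
            return;
        }
        match state {
            StackState::Combine(n) => *acc += n as u64,
            StackState::InitialRun(_) => unreachable!("handled above"),
        }
    }

    fn main() {
        let mut stack = vec![StackState::InitialRun(10)];
        let mut acc = 0u64;
        while !stack.is_empty() {
            step(&mut stack, &mut acc);
        }
        assert_eq!(acc, 55); // 1 + 2 + ... + 10
    }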

Ahh! That's so clever! Good catch :smiley:
