Optimizing Fast Fibonacci Computation

Hi all,

I'm a Rust newbie that's been fooling around with Fibonacci's numbers as a starting point. I have included my implementation to compute Fibonacci numbers quickly for very large indices below. It runs in about 12 seconds for n = 10^9 (on the optimized release build). It's essentially a Rust translation of the Dijikstra's recurrence equation with memoization as implemented by mhw here. The Julia implementation, benchmarked on the same computer with n = 10^9, runs in about 9 seconds. The results are consistent across multiple runs.

With that in mind, my next goal is to optimize the implementation further so I can at least get runtimes that are on par if not better than Julia's. From the look of things it's clear that the Rust implementation is at least on the same order of magnitude in terms of speed. What's left to improve is likely minor mechanical inefficiencies. My thoughts on this right now:

  1. The Rust version is doing an extra hash than what is absolutely necessary due to first checking the existence of the key, then hash again to get the value (if it exists) or hash again in order to insert the new value. I would like to modify this to use the entry.or_insert() pattern to avoid the extra hash. However so far I've always ended up losing the fight against the borrow checker.

  2. Maybe we can avoid some clones?

  3. Maybe we can avoid moving the cache around all the time and instead use borrow? This I am doubtful as the implementation is recursive. Each recursive call must use mutable borrow since new values can potentially be inserted, but you can't have multiple mutable borrows.

I probably missed a trillion other things. Any suggestions on this? Thanks!

use rug::Integer;
use rug::ops::Pow;
use std::collections::HashMap;


pub fn fast_fibonacci(target_n: usize) -> Integer {
    let cache: HashMap<usize, Integer> = HashMap::new();
    let (result, _) = fib_dijk(target_n, cache);
    return result
}


fn is_even(n: usize) -> bool {
    return n & 1 == 0
}


fn fib_dijk_helper(target_n: usize, cache: HashMap<usize, Integer>) -> (Integer, HashMap<usize, Integer>) {
    if target_n <= 1 {
        return (Integer::from(target_n), cache)
    } else {
        let half_n = target_n >> 1;
        let (x, cache) = fib_dijk(half_n, cache);
        let x_2 = Integer::from((&x).pow(2));
        if is_even(target_n) {
            let (y, cache) = fib_dijk(half_n-1, cache);
            let result = 2*x*y + x_2;
            return (result, cache)
        } else {
            let (z, cache) = fib_dijk(half_n+1, cache);
            return (x_2 + z.pow(2), cache)
        }
    }
}

fn fib_dijk(target_n: usize, cache: HashMap<usize, Integer>) -> (Integer, HashMap<usize, Integer>) {
    if cache.contains_key(&target_n) {
        return (cache.get(&target_n).unwrap().clone(), cache)
    } else {
        let (result, mut cache) = fib_dijk_helper(target_n, cache);
        cache.insert(target_n, result.clone());
        return (result, cache)
    }
}

Actually, I'm pretty sure this would just work if you did it in the naive way since the multiple mutable references would overlap in a way that is allowed.

This is probably because you are moving the cache around by ownership. I'm pretty sure this will just work if you pass a mutable reference instead.

  1. Instead of .contains() and then .get().unwrap(), a very idiomatic and Rusty pattern is to

    if let Some(entry) = cache.get(&target_n) {
        return (entry.clone(), cache);
    } else …
    

    But at this point you may encounter a false-positive error of the borrow checker. If that's the case, the Entry::or_insert_with API comes in very handy :slight_smile:

  2. Those clones in the caching function look sensible to me.

  3. I suspect using a borrow should be possible, if only because you can also treat a &mut borrow as a "value type on its own", and your code should still work. The advantage is that the value being moved will be smaller, the disadvantage is that it introduces more indirection, so you should profile that change.

1 Like

This change passes the cache by reference, and eliminates one of the redundant HashMap lookups (but not both). It doesn't make any measurable performance difference on my system, but it does simplify the code a little.

--- a/src/main.rs
+++ b/src/main.rs
@@ -4,9 +4,8 @@ use std::collections::HashMap;
 
 
 pub fn fast_fibonacci(target_n: usize) -> Integer {
-    let cache: HashMap<usize, Integer> = HashMap::new();
-    let (result, _) = fib_dijk(target_n, cache);
-    return result
+    let mut cache: HashMap<usize, Integer> = HashMap::new();
+    fib_dijk(target_n, &mut cache)
 }
 
 
@@ -15,30 +14,29 @@ fn is_even(n: usize) -> bool {
 }
 
 
-fn fib_dijk_helper(target_n: usize, cache: HashMap<usize, Integer>) -> (Integer, HashMap<usize, Integer>) {
+fn fib_dijk_helper(target_n: usize, cache: &mut HashMap<usize, Integer>) -> Integer {
     if target_n <= 1 {
-        return (Integer::from(target_n), cache)
+        Integer::from(target_n)
     } else {
         let half_n = target_n >> 1;
-        let (x, cache) = fib_dijk(half_n, cache);
+        let x = fib_dijk(half_n, cache);
         let x_2 = Integer::from((&x).pow(2));
         if is_even(target_n) {
-            let (y, cache) = fib_dijk(half_n-1, cache);
-            let result = 2*x*y + x_2;
-            return (result, cache)
+            let y = fib_dijk(half_n-1, cache);
+            2*x*y + x_2
         } else {
-            let (z, cache) = fib_dijk(half_n+1, cache);
-            return (x_2 + z.pow(2), cache)
+            let z = fib_dijk(half_n+1, cache);
+            x_2 + z.pow(2)
         }
     }
 }
 
-fn fib_dijk(target_n: usize, cache: HashMap<usize, Integer>) -> (Integer, HashMap<usize, Integer>) {
-    if cache.contains_key(&target_n) {
-        return (cache.get(&target_n).unwrap().clone(), cache)
+fn fib_dijk(target_n: usize, cache: &mut HashMap<usize, Integer>) -> Integer {
+    if let Some(result) = cache.get(&target_n) {
+        result.clone()
     } else {
-        let (result, mut cache) = fib_dijk_helper(target_n, cache);
+        let result = fib_dijk_helper(target_n, cache);
         cache.insert(target_n, result.clone());
-        return (result, cache)
+        result
     }
 }

Or you can optimize both the Rust and the Julia code to use sliding window, which only remembers the last 2 numbers of the sequence to calculate the next one. Also with this approach you can reuse the allocation of the rug::Integer.

1 Like

Hey thanks. I don't know why changing things to borrow didn't work before for me. Probably because I was also trying to use entry.or_insert() pattern and things blew up inside.

@Hyeonu Are you talking about caching fib(n-1) and fib(n-2) in order to compute fib(n)? That's the iterative method and it is WAY slower (O(N)) than Dijk's recurrence algorithm (O(logN)) shown here.

You can't calculate the fibonacci sequence in O(log n) time as you need to calculate every sequences between fib(1) and fib(n) to calculate fib(n + 1).

No that's provably false. Take a look at the helper function (also mhw's post). Dijk's algorithm is essentially an optimized form of the matrix exponentiation method (which is also O(logN)). That's where the magic happens.

Oh I missed the part. Definitely interesting approach.

It seems the iterative approach still is possible by calculating only fib(2^n)s, skipping the caches entirely. Haven't tried myself yet but would be cool experiment.

That is provably false.

My C++ Fibonacci produces the million digits of fibo(4784969) in 0.39 seconds on one core of a Mac M1.

That is slow compared to those that know what they are doing.

That's nice. But my question is really "is there something mechanical I'm doing wrong here that's making this code slower than the Julia implementation?". If the suggestion is to just implement it using BINET formula or call GMP then no offense but that's kinda missing the point.

The stdlib's default hasher is secure but somewhat slow especially for the small input like usize. Try replace the hashmap with the fnv or fxhash crate.

Hmm this is interesting. I did notice that hash performance does differ quite a bit between some languages (Python is surprisingly fast there for example). I'll try it out.

Edit: neither seemed to have made a tangible difference in this case. :frowning:

No offence taken. No, that is not the suggestion.

The code I linked to does not use GMP or any such big number maths library. It's all just C++ code written by myself.

In outline:
Big integers are stored as vectors of 64 bit unsigned integers.
Each element of those vectors is a "digit" of the number in base 1000000000.
The Karasuba algorithm is used to perform fast multiplication of those numbers: Karatsuba algorithm - Wikipedia
There is a "small number optimisation" in that numbers that can be represented in 16 or so of such "digits" are kept in an array on the stack rather than in the vector on the heap.
The actual Fibonacci is calculated using the "Fast Doubling" algorithm as described here: Fast Fibonacci algorithms

This is probably not something you will want to waste time tackling. But my repo does need a fibo(4784969) solution in Rust. As you might have noticed there are solutions in many other languages there. Written by many people in response to a challenge to calculate the million digit fibo in ones favourite language, without using any libraries other than what came out of the box as standard with the language.

A kind of test of coding chops and language performance.

1 Like

My apologies. I thought you wanted me to call C++ from Rust to beat Julia's time haha.

Unfortunately this particular piece of code won't qualify since it makes use of external crates for BigInt. However if someone with a pure homebrew Rust implementation of BigInt wants to take this and marry them together for a submission please feel free to do so.

No worries.

Actually I owe you an apology. Now that I look at you code in the first post properly I see that you are using some version of the "Fast Doubling" algorithm.

This is interesting. The gold standard form for fast fibo is the Fibonacci function provided by GMP. It is (was) the fastest I have found so far. For fibo(4784969) in C on my MacBook Pro M1 I get:

$ time ./fibo
...
...9211500699706378405156269
./fibo  0.10s user 0.01s system 81% cpu 0.127 total

For a fast doubling algorithm in C, similar to yours but with no caching, I get:

$ time ./fibogmp
...9211500699706378405156269
./fibogmp  0.40s user 0.02s system 75% cpu 0.560 total

My C++ effort, no GMP but homebrew bigint maths gets:

$ time ./fibo_karatsuba
...9211500699706378405156269
./fibo_karatsuba  0.39s user 0.01s system 95% cpu 0.408 total

For your Rust code I get:

$ time ./target/release/fibo_fengkehh
9211500699706378405156269
./target/release/fibo_fengkehh  0.10s user 0.00s system 81% cpu 0.131 total

Which makes your Rust implementation as fast as the GMP gold standard fibo function.

Now what was it you wanted to optimise again?

:slight_smile:

Hmm this is interesting. I mean I knew it was decently fast but didn't think it would be this fast. However it should be taken with a giant grain of salt as it's using Integer from rug which I believe links to GMP, MPFR etc for its large number implementation. If BigInt operations are often the speed bottleneck this is certainly cheating.

If it's not too much trouble, can you bench the Julia code from mhw in the first post as well? My main gripe with my code is really that it's slower than Julia by 3 to 4 seconds when n = 10^9, with the latter being significantly easier to code to boot. Julia also uses GMP for its BigInt so at least I think comparison between the two is fair.

We have to be careful what we are timing here.

Unfortunately I'm using a MacBook Pro M1 and there is no Julia available for it's Arm processor yet. Unless I build it from source. So I have to run it under "Rossetta" which does a translation from x86 to Arm on the fly. This is likely to be slower than a true Arm build.

A ran that Julia code for fibo(4784969) with printing of the million digit result.
I got timings like this:

julia fibo.jl  0.46s user 1.17s system 370% cpu 0.440 total
./fibo_fengkehh  0.10s user 0.01s system 80% cpu 0.135 total

Which indicates your Rust is quicker. Modulo the rosetta translation issue.

Without printing the results I get timings like:

julia fibo.jl  0.25s user 0.53s system 262% cpu 0.299 total
./fibo_fengkehh  0.04s user 0.00s system 95% cpu 0.046 total

Which makes your Rust even faster !

Or you could just avoid hashing altogether and store the numbers in a Vec instead. I don't know whether this version actually has the same recurrence structure as the naïve definition, but my gut instinct is that a sporadic Vec::resize() and a comparison to a sentinel value, e.g. 0, would be way faster than any sort of hashing scheme.

2 Likes

If I'm not mistaken, the rug crate acts as a higher level frontend for GMP, so the fact that its performance is similar is not surprising.

I have actually sent you a pull request for a Rust version of fibo_4784969, heavily inspired by your C++ version. It can by now use a number of different backends (ibig, gmp, num_bigint). If I'm not mistaken, using Eric Olson's doubling formulae, you will not benefit from a cache because at every step in the recursion you only need to recurse with half the previous index.

My implementation can be found on GitHub. I have not yet added a rug implementation your algorithm, and still need to do a better analysis, but comparing your algorithm using the gmp crate with @fengkehh 's caching algorithm above shows that the total running time using your algorithm is slightly lower than using the above Dijkstra algorithm (0.14 vs 0.15 s). Given that I expect that by far most of the running time is in printing out the result, I think your algorithm is quite a bit faster in the computation. Though again, I still need to check this.

EDIT: ok, the difference is a bit smaller than I expected, but measurable:

~> ./target/release/fibo_4784969 -b gmp > fibo.out
computing F(4784969): 0.049s
printing F(4784969): 0.093s

~> ./target/release/fibo_4784969 -b rug2 > fibo.out
computing F(4784969): 0.064s
printing F(4784969): 0.094s