Rust vs. C vs. Go runtime speed comparison

notoriously harder to optimize.

Sorry, I currently have no idea of the reason. I just asked GPT-4, but its explanations were not really convincing. Can you explain it in some more detail and maybe give a concrete example? A link to an explanation on the internet or a textbook reference would be fine as well.

See for example https://github.com/rust-lang/rust/issues/45222

The root problem is that inclusive ranges need to correctly handle the case where the end point is the maximum value of the current integer type, where simply doing the equivalent of an exclusive range with end + 1 would be incorrect. This leads to inclusive ranges carrying a flag to signal when they are exhausted, and an additional branch and codepath in their Iterator::next method. The optimizer then has to hoist that path out of the loop by noticing it is executed only once, in the last iteration. AFAIK it used to not be able to detect these cases, but maybe it has recently been improved.
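A simplified sketch of the difference (these are hypothetical toy types, not the real `Range`/`RangeInclusive` from std, but they show the extra exhaustion flag and branch described above):

```rust
// Why `a..=b` is harder to optimize than `a..b`: an exclusive range can
// test `start < end` alone, but an inclusive range over e.g. `0..=u8::MAX`
// cannot represent "one past the end", so it needs an extra flag.

struct Exclusive { start: u8, end: u8 }

impl Iterator for Exclusive {
    type Item = u8;
    fn next(&mut self) -> Option<u8> {
        if self.start < self.end {
            let v = self.start;
            self.start += 1; // never overflows: start < end <= u8::MAX
            Some(v)
        } else {
            None
        }
    }
}

struct Inclusive { start: u8, end: u8, exhausted: bool }

impl Iterator for Inclusive {
    type Item = u8;
    fn next(&mut self) -> Option<u8> {
        if self.exhausted || self.start > self.end {
            return None;
        }
        let v = self.start;
        if self.start == self.end {
            // `end` may be u8::MAX, so we cannot do `start = end + 1`;
            // this extra branch is the codepath the optimizer must hoist.
            self.exhausted = true;
        } else {
            self.start += 1;
        }
        Some(v)
    }
}

fn main() {
    let ex: Vec<u8> = Exclusive { start: 253, end: 255 }.collect();
    let inc: Vec<u8> = Inclusive { start: 253, end: 255, exhausted: false }.collect();
    println!("{ex:?}");  // [253, 254]
    println!("{inc:?}"); // [253, 254, 255] — includes u8::MAX without overflow
}
```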


Thank you very much for the detailed explanation, I will study it tomorrow carefully.

Actually, Google just found a related StackOverflow question: clang - Why does iteration over an inclusive range generate longer assembly in Rust than in C++? - Stack Overflow

After more fiddling with the code, I realize that I still get the speed boost with sums.len()-1 if I comment out the println!("c={:?}", c); further down the code!?
Cf. repo.

At this stage one must recognize that the compiler's ways are somewhat unpredictable/difficult to reason about - similar to @steffahn's comment above ("I give up; this may be weird side-effects of benchmarking random compilation effects. Look at this madness")

So for reference: it is good to keep in mind to try several variations, in case one of them compiles sub-optimally...

That is a common problem. C and C++ suffer from it as well.

Try swapping the arguments to cmp::min. I recall an issue where it mattered. Not sure if LLVM has fixed that yet.

The thing of note here is that the cmp::min call is in the loop that is not the one that clearly dominates the run-time of the program.

I have a new (condensed) proposal for the two nested loops:

valid = (0..c-1).all(|i| !sums[i+1..c].contains(&sums[i]));

On my machine, it seems to be faster (and remains safe).

(ok, I just noticed that this was already proposed earlier, but it is not in your perf. table)

> other condensed version: `valid = (0..c-1).all(|i| !sums[i+1..c].contains(&sums[i]));`

@xophe good catch, it should be in the table indeed - as the fastest safe & most compact expression.

Repo updated: GitHub - oscar6echo/rust-c-go-speed

Table:

| compiler | opt_level | runtime | ratio vs. best |
|---|---|---|---|
| gcc | -O3 | 1.54 | 1.0 |
| rust | release v2 unsafe | 1.56 | 1.01 |
| rust | release v3 safe | 2.2 | 1.43 |
| rust | release v3b safe | 2.22 | 1.44 |
| clang | -O3 | 2.27 | 1.47 |
| clang | -O1 | 2.3 | 1.49 |
| go | | 2.34 | 1.52 |
| rust | release v4 safe | 2.62 | 1.7 |
| gcc | -O2 | 2.9 | 1.88 |
| rust | release v5 safe | 2.89 | 1.88 |
| gcc | -O1 | 2.95 | 1.92 |
| clang | -O2 | 4.35 | 2.82 |
| gcc | | 10.1 | 6.56 |
| clang | | 11.53 | 7.49 |
| rust | debug v1 | 16.61 | 10.79 |

Side note:
I have added a parallel version of the algo. It is not in competition of course, but it goes to show that it is (surprisingly) easy and efficient to go parallel (in this search algo's case).

Repo updated.

Ref: the syntax is inspired by the rustlings exercises: rustlings/exercises/20_threads at main · rust-lang/rustlings · GitHub

Table:

| compiler | opt_level | runtime | ratio vs. best non-parallel |
|---|---|---|---|
| rust | release parallel unsafe | 0.41 | 0.27 |
| gcc | -O3 | 1.54 | 1.0 |
| rust | release v2 unsafe | 1.56 | 1.01 |
| rust | release v3 safe | 2.2 | 1.43 |
| rust | release v3b safe | 2.22 | 1.44 |
| clang | -O3 | 2.27 | 1.47 |
| clang | -O1 | 2.3 | 1.49 |
| go | | 2.34 | 1.52 |
| rust | release v4 safe | 2.62 | 1.7 |
| gcc | -O2 | 2.9 | 1.88 |
| rust | release v5 safe | 2.89 | 1.88 |
| gcc | -O1 | 2.95 | 1.92 |
| clang | -O2 | 4.35 | 2.82 |
| gcc | | 10.1 | 6.56 |
| clang | | 11.53 | 7.49 |
| rust | debug v1 | 16.61 | 10.79 |

Before going into parallelization, I’d suggest improvements to the algorithm. As I called out, essentially all of its time is spent in a highly suboptimal, quadratic-time, duplicates check.

A trivial usage of a HashSet (with `use std::collections::HashSet;`) already makes the whole thing go about 10x faster:

valid = true;
let mut set = HashSet::with_capacity(c);
for &s in &sums[..c] {
    if !set.insert(s) {
        valid = false;
        break;
    }
}
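For reference, the property this relies on is that `HashSet::insert` returns `false` when the value was already present, so the duplicate check can bail out at the first collision instead of scanning the rest of `sums`. A standalone sketch of the same pattern (`all_distinct` is just an illustrative name, not from the repo):

```rust
use std::collections::HashSet;

// `insert` returns `false` if the value was already in the set, so the
// short-circuiting `all` stops at the first duplicate.
fn all_distinct(sums: &[u32]) -> bool {
    let mut set = HashSet::with_capacity(sums.len());
    sums.iter().all(|&s| set.insert(s))
}

fn main() {
    assert!(all_distinct(&[1, 4, 9, 16]));
    assert!(!all_distinct(&[1, 4, 9, 4]));
    println!("ok");
}
```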

An even faster representation of the set could get me about another 2-3x

use std::{iter::zip, time::Instant};

pub fn main() {
    println!("start key-gen-face-five");

    // init
    let mut t: u32; // t=trial key, k=index searched key
    let mut c: usize; // sums counter
    let mut valid: bool; // true if key is valid

    let mut sums = [0; 50000]; // array of all possible sums of key[c[1-5]]

    // collects the set of values originally in `sums`
    // value `s` is in the set if the entry `set[s]` contains the current `t` value
    let mut set = Vec::new();

    let mut key = [0, 1, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]; // init keys - empirical

    println!("bootstrap -> keys={:?}", key);

    let start = Instant::now();
    t = 5;
    for k in 3..11 {
        println!("searching for k={}", k);
        // sanity check, `t` is always just increasing;
        // An always unique / fresh `t` value is important for correct representation of the
        // `set` array which thus allows us to skip any step of *clear*ing the set
        // since with incremented `t` all entries are considered empty again.
        assert_eq!(key[k - 1] + 1u32, t + 1);
        t = key[k - 1] + 1u32;
        let interm = Instant::now();
        valid = false;

        while !valid {
            key[k] = t;
            let c_max = k + 1;
            c = 0;

            for (c1, &k1) in zip(0.., key[0..c_max].iter()) {
                for (c2, &k2) in zip(c1.., key[c1..c_max].iter()) {
                    for (c3, &k3) in zip(c2.., key[c2..c_max].iter()) {
                        for (c4, &k4) in zip(c3.., key[c3..c_max].iter()) {
                            for (c5, &k5) in zip(c4.., key[c4..c_max].iter()) {
                                if c1 != c5 {
                                    sums[c] = k1 + k2 + k3 + k4 + k5;
                                    c += 1;
                                }
                            }
                        }
                    }
                }
            }

            valid = true;
            // conservative estimate of largest k1 + k2 + k3 + k4 + k5 value
            // which is always less than 5*t
            set.resize(t as usize * 5, 0);

            for &s in &sums[..c] {
                let e = &mut set[s as usize];
                if *e == t {
                    valid = false;
                    break;
                } else {
                    *e = t;
                }
            }

            if valid {
                println!("key[{}]={:?}", k, t);
                println!("c={:?}", c);
                let end = Instant::now();
                println!("\truntime for key[{}] = {:?}", k, (end - interm));
                println!("\truntime for key[{}] = {:?}", k, (end - start));
            } else {
                t += 1;
            }
        }
        // k += 1;
    }

    let end = Instant::now();
    println!("runtime = {:?}", (end - start));
}

and finally, integrating the duplicate detection with the loop that used to create sums is another 4-5x faster

use std::{iter::zip, time::Instant};

pub fn main() {
    println!("start key-gen-face-five");

    // init
    let mut t: u32; // t=trial key, k=index searched key
    let mut c: usize; // sums counter
    let mut valid: bool; // true if key is valid

    // let mut sums = [0; 50000]; // array of all possible sums of key[c[1-5]]

    // collects the set of values originally in `sums`
    // value `s` is in the set if the entry `set[s]` contains the current `t` value
    let mut set = Vec::new();

    let mut key = [0, 1, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]; // init keys - empirical

    println!("bootstrap -> keys={:?}", key);

    let start = Instant::now();
    t = 5;
    for k in 3..11 {
        println!("searching for k={}", k);
        // sanity check, `t` is always just increasing;
        // An always unique / fresh `t` value is important for correct representation of the
        // `set` array which thus allows us to skip any step of *clear*ing the set
        // since with incremented `t` all entries are considered empty again.
        assert_eq!(key[k - 1] + 1u32, t + 1);
        t = key[k - 1] + 1u32;
        let interm = Instant::now();
        valid = false;

        while !valid {
            key[k] = t;
            let c_max = k + 1;
            c = 0;

            valid = true;
            set.resize(t as usize * 5, 0);
            'outer: for (c1, &k1) in zip(0.., key[0..c_max].iter()) {
                for (c2, &k2) in zip(c1.., key[c1..c_max].iter()) {
                    for (c3, &k3) in zip(c2.., key[c2..c_max].iter()) {
                        for (c4, &k4) in zip(c3.., key[c3..c_max].iter()) {
                            for (c5, &k5) in zip(c4.., key[c4..c_max].iter()) {
                                if c1 != c5 {
                                    let e = &mut set[(k1 + k2 + k3 + k4 + k5) as usize];
                                    if *e == t {
                                        valid = false;
                                        break 'outer;
                                    } else {
                                        *e = t;
                                    }
                                }
                            }
                        }
                    }
                }
            }

            if valid {
                println!("key[{}]={:?}", k, t);
                println!("c={:?}", c);
                let end = Instant::now();
                println!("\truntime for key[{}] = {:?}", k, (end - interm));
                println!("\truntime for key[{}] = {:?}", k, (end - start));
            } else {
                t += 1;
            }
        }
        // k += 1;
    }

    let end = Instant::now();
    println!("runtime = {:?}", (end - start));
}

Well @steffahn, you nailed it! :clap: Fantastic!

It is both clever (using `t` as a stamp in `set`, so no resetting values to zero) and elegant syntax (using `&mut set[sum]` for read then write, `break` on a labeled loop, implied bounds in `zip`). A lot of recipes to take note of.

Repo updated with the optimized algo (v2).
By the way, I also ran the algo up to 13 cards, and added the seven-card version as _v2 scripts.

For the record, the full search now runs in seconds in both cases:

# 5 cards

start key-gen-face-five-v2
bootstrap -> keys=[0, 1, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
runtime = 155.895923ms
key=[0, 1, 5, 22, 94, 312, 992, 2422, 5624, 12522, 19998, 43258, 79415]

# 7 cards
start key-gen-face-seven-v2
bootstrap -> keys=[0, 1, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
runtime = 28.710379477s
key=[0, 1, 5, 22, 98, 453, 2031, 8698, 22854, 83661, 262349, 636345, 1479181]

Hey, I'm really glad I came here ! :+1:


For completeness, here’s a parallelized version that seems to generally work well, now that I’ve determined where to cut off the parallelism for small k.

use std::{iter::zip, sync::Mutex, thread, time::Instant};

pub fn main() {
    println!("start key-gen-face-five");

    // init
    let mut t: u32; // t=trial key, k=index searched key

    // let mut sums = [0; 50000]; // array of all possible sums of key[c[1-5]]
    let sets = &Mutex::<Vec<Vec<u32>>>::default();

    let mut key = [0, 1, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]; // init keys - empirical

    println!("bootstrap -> keys={:?}", key);

    let start = Instant::now();
    t = 5;
    for k in 3..13 {
        println!("searching for k={}", k);
        // sanity check, `t` is always just increasing;
        // An always unique / fresh `t` value is important for correct representation of the
        // `set` array which thus allows us to skip any step of *clear*ing the set
        // since with incremented `t` all entries are considered empty again.
        assert_eq!(key[k - 1] + 1u32, t + 1);
        t = key[k - 1] + 1u32;
        let interm = Instant::now();

        let parallel: usize = std::thread::available_parallelism().unwrap().into();
        // on my machine, spawning threads seemed worth it starting at k == 8
        let n: u32 = if k < 8 { 1 } else { parallel as u32 };
        let found = Mutex::new(None);
        let found_ref = &found;
        thread::scope(|s| {
            for offset in 0..n {
                let mut key = key;
                // collects the set of values originally in `sums`
                // value `s` is in the set if the entry `set[s]` contains the current `t` value
                let mut set: Vec<u32> = sets.lock().unwrap().pop().unwrap_or_default();
                let task = move || {
                    'outer: for (iteration, t) in (t + offset..).step_by(n as usize).enumerate() {
                        if iteration % 128 == 0 {
                            if let Some(t2) = *found_ref.lock().unwrap() {
                                if t2 < t {
                                    break;
                                }
                            }
                        }
                        key[k] = t;
                        let c_max = k + 1;

                        set.resize(t as usize * 5, 0);
                        for (c1, &k1) in zip(0.., key[0..c_max].iter()) {
                            for (c2, &k2) in zip(c1.., key[c1..c_max].iter()) {
                                for (c3, &k3) in zip(c2.., key[c2..c_max].iter()) {
                                    for (c4, &k4) in zip(c3.., key[c3..c_max].iter()) {
                                        for (c5, &k5) in zip(c4.., key[c4..c_max].iter()) {
                                            if c1 != c5 {
                                                let e = &mut set[(k1 + k2 + k3 + k4 + k5) as usize];
                                                if *e == t {
                                                    continue 'outer;
                                                } else {
                                                    *e = t;
                                                }
                                            }
                                        }
                                    }
                                }
                            }
                        }

                        let found = &mut *found_ref.lock().unwrap();
                        match found {
                            None => *found = Some(t),
                            Some(t2) if *t2 > t => *found = Some(t),
                            _ => (),
                        }
                        sets.lock().unwrap().push(set);
                        return;
                    }
                };
                if offset == n - 1 {
                    task();
                } else {
                    s.spawn(task);
                }
            }
        });
        t = found.into_inner().unwrap().unwrap();

        key[k] = t;

        println!("key[{}]={:?}", k, t);
        let end = Instant::now();
        println!("\truntime for key[{}] = {:?}", k, (end - interm));
        println!("\truntime for key[{}] = {:?}", k, (end - start));
    }

    let end = Instant::now();
    println!("runtime = {:?}", (end - start));
}

This isn't related to your parallelization strategy; I have not spent the time to try to understand your parallel code, so I can't confidently say which approach to parallelization would be better.


@steffahn thanks again for the valuable and precise answer!
I do appreciate it :+1:

Your parallel strategy seems to have less overhead - not sure why - so it is about 2x faster for short runtimes (five_v2: 50ms vs 100ms) and 30% faster for long runtimes (seven_v2: 7.25s vs 10.6s).

And I made the following changes before running tests:

  • Apply this test at every iteration, because the compute that follows is expensive:
    `if let Some(t2) = *found_ref.lock().unwrap() { if t2 < t { break; } }`
  • Initialize a fresh set in each thread (`let mut set = vec![];`), because a set popped from `sets` could contain a value equal to the current `t` and interfere with the local sum tests - I think.

All this is really a good source of inspiration !

repo updated.


I thought about whether contention on the Mutex is relevant. I haven’t had the easiest time getting reliable results on that; also, I was too lazy to work up an approach using atomics instead. Of course, realistically, the loop is relatively long in-between each check … it also probably gets worse for the 7 cards… I never went back to testing the impact (positive or negative) of not checking on every loop, and if so, how many iterations are best to skip. (Edit: I tried it out a few more times just now; it’s still no significant difference either way, as far as I can tell, but that probably means there is no significant contention.)
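For what it’s worth, the atomics idea can be sketched with an `AtomicU32` initialized to `u32::MAX` (meaning "nothing found yet") and `fetch_min`; workers then poll with a cheap relaxed load instead of taking a lock. This is only an illustration of the pattern, not a drop-in patch for the repo code: `smallest_valid_parallel` is a made-up name, and `t % 7 == 0` stands in for the real key-validity test.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::thread;

// Search t = start, start+1, ... across n_threads threads (one residue
// class each) and return the smallest "valid" t.
fn smallest_valid_parallel(start: u32, n_threads: u32) -> u32 {
    let found = AtomicU32::new(u32::MAX); // u32::MAX = "nothing found yet"
    thread::scope(|s| {
        for offset in 0..n_threads {
            let found = &found;
            s.spawn(move || {
                for t in (start + offset..).step_by(n_threads as usize) {
                    // cheap early-exit: another thread found a smaller valid t
                    if found.load(Ordering::Relaxed) < t {
                        return;
                    }
                    if t % 7 == 0 {
                        // keep the minimum over all threads, atomically
                        found.fetch_min(t, Ordering::Relaxed);
                        return;
                    }
                }
            });
        }
    });
    found.load(Ordering::Relaxed)
}

fn main() {
    // smallest multiple of 7 that is >= 100 is 105
    assert_eq!(smallest_valid_parallel(100, 4), 105);
    println!("ok");
}
```

The thread owning the residue class of the smallest valid t can never early-exit before reaching it (no thread can publish a smaller value), so the result stays deterministic despite the relaxed ordering.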

Oh, yes, good catch! Of course, searching ahead on multiple threads means that with the next k, a t could have been used before. If skipping the re-use turns out to be significant overhead in any way (most likely not), one could probably incorporate k into the unique value, too… maybe even with some bit manipulation. Or keep a counter for such a “fresh/unique value” separate from t. On second thought, maybe even just have it paired up with the set data structure itself directly.
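That last idea (pairing the fresh/unique counter with the set itself) might look something like this; `StampSet` is a hypothetical name, and overflow of the generation counter is ignored for brevity:

```rust
// A set over `0..capacity` with O(1) clear: instead of zeroing the
// backing array, bump a generation counter owned by the set itself.
// A slot counts as "present" only if it holds the current generation,
// so correctness no longer depends on `t` never being reused.
struct StampSet {
    stamps: Vec<u32>,
    generation: u32,
}

impl StampSet {
    fn new(capacity: usize) -> Self {
        // start at generation 1 so the zeroed slots read as empty
        StampSet { stamps: vec![0; capacity], generation: 1 }
    }

    // O(1) clear: every slot with an older stamp is now "empty"
    // (a real implementation would handle generation overflow)
    fn clear(&mut self) {
        self.generation += 1;
    }

    // returns false if `v` was already inserted this generation
    fn insert(&mut self, v: usize) -> bool {
        if self.stamps[v] == self.generation {
            false
        } else {
            self.stamps[v] = self.generation;
            true
        }
    }
}

fn main() {
    let mut set = StampSet::new(100);
    assert!(set.insert(42));
    assert!(!set.insert(42)); // duplicate within the same generation
    set.clear();              // no memory is rewritten here
    assert!(set.insert(42));  // fresh again
    println!("ok");
}
```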

For reference, please find the code resulting from this conversation:

It contains all the tips discussed above.


Could you make a final update to the runtime table?

Also, I am surprised how much more performance well-written code produces.

Below is the final perf table.

Yes indeed, the perf differences are massive. This is because (1) the initial implementation/compilation was very much "unfinished" and (2) this task is CPU-intensive and quite sensitive to implementation - there is nothing generic about these results.

| algo | compiler | opt_level | runtime | ratio vs. best naive non-parallel |
|---|---|---|---|---|
| optimized | rust | release parallel safe b | 0.07 | 0.05 |
| optimized | rust | release parallel safe a | 0.17 | 0.11 |
| optimized | rust | release safe | 0.19 | 0.12 |
| naive | rust | release parallel unsafe | 0.32 | 0.21 |
| naive | gcc | -O3 | 1.54 | 1.0 |
| naive | rust | release v2 unsafe | 1.56 | 1.01 |
| naive | rust | release v3 safe | 2.2 | 1.43 |
| naive | rust | release v3b safe | 2.22 | 1.44 |
| naive | clang | -O3 | 2.27 | 1.47 |
| naive | clang | -O1 | 2.3 | 1.49 |
| naive | go | | 2.34 | 1.52 |
| naive | rust | release v4 safe | 2.62 | 1.7 |
| naive | gcc | -O2 | 2.9 | 1.88 |
| naive | rust | release v5 safe | 2.89 | 1.88 |
| naive | gcc | -O1 | 2.95 | 1.92 |
| naive | clang | -O2 | 4.35 | 2.82 |
| naive | gcc | | 10.1 | 6.56 |
| naive | clang | | 11.53 | 7.49 |
| naive | rust | debug v1 | 16.61 | 10.79 |

For reference, this problem was a keygen search for a poker crate that is now published.