Rust vs. C vs. Go runtime speed comparison

I have written the same simple numerical program in Rust, C, and Go, hoping that the Rust runtime would be about as fast as C. It turns out that it is 10x slower than the best compiled C program, and 7x slower than the Go version.

compiler   opt_level   runtime (s)   ratio vs. best
gcc        -O3         1.54          1.0
clang      -O3         2.27          1.47
clang      -O1         2.3           1.49
go         -           2.34          1.52
gcc        -O2         2.9           1.88
gcc        -O1         2.95          1.92
clang      -O2         4.35          2.82
gcc        -           10.1          6.56
clang      -           11.53         7.49
rust       -           16.62         10.79

Everything to reproduce is in this repo: GitHub - oscar6echo/rust-c-go-speed

It is clear that compiler decisions have a big impact on the runtime, as demonstrated by the wide distribution of runtimes produced by the best compiler (gcc).

But Rust is such an outlier here that I MUST have done something VERY suboptimal. However, I have fiddled with several variations - unsafe blocks, get_unchecked array access, Vec instead of array - and the Rust implementation I show in the repo is the fastest I could produce!

Looking forward to expert advice.
As it stands, I am a bit disappointed by Rust's runtime speed.
But I would love to be proven wrong!!

1 Like

Have you enabled optimizations? I.e. compiled with cargo build --release or rustc -Copt-level=3?

1 Like

Just ran it on my crappy phone:
clang -O3: 7 seconds
cargo r -r: 16 seconds

From the README in the Rust directory:

################# runtimes
cargo build --release

cargo run

You first build with --release (and thus optimizations), but then use cargo run which defaults to debug mode and thus first compiles it again and then runs target/debug/rust-runtime-speed-test, which will then be much slower than the release build.
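Concretely, with the paths from the post above (a quick illustration of which binary each command builds and runs):

    cargo build --release    # builds target/release/rust-runtime-speed-test (optimized)
    cargo run                # builds AND runs target/debug/rust-runtime-speed-test (unoptimized!)
    cargo run --release      # builds and runs the optimized release binary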

13 Likes

Have you enabled optimizations? I.e. compiled with cargo build --release or rustc -Copt-level=3 ?

Yes: I do cargo build --release

You first build with --release (and thus optimizations), but then use cargo run which defaults to debug mode and thus first compiles it again and then runs target/debug/rust-runtime-speed-test, which will then be much slower than the release build.

But as pointed out, I was (stupidly :stuck_out_tongue_closed_eyes:) running the debug version.
If I do cargo run --release then the runtime goes down from 16.7s to 5.6s. Thanks a lot for pointing it out :pray:
Repo updated accordingly: GitHub - oscar6echo/rust-c-go-speed

Much better, but "rust --release" is still 3.6x slower than the best C and 2.4x slower than Go.

compiler   opt_level   runtime (s)   ratio vs. best
gcc        -O3         1.54          1.0
clang      -O3         2.27          1.47
clang      -O1         2.3           1.49
go         -           2.34          1.52
gcc        -O2         2.9           1.88
gcc        -O1         2.95          1.92
clang      -O2         4.35          2.82
rust       --release   5.61          3.64
gcc        -           10.1          6.56
clang      -           11.53         7.49
rust       --debug     16.62         10.79

Is there still room for improvement?

1 Like

If you are measuring cargo run --release then you may be measuring compile time as well. Try running the executable in the target/release directory directly and timing that.
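For example (using the binary name from the target/debug path mentioned above):

    cargo build --release
    time ./target/release/rust-runtime-speed-test    # times only the run, not the compilation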

4 Likes

One easy thing to check would be to ensure you're not comparing the perf of printf-equivalents. What happens if you comment out all those log lines?

5 Likes

The time measurements are in the code.

1 Like

Loops with inclusive end (the ones that end in ..=k in your code) are notoriously harder to optimize. Try changing them to ..k+1 instead.
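For illustration, the difference looks like this (a minimal sketch, not taken from the repo):

    // An inclusive range has to be able to yield the upper bound itself
    // without overflowing, which adds state that can hinder loop optimizations:
    for i in 0..=k {
        // ...
    }

    // An exclusive range with an adjusted bound is often easier to optimize
    // (assuming k + 1 does not overflow):
    for i in 0..k + 1 {
        // ...
    }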

7 Likes

If you are measuring cargo run --release then you may be measuring compile time as well. Try running the executable in the target/release directory directly and timing that.

One easy thing to check would be to ensure you're not comparing the perf of printf-equivalents. What happens if you comment out all those log lines?

Indeed, as mentioned by @Bruecki, the print statements are in the code and are very few, so they do not impact the timing meaningfully.

I'm afraid the issue is deeper, maybe in how the compiler handles loops...

Loops with inclusive end (the ones that end in ..=k in your code) are notoriously harder to optimize. Try changing them to ..k+1 instead.

Thanks for pointing it out :+1: I had tried that before, but only with cargo run. Now I just did it with cargo run --release, but: no visible change.
However, I updated the repo with new code & timings.

In both Rust and C you should remove this cold short-circuit in the hot loop:

while valid && j < c {

into:

while j < c {

Then in both Rust and C it results in ~0.4s on my PC.

And if you want to gain another 1 or 2 percent spent on bounds checks, add:

assert!(c < sums.len());

before the hot loops.
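Put together, the suggested shape of the hot loop would be something like this (a hedged sketch reusing the names from the posts above, not the exact repo code):

    // Asserting the bound once up front lets the compiler prove that every
    // index below is in range, so it can elide the per-iteration bounds checks.
    assert!(c < sums.len());
    for i in 0..c - 1 {
        for j in i + 1..c {
            // No short-circuit on `valid` here; the loop always runs to the end.
            if sums[i] == sums[j] {
                valid = false;
            }
        }
    }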

5 Likes

The Rust and C versions are different because the C version uses do ... while (condition) for several loops whereas Rust uses while (condition). These are different in an important way because do ... while loops are guaranteed to run at least once.

Changing the Rust code to use loop ... break for the hot double-loop to match the logic of the C version gives the same output and a substantial speedup.

diff --git a/rust/src/key_gen_face_five.rs b/rust/src/key_gen_face_five.rs
index be2f6cb..76be788 100644
--- a/rust/src/key_gen_face_five.rs
+++ b/rust/src/key_gen_face_five.rs
@@ -67,16 +67,22 @@ pub fn main() {
 
             i = 0;
             valid = true;
-
-            while valid && i < c - 1 {
+            loop {
                 j = i + 1;
-                while valid && j < c {
+                loop {
                     if sums[i] == sums[j] {
                         valid = false;
                     }
                     j += 1;
+                    if !(valid && j < c) {
+                        break;
+                    }
                 }
                 i += 1;
+
+                if !(valid && i < c - 1) {
+                    break;
+                }
             }
 
             // loop {

If you add in bounds-check elimination using get_unchecked + get_unchecked_mut, you'll get an extra speedup at the cost of memory unsafety. Note that to get the benefits you generally have to eliminate all bounds checks in an inner loop. Code like foo[index] = unsafe { bar[b] + bar[c] } won't get the benefit because part of the statement (foo[index]) still requires the bounds check.
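In other words, a version that actually benefits has to make the write unchecked too, e.g. (a sketch built from the inline example above, not code from the repo):

    // All three accesses skip the bounds check; the caller must guarantee
    // that index, b, and c are in range, otherwise this is undefined behavior.
    unsafe {
        *foo.get_unchecked_mut(index) = *bar.get_unchecked(b) + *bar.get_unchecked(c);
    }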

11 Likes

I’ve gotten a significant speedup by writing a more reasonable / Rust-like version:

            valid = true;
            for (i, s) in sums[..c-1].iter().enumerate() {
                if sums[i+1..c].contains(s) {
                    valid = false;
                    break;
                }
            }

In any case, the inner loop here is probably the most important factor for the whole code’s performance, as it runs by far the most often.

Especially the bounds checks in the sums[i] == sums[j] comparison alone seem significant, as @robertknight already pointed out.

9 Likes

Indeed, using the existing commented-out loop based code in your repo and making the bounds check go away

            loop {
                j = i + 1;
                loop {
                    // I’ve only modified the bounds check here
                    if unsafe { sums.get_unchecked(i) == sums.get_unchecked(j) } {
                        valid = false;
                    }
                    j += 1;

                    if !(valid && j < c) {
                        break;
                    }
                }
                i += 1;
                if !(valid && i < c - 1) {
                    break;
                }
            }

also makes it run about on par with C on my machine


Of course, this quadratic-time duplicate detection is suboptimal anyway. It’s also a bit sad that it dominates the entire run-time: the whole other deeply nested loop computing the sums etc. is almost entirely irrelevant, and we’re essentially only benchmarking a quadratic-time duplicate-detection algorithm.
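For comparison, a linear-time alternative could look like this (a hypothetical sketch using a HashSet, assuming the sums are hashable integers; this is not part of the benchmarked code):

    use std::collections::HashSet;

    // Returns true if any value occurs twice: O(n) expected time
    // instead of the O(n^2) pairwise scan.
    fn has_duplicate(sums: &[u64]) -> bool {
        let mut seen = HashSet::with_capacity(sums.len());
        // `insert` returns false when the value was already present.
        sums.iter().any(|s| !seen.insert(s))
    }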

8 Likes

Thanks A LOT for the many answers and tips. I have tried them all and here is the status:

  • re @robertknight's remark: I have indeed changed the code to a do..while, i.e. loop..break in Rust, to compare the exact same algorithm, and the bounds-checking remarks led me to the final, critical extra improvement
  • the adjustments proposed by "mixal-iirec" yield a few percent of perf, so while not negligible, they are not critical either
  • @steffahn: the unsafe access to sums[i] and sums[j] is indeed critical

At this stage I'm on par with the clang -O2 runtime.

In addition, I improved the sums[c] assignment so that the compiler does not perform the costly bounds check:

// without unsafe !
let bounded_c = cmp::min(c, sums.len());
sums[bounded_c] = key[c1] + key[c2] + key[c3] + key[c4] + key[c5];

This divides the time by 3!
Cf. updated repo.

Then I'm on par with gcc -O3, the best C solution!!
:clap: to all. This collective effort cut the run time by 10x in 6 hours!!!
I'm impressed by both Rust's perf AND the community's reactivity :pray:

NOTE:
Strangely:

  • I tried to apply the same trick for sums[i]==sums[j] but it did not improve the perf:
// 1 - test used in debug mode
// if sums[i] == sums[j] {

// 2 - candidate test for release mode - not efficient
// assert!(i < sums.len());
// assert!(j < sums.len());
// if sums[i] == sums[j] {

// 2bis - candidate test for release mode - not efficient
// let bounded_i = cmp::min(i, sums.len());
// let bounded_j = cmp::min(j, sums.len());
// if sums[bounded_i] == sums[bounded_j] {

// 3 - test used in release mode - very efficient
if unsafe { sums.get_unchecked(i) == sums.get_unchecked(j) } {
    valid = false;
}

I do not want to nitpick beyond reason, but if solution 2bis worked it would be just ideal, as 1/ there is nothing unsafe and 2/ it would be as fast as possible (which, usually and here too, is given by gcc -O3).

As I hinted at above, where I already presented the contains method: if the goal isn’t 1-to-1 code “equivalence” (well, and if it is, then unsafe unchecked indexing is of course the way to match C, because that’s what C does!!), then the decent safe way to avoid the bounds check in Rust is to use iterators, at least for the innermost loop, like the .contains method does.

                j = i + 1;
                loop {
                    // 3 - test used in release mode - very efficient
                    if unsafe { sums.get_unchecked(i) == sums.get_unchecked(j) } {
                        valid = false;
                    }
                    j += 1;

                    if !(valid && j < c) {
                        break;
                    }
                }

                // safe, just as fast, use contains

                // if sums[i+1..c].contains(&sums[i]) {
                //     valid = false;
                // }

                // safe, also just as fast, use iterator

                // let sums_i = sums[i];
                // for &sums_j in &sums[i+1..c] {
                //     if sums_i == sums_j {
                //         valid = false;
                //         break;
                //     }
                // }

I’m currently confused why the contains solution from before doesn’t perform as well as I remembered it; I will need to investigate that a little.

1 Like

I give up; this may be a weird side effect of random compilation decisions showing up in the benchmark. Look at this madness:

            valid = !(0..c - 1).any(|i| {
                sums[i]; // wtf, why does adding this line make a HUGE difference?

                sums[i + 1..c].contains(&sums[i])
            });

The access to sums[i] at that point changes the times for me from 3.4s to 2.6s… this thing isn’t even in the innermost loop, the .contains call is the innermost loop.


Edit: It also depends on the usage of the let bounded_c = cmp::min(c, sums.len()) in the upper loop. Take that away, and suddenly the

            for (i, s) in sums[..c - 1].iter().enumerate() {
                if sums[i + 1..c].contains(s) {
                    valid = false;
                    break;
                }
            }

loop is also fast again. :man_shrugging:

2 Likes

@steffahn thanks for insisting!

I tried your various proposals and updated the repo accordingly.

Now the perf table is:

compiler   opt_level           runtime (s)   ratio vs. best
gcc        -O3                 1.54          1.0
rust       release v2 unsafe   1.56          1.01
rust       release v3 safe     2.2           1.43
clang      -O3                 2.27          1.47
clang      -O1                 2.3           1.49
go         -                   2.34          1.52
rust       release v4 safe     2.62          1.7
gcc        -O2                 2.9           1.88
rust       release v5 safe     2.89          1.88
gcc        -O1                 2.95          1.92
clang      -O2                 4.35          2.82
gcc        -                   10.1          6.56
clang      -                   11.53         7.49
rust       debug v1            16.61         10.79

Summary:
The result is very satisfactory:

  • unsafe mode is as fast as can be (gcc -O3)
  • the fastest safe mode is not far behind, and in second position!

Besides, because the compiler behavior is not entirely clear (when chasing perf), several solutions of similar speed are available - maybe for future inspiration.

Thanks to all for participating in this exploration !
:pray:

PS: I started looking into Rust last week and am very impressed by the docs, tooling, perf, and community!

6 Likes

Note that if bounded_c ends up being equal to sums.len() it will still be out of bounds. Maybe you wanted to use sums.len() - 1?

Note that if bounded_c ends up being equal to sums.len() it will still be out of bounds. Maybe you wanted to use sums.len() - 1 ?

Your remark makes perfect sense, but:

  • in this algo the array just needs to be large enough, so it is simpler to define it a bit larger at inception. The point is to avoid the bounds check.
  • more importantly, and REALLY SURPRISINGLY, the runtime doubles if I write:
// slow
let bounded_c = cmp::min(c, sums.len() - 1);

// just as slow if I define at inception:
let sums_max_idx = sums.len() - 1;
// then
let bounded_c = cmp::min(c, sums_max_idx);

So it seems the compiler elides the bounds check only if the index is capped by sums.len(), which is not even the proper limit!!!

This exploration yields another discovery, at least to me.

I wonder how to report this strange behavior to the Rust devs.
Maybe bounding an array index between 0 and len()-1 could be enough to remove bounds checks, even in safe mode...