Rust vs. C vs. Go runtime speed comparison

I have written the same simple numerical program in Rust, C, and Go, hoping that the Rust runtime would be about as fast as C. It turns out that it is 10x slower than the best compiled C program, and 7x slower than the Go version.

compiler   opt_level   runtime (s)   ratio vs. best
gcc        -O3         1.54          1.0
clang      -O3         2.27          1.47
clang      -O1         2.3           1.49
go         -           2.34          1.52
gcc        -O2         2.9           1.88
gcc        -O1         2.95          1.92
clang      -O2         4.35          2.82
gcc        -           10.1          6.56
clang      -           11.53         7.49
rust       -           16.62         10.79

Everything to reproduce is in this repo: GitHub - oscar6echo/rust-c-go-speed

It is clear that compiler decisions have a big impact on the runtime, as demonstrated by the wide distribution of runtimes produced by the best compiler (gcc).

But Rust is such an outlier here that I MUST have done something VERY suboptimal. However, I have fiddled with several variations - unsafe blocks, get_unchecked array access, Vec instead of array - and the Rust implementation I show in the repo is the fastest I could produce!

Looking forward to expert advice.
As it stands, I am a bit disappointed by Rust's runtime speed.
But I would love to be proven wrong!!

1 Like

Have you enabled optimizations? I.e. compiled with cargo build --release or rustc -Copt-level=3?

1 Like

Just ran it on my crappy phone:
clang -O3: 7 seconds
cargo r -r: 16 seconds

From the README in the Rust directory:

################# runtimes
cargo build --release

cargo run

You first build with --release (and thus optimizations), but then use cargo run which defaults to debug mode and thus first compiles it again and then runs target/debug/rust-runtime-speed-test, which will then be much slower than the release build.
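Concretely, with the paths from the post above (a quick illustration of which binary each command builds and runs):

    cargo build --release    # builds target/release/rust-runtime-speed-test (optimized)
    cargo run                # builds AND runs target/debug/rust-runtime-speed-test (unoptimized!)
    cargo run --release      # builds and runs the optimized release binary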

13 Likes

Have you enabled optimizations? I.e. compiled with cargo build --release or rustc -Copt-level=3 ?

Yes: I do cargo build --release

You first build with --release (and thus optimizations), but then use cargo run which defaults to debug mode and thus first compiles it again and then runs target/debug/rust-runtime-speed-test, which will then be much slower than the release build.

But as pointed out, I was (stupidly :stuck_out_tongue_closed_eyes:) running the debug version.
If I do cargo run --release then the runtime goes down from 16.7s to 5.6s. Thanks a lot for pointing it out :pray:
Repo updated accordingly: GitHub - oscar6echo/rust-c-go-speed

Much better, but "rust --release" is still 3.6x slower than the best C and 2.4x slower than Go.

compiler   opt_level   runtime (s)   ratio vs. best
gcc        -O3         1.54          1.0
clang      -O3         2.27          1.47
clang      -O1         2.3           1.49
go         -           2.34          1.52
gcc        -O2         2.9           1.88
gcc        -O1         2.95          1.92
clang      -O2         4.35          2.82
rust       --release   5.61          3.64
gcc        -           10.1          6.56
clang      -           11.53         7.49
rust       --debug     16.62         10.79

Is there still room for improvement?

1 Like

If you are measuring cargo run --release then you may be measuring compile time as well. Try running the executable in the target/release directory directly and timing that.
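For example (using the binary name from the target/debug path mentioned above):

    cargo build --release
    time ./target/release/rust-runtime-speed-test    # times only the run, not the compilation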

4 Likes

One easy thing to check would be to ensure you're not comparing the perf of printf-equivalents. What happens if you comment out all those log lines?

5 Likes

The time measurements are in the code.

1 Like

Loops with inclusive end (the ones that end in ..=k in your code) are notoriously harder to optimize. Try changing them to ..k+1 instead.
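For illustration, the difference looks like this (a minimal sketch, not taken from the repo):

    // An inclusive range has to be able to yield the upper bound itself
    // without overflowing, which adds state that can hinder loop optimizations:
    for i in 0..=k {
        // ...
    }

    // An exclusive range with an adjusted bound is often easier to optimize
    // (assuming k + 1 does not overflow):
    for i in 0..k + 1 {
        // ...
    }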

7 Likes

If you are measuring cargo run --release then you may be measuring compile time as well. Try running the executable in the target/release directory directly and timing that.

One easy thing to check would be to ensure you're not comparing the perf of printf-equivalents. What happens if you comment out all those log lines?

Indeed, as mentioned by @Bruecki, the print statements are in the code and are very few, so they do not impact the timing meaningfully.

I'm afraid the issue is deeper, maybe in how the compiler handles loops...

Loops with inclusive end (the ones that end in ..=k in your code) are notoriously harder to optimize. Try changing them to ..k+1 instead.

Thanks for pointing it out :+1: I had tried that before, but only with cargo run. Now I just did it with cargo run --release, but: no visible change.
However, I updated the repo with new code & timings.

In both Rust and C you should remove this cold short-circuit in the hot loop:

while valid && j < c {

into:

while j < c {

Then in both Rust and C it results in ~0.4s on my PC.

And if you want to gain another 1 or 2 percent spent on bounds checks, add:

assert!(c < sums.len());

before the hot loops.
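Put together, the suggested shape of the hot loop would be something like this (a hedged sketch reusing the names from the posts above, not the exact repo code):

    // Asserting the bound once up front lets the compiler prove that every
    // index below is in range, so it can elide the per-iteration bounds checks.
    assert!(c < sums.len());
    for i in 0..c - 1 {
        for j in i + 1..c {
            // No short-circuit on `valid` here; the loop always runs to the end.
            if sums[i] == sums[j] {
                valid = false;
            }
        }
    }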

5 Likes

The Rust and C versions are different because the C version uses do ... while (condition) for several loops whereas Rust uses while (condition). These are different in an important way because do ... while loops are guaranteed to run at least once.

Changing the Rust code to use loop ... break for the hot double-loop to match the logic of the C version gives the same output and a substantial speedup.

diff --git a/rust/src/key_gen_face_five.rs b/rust/src/key_gen_face_five.rs
index be2f6cb..76be788 100644
--- a/rust/src/key_gen_face_five.rs
+++ b/rust/src/key_gen_face_five.rs
@@ -67,16 +67,22 @@ pub fn main() {
 
             i = 0;
             valid = true;
-
-            while valid && i < c - 1 {
+            loop {
                 j = i + 1;
-                while valid && j < c {
+                loop {
                     if sums[i] == sums[j] {
                         valid = false;
                     }
                     j += 1;
+                    if !(valid && j < c) {
+                        break;
+                    }
                 }
                 i += 1;
+
+                if !(valid && i < c - 1) {
+                    break;
+                }
             }
 
             // loop {

If you add in bounds-check elimination using get_unchecked + get_unchecked_mut, you'll get an extra speedup at the cost of memory unsafety. Note that to get the benefits you generally have to eliminate all bounds checks in an inner loop. Code like foo[index] = unsafe { bar[b] + bar[c] } won't get the benefit because part of the statement (foo[index]) still requires the bounds check.
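In other words, a version that actually benefits has to make the write unchecked too, e.g. (a sketch built from the inline example above, not code from the repo):

    // All three accesses skip the bounds check; the caller must guarantee
    // that index, b, and c are in range, otherwise this is undefined behavior.
    unsafe {
        *foo.get_unchecked_mut(index) = *bar.get_unchecked(b) + *bar.get_unchecked(c);
    }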

11 Likes

I’ve gotten a significant speedup by writing a more reasonable / Rust-like version:

            valid = true;
            for (i, s) in sums[..c-1].iter().enumerate() {
                if sums[i+1..c].contains(s) {
                    valid = false;
                    break;
                }
            }

In any case, the inner loop here is probably the most important factor for the whole code’s performance, as it runs by far the most often.

Especially the bounds checks in the sums[i] == sums[j] comparison alone seem significant, as @robertknight already pointed out.

9 Likes

Indeed, using the existing commented-out loop based code in your repo and making the bounds check go away

            loop {
                j = i + 1;
                loop {
                    // I’ve only modified the bounds check here
                    if unsafe { sums.get_unchecked(i) == sums.get_unchecked(j) } {
                        valid = false;
                    }
                    j += 1;

                    if !(valid && j < c) {
                        break;
                    }
                }
                i += 1;
                if !(valid && i < c - 1) {
                    break;
                }
            }

also makes it run about on par with C on my machine


Of course, this quadratic-time duplicate detection is suboptimal anyway. It’s also a bit sad that it dominates the entire run-time: the whole other deeply nested loop computing the sums etc. is almost entirely irrelevant, and we’re essentially only benchmarking a quadratic-time duplicate-detection algorithm.
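For comparison, a linear-time alternative could look like this (a hypothetical sketch using a HashSet, assuming the sums are hashable integers; this is not part of the benchmarked code):

    use std::collections::HashSet;

    // Returns true if any value occurs twice: O(n) expected time
    // instead of the O(n^2) pairwise scan.
    fn has_duplicate(sums: &[u64]) -> bool {
        let mut seen = HashSet::with_capacity(sums.len());
        // `insert` returns false when the value was already present.
        sums.iter().any(|s| !seen.insert(s))
    }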

8 Likes

Thanks A LOT for the many answers and tips. I have tried them all and here is the status:

  • re @robertknight's remark: I have indeed changed the code to a do..while, i.e. loop..break in Rust, to compare the exact same algorithm, and the bounds-checking remarks led me to the final, critical extra improvement
  • the adjustments proposed by "mixal-iirec" yield a few percent of perf, so while not negligible, they are not critical either
  • @steffahn: the unsafe access to sums[i] and sums[j] is indeed critical

At this stage I'm on par with the clang -O2 runtime.

In addition, I improved the sums[c] assignment so that the compiler does not perform the costly bounds check:

// without unsafe !
let bounded_c = cmp::min(c, sums.len());
sums[bounded_c] = key[c1] + key[c2] + key[c3] + key[c4] + key[c5];

This divides the time by 3!
Cf. updated repo.

Then I'm on par with gcc -O3, the best C solution!!
:clap: to all. This collective effort cut the run time by 10x in 6 hours!!!
I'm impressed by both Rust's perf AND the community's reactivity :pray:

NOTE:
Strangely:

  • I tried to apply the same trick for sums[i]==sums[j] but it did not improve the perf:
// 1 - test used in debug mode
// if sums[i] == sums[j] {

// 2 - candidate test for release mode - not efficient
// assert!(i < sums.len());
// assert!(j < sums.len());
// if sums[i] == sums[j] {

// 2bis - candidate test for release mode - not efficient
// let bounded_i = cmp::min(i, sums.len());
// let bounded_j = cmp::min(j, sums.len());
// if sums[bounded_i] == sums[bounded_j] {

// 3 - test used in release mode - very efficient
if unsafe { sums.get_unchecked(i) == sums.get_unchecked(j) } {
    valid = false;
}

I do not want to nitpick beyond reason, but if solution 2bis worked it would be just ideal, as 1/ there is nothing unsafe and 2/ it would be as fast as possible (which, usually and here too, is given by gcc -O3).

As I hinted at above, where I already presented the contains method: if the goal isn’t 1-to-1 code “equivalence” (well, and if it is, then unsafe unchecked indexing is of course the way to match C, because that’s what C does!!), then the decent safe way to avoid the bounds check in Rust is to use iterators, at least for the innermost loop, like the .contains method does.

                j = i + 1;
                loop {
                    // 3 - test used in release mode - very efficient
                    if unsafe { sums.get_unchecked(i) == sums.get_unchecked(j) } {
                        valid = false;
                    }
                    j += 1;

                    if !(valid && j < c) {
                        break;
                    }
                }

                // safe, just as fast, use contains

                // if sums[i+1..c].contains(&sums[i]) {
                //     valid = false;
                // }

                // safe, also just as fast, use iterator

                // let sums_i = sums[i];
                // for &sums_j in &sums[i+1..c] {
                //     if sums_i == sums_j {
                //         valid = false;
                //         break;
                //     }
                // }

I’m currently confused why the contains solution from before doesn’t perform as well as I remembered it; I will need to investigate that a little.

1 Like

I give up; this may be a weird side effect of random compilation decisions showing up in the benchmark. Look at this madness:

            valid = !(0..c - 1).any(|i| {
                sums[i]; // wtf, why does adding this line make a HUGE difference?

                sums[i + 1..c].contains(&sums[i])
            });

The access to sums[i] at that point changes the times for me from 3.4s to 2.6s… this thing isn’t even in the innermost loop, the .contains call is the innermost loop.


Edit: It also depends on the usage of the let bounded_c = cmp::min(c, sums.len()) in the upper loop. Take that away, and suddenly the

            for (i, s) in sums[..c - 1].iter().enumerate() {
                if sums[i + 1..c].contains(s) {
                    valid = false;
                    break;
                }
            }

loop is also fast again. :man_shrugging:

2 Likes

@steffahn thanks for insisting!

I tried your various proposals and updated the repo accordingly.

Now the perf table is:

compiler   opt_level           runtime (s)   ratio vs. best
gcc        -O3                 1.54          1.0
rust       release v2 unsafe   1.56          1.01
rust       release v3 safe     2.2           1.43
clang      -O3                 2.27          1.47
clang      -O1                 2.3           1.49
go         -                   2.34          1.52
rust       release v4 safe     2.62          1.7
gcc        -O2                 2.9           1.88
rust       release v5 safe     2.89          1.88
gcc        -O1                 2.95          1.92
clang      -O2                 4.35          2.82
gcc        -                   10.1          6.56
clang      -                   11.53         7.49
rust       debug v1            16.61         10.79

Summary:
The result is very satisfactory:

  • unsafe mode is as fast as can be (gcc -O3)
  • the fastest safe mode is not far behind, and in second position!

Besides, because the compiler behavior is not entirely clear (when chasing perf), several solutions of similar speed are available - maybe for future inspiration.

Thanks to all for participating in this exploration !
:pray:

PS: I started looking into Rust last week and am very impressed by the docs, tooling, perf, and community!

6 Likes

Note that if bounded_c ends up being equal to sums.len() it will still be out of bounds. Maybe you wanted to use sums.len() - 1?

Note that if bounded_c ends up being equal to sums.len() it will still be out of bounds. Maybe you wanted to use sums.len() - 1 ?

Your remark makes perfect sense, but:

  • in this algo the array just needs to be large enough, so it is simpler to define it a bit larger at inception. The point is to avoid the bounds check.
  • more importantly, and REALLY SURPRISINGLY, the runtime doubles if I write:
// slow
let bounded_c = cmp::min(c, sums.len() - 1);

// just as slow if I define at inception:
let sums_max_idx = sums.len() - 1;
// then
let bounded_c = cmp::min(c, sums_max_idx);

So it seems the compiler elides the bounds check only if the index is capped by sums.len(), which is not even the proper limit!!!

This exploration yields another discovery, at least to me.

I wonder how to report this strange behavior to the Rust devs.
Maybe bounding an array index between 0 and len()-1 could be enough to remove bounds checks, even in safe mode...