I have written the same simple numerical program in Rust, C, and Go, hoping that the Rust version would be about as fast as C. It turns out to be 10x slower than the best-compiled C program, and 7x slower than the Go version.
It is clear that compiler decisions have a big impact on the runtime, as demonstrated by the wide distribution of runtimes across optimization levels of the best compiler (gcc).
But Rust is such an outlier here that I MUST have done something VERY suboptimal. However, I have fiddled with several variations (unsafe blocks, get_unchecked array access, Vec instead of array) and the Rust implementation I show in the repo is the fastest I could produce!
Looking forward to expert advice.
As it stands, I am a bit disappointed by Rust's runtime speed.
But I would love to be proven wrong !!
################# runtimes
cargo build --release
cargo run
You first build with --release (and thus optimizations), but then use cargo run which defaults to debug mode and thus first compiles it again and then runs target/debug/rust-runtime-speed-test, which will then be much slower than the release build.
Have you enabled optimizations? I.e. compiled with cargo build --release or rustc -Copt-level=3 ?
Yes: I do cargo build --release
But as pointed out, I was (stupidly) running the debug version.
If I do cargo run --release then the runtime goes down from 16.7s to 5.6s. Thanks a lot for pointing it out!
Repo updated accordingly: GitHub - oscar6echo/rust-c-go-speed
Much better, but still: "rust --release" is 3.6x slower than the best C and 2.4x slower than Go.
If you are measuring cargo run --release then you may be measuring compile time as well. Try running the executable in the target/release directory directly and timing that.
One easy thing to check would be to ensure you're not comparing the perf of printf-equivalents. What happens if you comment out all those log lines?
Indeed, as mentioned by @Bruecki, the print statements in the code are very few, so they do not impact the timing meaningfully.
I'm afraid the issue is deeper, perhaps with how the compiler handles loops..
Loops with inclusive end (the ones that end in ..=k in your code) are notoriously harder to optimize. Try changing them to ..k+1 instead.
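For illustration, here are the two loop shapes side by side (a standalone sketch; the identifiers are hypothetical, not from the repo). The inclusive form must guard against the end value being the type's maximum, which gives the loop a more awkward shape for the optimizer:

```rust
fn main() {
    let k = 10u32;

    // Inclusive range: the iterator must handle the case k == u32::MAX,
    // where k + 1 would overflow, so the loop shape is harder to optimize.
    let mut a = 0u32;
    for i in 0..=k {
        a += i;
    }

    // Half-open range: a plain counted loop, generally easier to optimize.
    // Note this rewrite is only equivalent when k + 1 cannot overflow.
    let mut b = 0u32;
    for i in 0..k + 1 {
        b += i;
    }

    assert_eq!(a, b);
    assert_eq!(a, 55); // sum of 0..=10
}
```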
Thanks for pointing it out. I had tried that before, but only with cargo run. Now I just did it with cargo run --release, but: no visible change.
However I updated the repo with new code & timings.
The Rust and C versions are different because the C version uses do ... while (condition) for several loops whereas Rust uses while (condition). These are different in an important way because do ... while loops are guaranteed to run at least once.
Changing the Rust code to use loop ... break for the hot double-loop to match the logic of the C version gives the same output and a substantial speedup.
diff --git a/rust/src/key_gen_face_five.rs b/rust/src/key_gen_face_five.rs
index be2f6cb..76be788 100644
--- a/rust/src/key_gen_face_five.rs
+++ b/rust/src/key_gen_face_five.rs
@@ -67,16 +67,22 @@ pub fn main() {
         i = 0;
         valid = true;
-
-        while valid && i < c - 1 {
+        loop {
             j = i + 1;
-            while valid && j < c {
+            loop {
                 if sums[i] == sums[j] {
                     valid = false;
                 }
                 j += 1;
+                if !(valid && j < c) {
+                    break;
+                }
             }
             i += 1;
+
+            if !(valid && i < c - 1) {
+                break;
+            }
         }
         // loop {
If you add bounds-check elimination using get_unchecked + get_unchecked_mut, you'll get an extra speedup at the cost of memory unsafety. Note that to get the benefits you generally have to eliminate all bounds checks in an inner loop. Code like foo[index] = unsafe { *bar.get_unchecked(b) + *bar.get_unchecked(c) } won't get the full benefit because part of the statement (foo[index]) still requires a bounds check.
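To make the point about the write side concrete, here is a standalone sketch (the functions and names are hypothetical, not from the repo): the first version leaves the write checked, while the second eliminates every check in the loop body.

```rust
fn add_pairs(foo: &mut [u32], bar: &[u32]) {
    for i in 0..foo.len().min(bar.len() - 1) {
        // Mixed form: the reads are unchecked, but the write foo[i]
        // still goes through the normal bounds check.
        foo[i] = unsafe { *bar.get_unchecked(i) + *bar.get_unchecked(i + 1) };
    }
}

fn add_pairs_unchecked(foo: &mut [u32], bar: &[u32]) {
    for i in 0..foo.len().min(bar.len() - 1) {
        // Fully unchecked: both reads and the write skip the check.
        // Sound only because i is provably below both slice lengths here.
        unsafe {
            *foo.get_unchecked_mut(i) =
                *bar.get_unchecked(i) + *bar.get_unchecked(i + 1);
        }
    }
}

fn main() {
    let bar = [1u32, 2, 3, 4];

    let mut foo = [0u32; 3];
    add_pairs(&mut foo, &bar);
    assert_eq!(foo, [3, 5, 7]);

    let mut foo2 = [0u32; 3];
    add_pairs_unchecked(&mut foo2, &bar);
    assert_eq!(foo2, [3, 5, 7]);
}
```

Both versions compute the same result; only the second removes every bounds check from the inner loop.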
Indeed, using the existing commented-out loop based code in your repo and making the bounds check go away
loop {
    j = i + 1;
    loop {
        // I’ve only modified the bounds check here
        if unsafe { sums.get_unchecked(i) == sums.get_unchecked(j) } {
            valid = false;
        }
        j += 1;
        if !(valid && j < c) {
            break;
        }
    }
    i += 1;
    if !(valid && i < c - 1) {
        break;
    }
}
also makes it run about on par with C on my machine
Of course, this quadratic-time duplicate detection is suboptimal anyways, and also it’s sad that this dominates the entire run-time, so the whole other deeply nested loop with sums, etc, is all almost entirely irrelevant and we’re essentially only benchmarking a quadratic-time duplicate detection algorithm.
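If algorithmic changes are on the table, the O(c²) pairwise scan could be replaced by a hash-based or sort-based duplicate check. A minimal sketch (hypothetical helper functions, not code from the repo):

```rust
use std::collections::HashSet;

// Expected O(n) time: insert returns false on the first repeated value.
fn has_duplicate_hash(sums: &[u64]) -> bool {
    let mut seen = HashSet::with_capacity(sums.len());
    sums.iter().any(|&s| !seen.insert(s))
}

// O(n log n): sort a scratch copy and compare adjacent elements.
// Often faster in practice for small slices since it avoids hashing.
fn has_duplicate_sort(sums: &[u64]) -> bool {
    let mut scratch = sums.to_vec();
    scratch.sort_unstable();
    scratch.windows(2).any(|w| w[0] == w[1])
}

fn main() {
    assert!(!has_duplicate_hash(&[1, 2, 3]));
    assert!(has_duplicate_hash(&[1, 2, 1]));
    assert!(!has_duplicate_sort(&[4, 9, 7]));
    assert!(has_duplicate_sort(&[4, 9, 9]));
}
```

Whether this actually wins depends on the typical size of the slice being checked; for very small c the quadratic scan can still be competitive.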
Thanks A LOT for the many answers and tips. I have tried them all and here is the status:
re @robertknight's remark: I have indeed changed the code to a do..while, i.e. loop..break in Rust, to compare the exact same algorithm, and the bounds-checking remarks led me to the final extra critical improvement
the adjustments proposed by @mixal-iirec yield a few percent of perf, so while not negligible, not critical either
@steffahn the unsafe check on sums i and j is indeed critical
At this stage I'm on par with clang -O2 runtime.
In addition, I improved the sums[c] write so that the compiler does not perform the costly bounds check:
// without unsafe !
let bounded_c = cmp::min(c, sums.len());
sums[bounded_c] = key[c1] + key[c2] + key[c3] + key[c4] + key[c5];
This cuts the time by a factor of 3!
Cf. updated repo.
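Reduced to a standalone sketch (the fill function and identifiers here are hypothetical, not from the repo), the pattern relies on the array being allocated larger than any index actually used, so the clamp never changes the index in practice:

```rust
use std::cmp;

// The array is allocated one slot larger than any index actually used,
// so bounded_c == c in every real iteration; the cmp::min only exists
// to hand the optimizer a bound derived from sums.len() itself.
// Caveat: if c ever reached sums.len(), sums[bounded_c] would still panic,
// since min(c, len) can equal len, which is out of bounds.
fn fill(sums: &mut [u64], keys: &[u64]) {
    for (c, &k) in keys.iter().enumerate() {
        let bounded_c = cmp::min(c, sums.len());
        sums[bounded_c] = k * 2; // safe indexing, no unsafe needed
    }
}

fn main() {
    let keys = [1u64, 2, 3];
    let mut sums = [0u64; 4]; // deliberately one larger than needed
    fill(&mut sums, &keys);
    assert_eq!(sums, [2, 4, 6, 0]);
}
```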
Then I'm on par with gcc -O3, the best C solution!! Thanks to all: this collective effort cut the run time by 10x in 6 hours!!!
I'm impressed by both Rust perf AND the community's reactivity.
NOTE: strangely, I tried to apply the same trick to sums[i] == sums[j], but it did not improve the perf:
// 1 - test used in debug mode
// if sums[i] == sums[j] {
// 2 - candidate test for release mode - not efficient
// assert!(i < sums.len());
// assert!(j < sums.len());
// if sums[i] == sums[j] {
// 2bis - candidate test for release mode - not efficient
// let bounded_i = cmp::min(i, sums.len());
// let bounded_j = cmp::min(j, sums.len());
// if sums[bounded_i] == sums[bounded_j] {
// 3 - test used in release mode - very efficient
if unsafe { sums.get_unchecked(i) == sums.get_unchecked(j) } {
valid = false;
}
I do not want to nitpick beyond reason, but if solution 2bis worked it would be ideal: 1/ nothing unsafe and 2/ as fast as possible (which, usually, and here too, means as fast as gcc -O3).
As I hinted at above where I already presented the contains method: if the goal isn't 1-to-1 code “equivalence” (and if it is, then unsafe unchecked indexing is of course the way to match C, because that's what C does!), then the decent safe way to avoid the bounds check in Rust is to use iterators, at least for the innermost loop, like the .contains method does.
j = i + 1;
loop {
    // 3 - test used in release mode - very efficient
    if unsafe { sums.get_unchecked(i) == sums.get_unchecked(j) } {
        valid = false;
    }
    j += 1;
    if !(valid && j < c) {
        break;
    }
}

// safe, just as fast, use contains
// if sums[i+1..c].contains(&sums[i]) {
//     valid = false;
// }

// safe, also just as fast, use iterator
// let sums_i = sums[i];
// for &sums_j in &sums[i+1..c] {
//     if sums_i == sums_j {
//         valid = false;
//         break;
//     }
// }
I’m currently confused why the contains solution from before doesn’t perform as well as I remembered; I will need to investigate that a little.
I give up; this may be weird side-effects of benchmarking random compilation effects. Look at this madness:
valid = !(0..c - 1).any(|i| {
    sums[i]; // wtf, why does adding this line make a HUGE difference?
    sums[i + 1..c].contains(&sums[i])
});
The access to sums[i] at that point changes the times for me from 3.4s to 2.6s… this thing isn’t even in the innermost loop, the .contains call is the innermost loop.
Edit: It also depends on the usage of the let bounded_c = cmp::min(c, sums.len()) in the upper loop. Take that away, and suddenly the timing of the following variant changes as well:
for (i, s) in sums[..c - 1].iter().enumerate() {
    if sums[i + 1..c].contains(s) {
        valid = false;
        break;
    }
}
I tried your various proposals and updated the repo accordingly.
Now the perf table is:
compiler   opt_level           runtime   ratio vs. best
gcc        -O3                 1.54      1.0
rust       release v2 unsafe   1.56      1.01
rust       release v3 safe     2.2       1.43
clang      -O3                 2.27      1.47
clang      -O1                 2.3       1.49
go                             2.34      1.52
rust       release v4 safe     2.62      1.7
gcc        -O2                 2.9       1.88
rust       release v5 safe     2.89      1.88
gcc        -O1                 2.95      1.92
clang      -O2                 4.35      2.82
gcc                            10.1      6.56
clang                          11.53     7.49
rust       debug v1            16.61     10.79
Summary:
The result is very satisfactory:
- unsafe mode is as fast as it gets (on par with gcc -O3)
- the fastest safe mode is not far behind, in second position!
Besides, because the compiler behavior is not entirely clear (when chasing perf), several solutions of similar speed are available - maybe for future inspiration.
Thanks to all for participating in this exploration !
PS: I started looking into Rust last week and am very impressed by doc, tooling, perf, community !
Note that if bounded_c ends up being equal to sums.len() it will still be out of bounds. Maybe you wanted to use sums.len() - 1 ?
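To make the off-by-one concrete, a standalone sketch (not code from the repo): clamping with len() produces an index that is out of bounds exactly when the clamp kicks in, whereas len() - 1 yields a valid last index.

```rust
use std::cmp;

fn main() {
    let sums = [10u64, 20, 30];
    let c = 5; // deliberately larger than the array

    // min(c, sums.len()) == 3, which is still out of bounds:
    let bounded_c = cmp::min(c, sums.len());
    // sums[bounded_c] would panic here; .get() makes that visible:
    assert!(sums.get(bounded_c).is_none());

    // Clamping with len() - 1 instead yields the valid last index:
    let bounded_c2 = cmp::min(c, sums.len() - 1); // == 2
    assert_eq!(sums.get(bounded_c2), Some(&30));
}
```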
Your remark makes perfect sense, but:
in this algo the size of the array just needs to be large enough, so it is simpler to define it a bit larger at inception. The point is to avoid the bounds check.
more importantly, and REALLY SURPRISING: the runtime doubles if I write:
// slow
let bounded_c = cmp::min(c, sums.len() - 1);

// also slow - same thing if I define at inception:
let sums_max_idx = sums.len() - 1;
// then
let bounded_c = cmp::min(c, sums_max_idx);
So it seems the compiler skips the bounds check only when the index is capped by sums.len(), which is not even the proper limit!!!
This exploration yields another discovery - to me at least.
I wonder how to report this strange behavior to the Rust devs.
Maybe bounding an array index between 0 and len()-1 could be enough to remove bounds checks, even in safe mode...