Nim multithreading faster than Rust

There is no string formatting done in the algorithm, the results are numeric.
There are a few places where I display values, but they are inconsequential to the algorithm's performance. You can comment them out to confirm this for yourself.
The only necessary output is the number of twin primes and the last twin prime.

Maybe if you understood the algorithm you'd focus in on what I'm trying to get people to assess. Here's a link to my paper that describes the algorithm.

Again, my focus here is on doing Rust multi-threading faster, all other issues aside.

Your issue is not that the rust implementation is slow. Your issue is that the rust implementation is specifically slower than the nim implementation that you claim to be a 1:1 equivalent transliteration. However your benchmarks compare rust with safety guarantees against nim with no safety checks. The question is flawed. Please post benchmarks of the nim implementation compiled without the --d:danger flag so that it can be established that the rust implementation is actually slower. Otherwise, you're essentially asking whether my daughter is faster than my son while having them race with my son wearing a 50 pound backpack. It's a nonsensical question.

14 Likes

Whenever you use the indexing operation [], it performs a bounds check, making sure you don't touch data you shouldn't be dealing with, or outright segfault your program. If you're absolutely certain that none of your operations go out of bounds, use get_unchecked or get_unchecked_mut instead.

I have to chime in and say that I suspect it is not the case that the slowdown is because of Rust or Rayon.

I'm new to Rust and have written my first threaded code with Rayon over the Christmas break as an exercise. So I don't know any magic go faster Rust or Rayon tricks.

However, my code is a conversion of a very well optimized algorithm in C++, and I was amazed to find that it actually runs faster in my Rust translation. By about 20%! This is not a trivial thing either: it is a solution to Project Euler problem number 256: #256 Tatami-Free Rooms - Project Euler

All the code, original C++ and Rust is here: GitHub - ZiCog/tatami-rust: A solution to the Project Euler problem number 256 : Tatami-Free Rooms.

This has been part of an ongoing coding challenge, those guys are hard to beat. The current performance winners are shown in a chart here: Tatami Pi Fun. - Page 11 - Raspberry Pi Forums

The only thing that beats my Rust/Rayon solution just now is a new entry that uses hand crafted threading with the pthreads library directly rather than the C++ threads. It's not that much faster.

My conclusion is that Rust and Rayon perform very well.

As for readability or ease of writing threaded code in Rust vs Nim, I have no idea. Suffice it to say that it was not exactly hard in Rust using Rayon. I think people would find the threading parts of my Rust code quite minimal and easy to comprehend, even if they are not familiar with Rayon.

The main implementation difference between the Nim and Rust versions is I use some global variables in Nim while I have to do parameter passing in Rust. The majority of the algorithmic work is done in the proc twins_sieve, for both.

As a test, I did a serial (single core) version for both, and removed everything needed to do multi-threading, and compiled accordingly. The Nim version was still comparatively much faster.

This suggests to me there's something else that's fundamentally slower about the Rust implementation. Is it parameter passing? bounds checking? borrowing/ownership stuff? compiler flags|directives?

I don't philosophically know enough about Rust to understand which of these issues (or others I don't even know about) could affect performance, let alone specific things to do about them.

I'm giving Rust the benefit of doubt that it can be faster in this case, though I currently don't know how to do it. There wasn't that much difference between the performance of Nim and D versions, and it's perplexing that Rust is much slower.

So it does look like it's not just a threading implementation issue with Rayon, but something maybe more encompassing, which is actually more insidious (for someone trying to learn Rust). Rust is hard enough to learn as it is without resorting to (undocumented) wizardry to make it as fast as possible (again, from a newbie perspective).

1 Like

As I and others suspected then. It's not threading or rayon that is the problem here.

I'm a long time C/C++ user and new to Rust. Naturally, a few of my first Rust experiments were rewrites of little projects I had in C/C++. I wanted to see how Rust performed and how easy it was to get familiar things done before committing further.

So far all my Rust rewrites have achieved performance parity with the original C/C++ versions, and in some cases surpassed it a little. Mind you, it does depend on platform: what is faster in Rust on my PC can be faster in C on the ARM of a Raspberry Pi, for example, and vice versa.

Like you, one of the first hurdles I met was how to deal with those pesky global variables so beloved of C programmers. So far I have tried:

  1. Wrap all the globals into a struct, or perhaps group them into different structs depending on how they are used. Make all the C functions methods of that struct or structs. This works a treat. It means you might end up with a lot of 'self.x, self.y' in your source, but that is OK.

  2. Move all the globals into locals of main() and pass them down through your program as parameters.

On a suggestion here, in one example I started out with 1) but migrated some of the globals to 2) as a potential performance improvement. Which indeed it was.

To my surprise, either option performs well. Experiment shows that in some cases the struct method can be faster, in others passing parameters can be faster.

Either way, my Rust rewrites keep up with the old C and even beat it. See the "performance winners" link I gave above. Which is perhaps the most complex thing I have done in Rust, threads and all.

Ultimately you have to stand back and look at your program. See what data needs to be where and when. Then make it so in a Rust-friendly way, rather than trying to emulate globals.

2 Likes

By the way....

Rust uses LLVM as its compiler back end. What does Nim use?

I ask because in my comparisons of Rust to C I have found that building the C code with LLVM or GCC can make a significant performance difference. Especially on ARM.

For my case, the Nim source is converted to C, which then can be compiled as --cc:gcc or --cc:clang (to use GCC or Clang). I compile with GCC 9.2.0.

Thanks for your roadmap to consider.

I think I know where the issue is in twins_sieve. It's apparently very subtle, and I think has to do with how I'm passing one parameter to it. I'll rewrite it to isolate the code to confirm (or not) my hypothesis.

1 Like

One probable cause was explained to you earlier in the thread in this post: Nim multithreading faster than Rust - #22 by stevensonmt

Are you certain you have 1:1 mapping on runtime safety checks when running the benchmarks?

This is very likely the cause of the performance difference. GCC tends to outperform LLVM on a variety of benchmarks.

2 Likes

Hmm, that makes sense, as a fundamental difference that's outside of the source code functionality.

FYI, the Nim binary is also significantly smaller than Rust (Clang binary is smaller than GCC, but GCC is faster, on my system: GCC 9.2.0; Clang 8.0.1)

Rust binary: 2,803,080 bytes      stripped: 350,568 bytes
Nim  binary: 161,936 bytes        stripped: 129,552 bytes

It is very difficult to pin down the difference in speed, because the code has to be looked into thoroughly. Somebody might want to do it.

Anyhow, are you sure there are no data synchronization differences?

Once I came across a case where C# code seemed significantly faster than ported code to C++. But after investigations it was discovered that C# coder didn't do data synchronization properly, and the bug didn't hit most of the time but occasionally it corrupted the data. After placing proper synchronizations, speed was almost the same. I hope Nim implementation doesn't have this issue.

On my laptop, calling the Rust version with 10^11 takes approximately 6.5 seconds; replacing every index operation ([]) inside twins_sieve with get_unchecked or get_unchecked_mut saved about 1 second, resulting in 5.5 seconds. I don't have Nim installed, but to me it seems that this might be the source of the "poor" performance you are seeing?
Bounds checks can make it difficult for the compiler to optimise the code if it can't prove that they are not needed, possibly preventing it from using specialized instruction sets (like AVX or the newer SSE extensions).
Furthermore, you could try passing RUSTFLAGS="-C target-cpu=native" to the compiler (for me this didn't help on my laptop, but it's older and doesn't support any of the newer instruction sets, so I am not surprised).

2 Likes

Could you show an example of how to use get_unchecked inside twins_sieve, or better, post your modified version so I can try on my system?

playground

I changed the index operations to use get_unchecked and get_unchecked_mut, and also changed the ranges for the for loops from i..=j into i..j+1. These are equivalent as long as j is not usize::MAX (in which case the first one works as expected and the second one overflows), but the second one can be optimized better in some cases (I did not test the changes individually).

1 Like

Subtle indeed...

I have never been much into optimizing code. I gave up trying to beat compilers with hand crafted assembler a couple of decades ago. Generally fast enough is good enough. If I need speed I use a language that compiles to native code rather than something interpreted. That is already an order of magnitude improvement. If I have found an algorithm that is O(log n) rather than O(n), or whatever, that can be another couple of orders of magnitude improvement. Scratching around for another 10% becomes pointless.

Until this last year....

I found myself partaking in a few coding challenges, pitting various languages and programmer ninja skills against each other.

My conclusion is that such scratching around for 10 or 20 percent is very subtle.

In this modern world it is almost impossible to get a feel for how your code will perform. Processors have caches and pipelines and branch predictors and instruction reordering and parallel instruction dispatch etc, etc... To optimize code you have to keep all of those mechanisms happy.

Then compiler technology has moved on. Who knows how your code will be optimized by the compiler or why it refuses to optimize things.

The upshot of all the above is that seemingly inconsequential changes to one's code can have a dramatic effect on performance. At the +/- 10% level.

Worse still, a gain with one compiler on one architecture can be a loss with a different compiler on a different architecture.

You can end up chasing your tail trying to find an optimum.

Personally I don't think reaching for "unchecked" is satisfactory. I like my code to be safe. The trick is writing it in such a way that Rust can determine it is safe at compile time and reduce the number of runtime checks required.

3 Likes

Amazing! That did the trick. :grinning:

Running with your modified twins_sieve produces times that are within ~1% up to about 10^12 and within ~4% up to 10^13 (doing a quick test on a "noisy" system).

I compiled using:

RUSTFLAGS="-C opt-level=3 -C debuginfo=0 -C target-cpu=native" cargo build --release

and just

cargo build --release

From just minimal testing, there doesn't seem to be any statistically consistent speed difference, though the latter creates a tad smaller binary.

This actually makes me feel better, in that I was correct in thinking Rust could be faster AND I just didn't know how to do it. That's because the official documentation provides no good tutorials|examples to let you know of the existence of these techniques and how to apply them, and when|why.

I think the whole concept of what is unsafe needs to be presented differently, such as has been done by @cbiffle here - Learn Rust the Dangerous Way - Cliffle - and in this thread:

This topic needs to be presented less dogmatically. Using unsafe techniques doesn't have to mean your code is necessarily dangerous and will blow up. It should be presented as @cbiffle does: explain that it is available to use, but be careful, because the compiler won't automatically bail you out.

Thank you @raidwas for letting me know about this technique, and for examples of how to apply it. I think a lot of other people have learned something from this exercise too. (The only other question in my mind is: is there still some way to design|implement the code to get similar improved performance using only strictly safe techniques?)

I would really encourage a more objective discussion of these things in the docs, blogs, videos, etc. As in this case, this can really make a significant performance difference in numerical heavy applications, where the bounds of the problem are clear and defined, so the programmer can squeeze out the last possible microsecond of performance.

5 Likes

Ah, I see; I had thought the println!() shown in the code snippet in your original post was occurring inside of the algorithm itself.

It looks like the timers aren't stopped until after printing the "...threads done" line in each thread, so I would have thought they might affect the benchmarks. But you're right, I don't see an effect when commenting them out.

Just wanted to mention, am learning from this discussion.

5 Likes

I think it is worth mentioning that the fact that the "unsafe" methods get_unchecked() and get_unchecked_mut() helped establishes that you were not initially doing an apples-to-apples comparison with regard to safety checks.

2 Likes