Nim multithreading faster than Rust

I originally did a multi-threaded twin primes sieve algorithm in Nim, then translated it to Rust. They are functionally 1-to-1 equivalent, but the Rust version is significantly slower (even slower than an equivalent D version) because of how it does its multi-threading, using Rayon. Below are timing differences (in seconds).

input  |   Nim    |   Rust 
-------|----------|--------
10^10  |    0.406 |   0.563
-------|----------|--------
10^11  |    4.291 |   5.525
-------|----------|--------
10^12  |   47.343 |  54.785
-------|----------|--------
10^13  |  565.291 | 645.235

Here are links to the gist files for both versions.

Nim

Rust

I received lots or help from this forum to get the Rust version to this point, but the multi-threading in Nim is much simpler, and faster than for Rust. Below is threading code section for each.

Nim

  parallel:                        # perform in parallel
    for indx, r_hi in restwins:    # for each twin pair row index
      spawn twins_sieve(indx, r_hi.uint) # sieve selected twin pair restracks
      stdout.write("\r", (indx + 1), " of ", pairscnt, " threads done")
  sync()                           # when all the threads finish

Rust

  let (lastwins, cnts): (Vec<_>, Vec<_>) = {
    let counter = RelaxedCounter::new();
    restwins.par_iter()
       .map( |r_hi| {
          let out = twins_sieve(*r_hi, kmin, kmax, kb, start_num, end_num, modpg, s,
          &primes.to_vec(), &resinvrs);
          print!("\r{} of {} threads done", counter.increment(), pairscnt);
          out
       }).unzip()
  };

This is my first really serious Rust project, and I assume Rust must be able to do this algorithm faster than I have it so far. How can I make the multi-threading, et al, more performant to be closer to Nim? (Nim actually transcodes to C, which is then compiled.)

Compiled with (current) Nim 1.0.4, gcc 9.2.0, and Rust 1.40, on Linux 64-bit distro, I7 cpu, 2.6 - 3.5 GHz, 4 cores/8 threads, 16GB mem.

1 Like

Hm. I have

Topology: Quad Core model: Intel Core i7-7700HQ bits: 64 type: MT MCP arch: Kaby Lake rev: 9 L2 cache: 6144 KiB 
flags: avx avx2 lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx bogomips: 44798 
Speed: 1100 MHz min/max: 800/3800 MHz Core speeds (MHz): 1: 1100 2: 1100 3: 1100 4: 1100 5: 1100 6: 1100 7: 1100 
           8: 1100 

The system is a 64-bit Fedora 31 system.

When I run for 10^12 then the Rust version takes: total time = 60.218404173 secs
The nim version hangs forever:

threads = 8
each thread segment is [1 x 100352] bytes array
twinprime candidates = 49450550490; resgroups = 33300034
each 1485 threads has nextp[2 x 78492] array
setup time = 0.013546 secs
perform twinprimes ssoz sieve
1461 of 1485 threads done
3 Likes

Possibly something is messed up with nim's multithreading.
Well, I hate to be 'I-told-you-so' guy, but Rust's thread safety is one of the reasons I hope to never write C code again.

5 Likes

For Nim >= 1.0 compile as:

$ nim c --cc:gcc --d:danger --threads:on twinprimes_ssoz.nim

After version Nim 0.19.4 it changed it's threading model, and sometimes it hangs. Abort and rerun works on my system. You can also use Nim's choosenim installler and install Nim 0.19.4, which doesn't have the problem. It allows you to easily switch between Nim versions (like RVM, et al, for Ruby).

To compile with Nim 0.19.4 must do:

$ nim c --cc:gcc --d:release --threads:on twinprimes_ssoz.nim

Again, random thread contention problems start after 0.19.4.

That issues aside, Nim's threading (with gc for each thread) is much simpler and faster than at least Rayon's threading model. Otherwise, the rest of the Rust code is comparable (and maybe a smidgen faster?).

1 Like

Have you compared them single-threaded? i.e. compare, profile, and optimize with a plain Iterator first, and then flip to rayon par_iter().

edit: on my laptop, well over 99% is in your twins_seive function, where rayon is no longer involved.

2 Likes

Before I tried multi-threading in Rust I did single threading to establish the implementation details worked (to get correct answers). Once that passed the compiler the task then was to create a multi-threaded implementation. This threading implementation using Rayon works (gives correct answers) but was the result of forum input. I have no idea how to produce a faster (correct) implementation with my current limited knowledge of Rust threading mechanics details.

My point is that I don't think threading is the reason for your performance difference.

Even in release mode, Rust has some safety checks that may hamper performance in comparison. For example, bounds checking on array/slice/vec indexing can slow you down.

3 Likes

I should add that nim is: Nim Compiler Version 1.0.4 [Linux: amd64]

and I compiled with nim c --cc:gcc --d:danger --threads:on twinprimes_ssoz.nim

All the dynamic processing occurs within the proc|routine twin_sieve. I know the speed difference between the Nim (D) versions is in the Rust threading implementation used in the program. I was hoping if someone knew how to make it faster they would rewrite it to be faster.

If there are other possible speedups it would be nice to provide working examples I can run myself. There are loads of stuff I don't know about Rust, and showing real examples would be the best way for me|others to learn them.

1 Like

What makes you so sure that the threading implementation is the problem? My own perf report shows nothing of rayon in the profile.

I think something with the Nim version is wrong. Even when using 10^10 it happens that sometimes the program hangs, and sometimes it goes thru with 0.42 or so total time.

BTW: if -d:danger means that all checkings are disabled and Rust's boundary checkings cannot be disabled then the runtime comparison doesn't make much sense.

from NIM docs: Additional compilation switches

There is nothing wrong with the Nim version, it produces correct results, with the times shown. You can compile it with Nim 0.19.4 to eliminate random thread contentions.

The issue at hand is not the Nim version but the Rust implementation.
How do you make the Rust version faster??

1 Like

So run the Nim version without the -d:danger switch to get a fair comparison and report that difference. Or persist with unsafe, unchecked Nim code, but don't compare it to safe Rust.

As reported I have an issue when running the Nim version (compiled with nim 1.0.4) as it hangs always when trying 10^11, and it hangs sometimes when trying with 10^10

Follow the instructions at the link, to load choosenim, and use it to install 0.19.4.

Nim and Rust run fine on my system, and both produce identical numerical results.

Let's forget there even is a Nim version, and focus on making the Rust version faster.

When I compile using nim 0.19.4 then there are no hangs.

As I stated, after 0.19.4 Nim introduced a bug in its threading implementation.
I would suggest raising the issues you've encountered in the Nim forum so they can address it.

https://forum.nim-lang.org/

I have no issue to report because I am not a Nim user. I just used Nim 1.0.4 because you told in your starting post that you used that version.

Rust's string-formatting is, I believe, slower than one would expect. Could you please try saving the results in a data structure and only printing them (in both implementations) after the parallel threads have finished?