How to make this parallel prime sieve faster?

I don't think Vec<AtomicBool> would increase memory usage; AtomicBool is just a bool, only different instructions are generated for accessing it. It would probably not result in a speedup compared to serial access.

This algorithm in its current form might just not be suited for parallelizing. There might not be enough work being done in the inner loop to justify the overhead. I just tested a version using a raw *mut u8 C-style array and unsafe access, and it was a mere 10% faster than serial with 4 cores.

@jzakiya, do you have a parallel version of this algorithm in another language? That would be really interesting.

Can you try it with a bigger input, say let val = 4_000_000_000, to see the difference?

I have a C++ version using OpenMP to parallelize it, which gives the correct answer but also doesn't give the expected speedup (probably my fault). That code is circa 2014, and there's a newer version of OpenMP which I haven't tried yet.

Today I used the rayon crate and tried to write the inner loop to conform to it, as shown.

    for i in 0..maxpcs {
      r += 1; if r > rescnt {r = 1; modk += md; k += 1;};
      if  prms[i] == 1 {continue;}
      let prm_r = residues[r];
      let prime = modk + prm_r;
      if  prime > sqrt_n {break;}
      let prmstep = prime * rescnt;
      residues[1..rescnt+1].par_iter()
         .map(|ri| (k*(prime + ri) + (prm_r*ri)/md)*rescnt + posn[(prm_r*ri) % md])
         .map(|j| while j < maxpcs {prms[j] = 1; j += prmstep;})
    }

The compiler complains, but it seems like it should be serviceable.

  Compiling sozP7par3 v0.1.0 (file:///home/jzakiya/projects/sozP7par3)
src/main.rs:45:7: 47:65 error: mismatched types:
 expected `()`,
    found `rayon::par_iter::map::Map<rayon::par_iter::map::Map<rayon::par_iter::slice::SliceIter<'_, usize>, [closure@src/main.rs:46:15: 46:83 k:_, prime:_, prm_r:_, md:_, rescnt:_, posn:_]>, [closure@src/main.rs:47:15: 47:64 maxpcs:_, prms:_, prmstep:_]>`
(expected (),
    found struct `rayon::par_iter::map::Map`) [E0308]
src/main.rs:45       residues[1..rescnt+1].par_iter()
src/main.rs:46          .map(|ri| (k*(prime + ri) + (prm_r*ri)/md)*rescnt + posn[(prm_r*ri) % md])
src/main.rs:47          .map(|j| while j < maxpcs {prms[j] = 1; j += prmstep;})
src/main.rs:45:7: 47:65 help: run `rustc --explain E0308` to see a detailed explanation
error: aborting due to previous error

Just like regular iterators, rayon's map doesn't actually do anything until you consume it. Try for_each instead for the last bit.
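To see the laziness concretely, here's a tiny demonstration with std iterators (rayon's adaptors behave the same way: map only builds an adaptor, and nothing runs until something like for_each drives it):

```rust
fn main() {
    let v = vec![1, 2, 3];

    // This computes nothing -- the Map adaptor is merely constructed.
    // (The compiler even warns about an unused lazy iterator.)
    let _lazy = v.iter().map(|x| x * 2);

    // for_each consumes the iterator, so the closures actually run.
    let mut doubled = Vec::new();
    v.iter().map(|x| x * 2).for_each(|y| doubled.push(y));
    assert_eq!(doubled, vec![2, 4, 6]);
}
```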

But I think it will still fail to compile. Writing directly to prms makes the closure FnMut, and rayon needs Fn to be able to share it among threads.

This is why I suggested dividing the work by mutable chunks. Then there is no mutable aliasing, and the compiler will know that.
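Here's a minimal sketch of the mutable-chunks idea, using only std's chunks_mut and scoped threads to stand in for rayon's par_chunks_mut (the base primes, chunk size, and sieve bound below are illustrative assumptions, not part of the original program):

```rust
use std::thread;

fn main() {
    let n = 1_000;
    let mut sieve = vec![false; n]; // true = composite
    let chunk_len = 256;            // assumed chunk size, for illustration
    let primes = [2usize, 3, 5, 7]; // small primes found beforehand

    // Each thread gets exclusive &mut access to its own chunk, so there
    // is no mutable aliasing and the compiler can verify that statically.
    thread::scope(|s| {
        for (ci, chunk) in sieve.chunks_mut(chunk_len).enumerate() {
            s.spawn(move || {
                let base = ci * chunk_len; // global index of chunk[0]
                for &p in &primes {
                    // First multiple of p inside this chunk, never below p*p.
                    let mut m = std::cmp::max(p * p, (base + p - 1) / p * p);
                    while m < base + chunk.len() {
                        chunk[m - base] = true;
                        m += p;
                    }
                }
            });
        }
    });

    let unmarked = sieve.iter().take(121).filter(|&&c| !c).count();
    println!("unmarked slots below 121: {}", unmarked);
}
```

rayon's par_chunks_mut does the same splitting, but hands the chunks to its thread pool instead of spawning one thread per chunk.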

I don't think Vec<AtomicBool> would increase memory usage; AtomicBool is just a bool, only different instructions are generated for accessing it.

It's a usize. I don't know why that is, though.

Then you're right, of course. I assume it's because of alignment requirements for atomic access...
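For what it's worth, this is easy to check on your own toolchain. (Note: on current Rust, AtomicBool is a single byte, the same size as bool; the usize backing discussed here is from the older toolchains of that era.)

```rust
use std::mem::size_of;
use std::sync::atomic::AtomicBool;

fn main() {
    // On current Rust these both print 1; older versions backed
    // AtomicBool with a usize, which is what the quoted post saw.
    println!("bool:       {} byte(s)", size_of::<bool>());
    println!("AtomicBool: {} byte(s)", size_of::<AtomicBool>());
}
```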

I wrote my own version from scratch to demonstrate parallel chunks. I only sieved odds to keep it simple, rather than using a higher modulus and residue like yours, but it should be possible to do more. Call it an exercise for the reader.

My results:

  serial: 5218023786 ns
parallel: 741314228 ns
-> 7.04x speedup

This is on an i7-4600U -- dual-core with hyperthreading, so only 4 logical CPUs. Bonus points if you figure out how this achieved a super-linear speedup!

(Note that this does require current rayon.git for the as-yet unpublished par_chunks_mut.)


Very nice!

I guess that if you've sieved out more candidates before, any given candidate will take less time than it would otherwise?

I tried to run it on my system (32-bit Linux with Rust 1.6) and it threw an error saying there is no method par_chunks_mut. This is with rayon 0.3.1. How do I get/install the right version?

In your Cargo.toml:
Instead of

[dependencies]
rayon = "[the version you've put here]"

You should say:

[dependencies]
rayon = { git = "https://github.com/nikomatsakis/rayon" }

And then execute: cargo update

I've submitted mine as a demo in rayon#68.

The apparent super-linear speedup came from comparing apples and oranges. I added an intermediate implementation which processes chunks in serial, so its execution is very similar to the parallel version, and that claims most of the speedup.

  serial: 4579387556 ns
  chunks: 1381673098 ns -> 3.31x speedup
parallel:  715523031 ns -> 1.93x speedup

The chunks are so beneficial because they can fit entirely in the cache as you're jumping around clearing primes. Then moving to the next chunk loads a new working set into the cache. If the chunks are small enough that multiple threads can have their working set in cache simultaneously, then this is good for parallelism too.
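As a rough sketch of that chunked-serial shape (odds-only packing omitted for clarity; the chunk size below is an assumed tuning knob, not a measured optimum):

```rust
// Assumed chunk size; tune so one chunk fits comfortably in L1/L2 cache.
const CHUNK: usize = 1 << 15;

fn sieve_chunked(n: usize) -> usize {
    let limit = (n as f64).sqrt() as usize;

    // Small serial sieve to collect the base primes up to sqrt(n).
    let mut small = vec![true; limit + 1];
    let mut base = Vec::new();
    for p in 2..=limit {
        if small[p] {
            base.push(p);
            let mut m = p * p;
            while m <= limit {
                small[m] = false;
                m += p;
            }
        }
    }

    let mut composite = vec![false; n];
    // Clear multiples one cache-sized chunk at a time, instead of streaming
    // each prime across the whole array: the chunk stays resident in cache
    // while we jump around inside it.
    for start in (0..n).step_by(CHUNK) {
        let end = (start + CHUNK).min(n);
        for &p in &base {
            let mut m = std::cmp::max(p * p, (start + p - 1) / p * p);
            while m < end {
                composite[m] = true;
                m += p;
            }
        }
    }
    (2..n).filter(|&i| !composite[i]).count()
}

fn main() {
    println!("primes below 1_000_000: {}", sieve_chunked(1_000_000));
}
```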
