How to make this parallel prime sieve faster?

I wrote a sequential prime sieve that works, and then (after finally getting it to compile) a version that runs the inner loop in parallel using the crossbeam crate. However, the parallel version is slower, so I would like some help on how (if possible) to make it faster.

Here is its code.

extern crate crossbeam;

fn main() {
    let residues = [1, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67,
                   71, 73, 79, 83, 89, 97,101,103,107,109,113,121,127,131,137,139,
                  143,149,151,157,163,167,169,173,179,181,187,191,193,197,199,209,211];
    let val = 2_000_000_000;
    let md = 210;
    let rescnt = 48;

    println!("val = {}, mod = {}, rescnt = {}",val,md,rescnt);

    let mut posn = [0; 210];
    for i in 1..rescnt {posn[residues[i]] = i-1;}; posn[1] = rescnt-1;

    let mut modk; let mut r; let mut k;

    let num = val-1 | 1;           // if value even number then subtract 1
    k = num/md; modk = md*k; r=1;  // kth residue group, base num value
    while num >= modk+residues[r] {r += 1;} // find last pc position <= num
    let maxpcs  = k*rescnt + r-1; // maximum number of prime candidates

    let mut prms: Vec<u8> = vec![0; maxpcs];
    
    println!("num = {}, k = {}, modk = {}, maxpcs = {}",num,k,modk,maxpcs);

    let sqrt_n = (num as f32).sqrt() as usize;
 
    modk=0; r=0; k=0;

    // sieve to eliminate nonprimes from primes prms array
    for i in 0..maxpcs {
      r += 1; if r > rescnt {r = 1; modk += md; k += 1;};
      if  prms[i] == 1 {continue;}
      let prm_r = residues[r];
      let prime = modk + prm_r;
      if  prime > sqrt_n {break;}
      let prmstep = prime * rescnt;
      for ri in &residues[1..rescnt+1] {
        let prms = &mut prms;
        crossbeam::scope(|scope| { 
           scope.spawn(move || {        
              let prod = prm_r * ri;
              let mut j = (k*(prime + ri) + (prod-2)/md)*rescnt + posn[prod % md];
              while j < maxpcs {prms[j] = 1; j += prmstep;}
          });
        });
      }
    }
    // the prms array now has all the positions for primes r1..N
    // extract prime numbers and count from prms into prims array
    let mut prmcnt = 4;
    modk=0; r=0;
    for i in 0..maxpcs {
      r += 1; if r > rescnt {r = 1; modk += md;};
      if prms[i] == 0 {prmcnt += 1;}
    }
    println!("{}",prmcnt);
}

It's slower because it's not parallel. crossbeam::scope is used to define the scope within which spawned threads will run. In other words, you're creating a scope, spawning a thread, and then blocking until that thread finishes.

You need to put crossbeam::scope outside of at least one loop for it to do any good.
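
For illustration, here's the difference with std's scoped threads (an untested sketch; crossbeam's scope behaves the same way, and the sleeps stand in for real work):

use std::thread;
use std::time::Duration;

fn main() {
    // A scope per spawn: each scope blocks until its thread finishes,
    // so the two sleeps run one after the other (about 2 s total).
    for _ in 0..2 {
        thread::scope(|s| {
            s.spawn(|| thread::sleep(Duration::from_secs(1)));
        });
    }

    // One scope around all the spawns: the threads run concurrently (about 1 s total).
    thread::scope(|s| {
        for _ in 0..2 {
            s.spawn(|| thread::sleep(Duration::from_secs(1)));
        }
    });
}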

I don't think that's it. What I want is to run the work for each ri value inside the for loop in parallel, each in a separate thread. The while j < maxpcs {...} loop is inside the thread scope and is being done in parallel. When I run htop I can see the threads running in parallel. If you can show me code, I can understand the point better and try it.

I agree with @DanielKeep that in your code at most one thread is running and at most two threads exist at any moment. For me, htop shows 100% CPU usage, which is spread across different cores.

But I am also very curious about how this can be parallelized in Rust. The difficulty is that you want to access the prms vector concurrently from different threads without synchronization. It should be safe if you use disjoint sets of indices, but I think that you can't prove to Rust that they are indeed disjoint, so you'll have to use some unsafe code here.

The whole point of crossbeam::scope is that threads started within a scope are finished before the scope ends, which means that the scope has to wait for the thread. Not sure how exactly you determined that it's running in parallel, but that's not going to happen with this example.

@jzakiya Indeed, you block the loop every time you create a crossbeam::scope.
The overhead comes from doing the same work as the sequential version while also spawning threads and waiting for them to finish.

The 100% CPU usage also suggests that only one thread is working at a time. For example, on a system with 2 cores and hyperthreading, full load would look like 2 (cores) * 2 (hyperthreads) * 100% = 400%; that would mean every logical core is loaded.

@jzakiya Please consider using something like rayon.
I think you might be interested in par_iter and the weight method from rayon.

OK, I get the point now about crossbeam blocking further thread execution until the current one is finished.

However, I can share prms among all threads without ever worrying about the same memory location being written to at the same time. All the write addresses to prms are independent within the for loop so there is never that type of race condition.

So what I want to do is have each thread use/borrow (?) prms without restricting its simultaneous use in other threads. Is this considered unsafe in Rust, but can I do this somehow anyway?

This is not considered unsafe in Rust. What you might be interested in is the slice method split_at_mut. You can take &mut references into different, non-overlapping parts of a slice; Rust is perfectly fine with that.
Again, take a look at rayon: on GitHub there is a quick_sort example that mutates a slice in parallel. You don't have to do it recursively the way the example does, but it's an option.
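
A minimal sketch of the split_at_mut idea, using std's scoped threads here (crossbeam's scoped threads work the same way):

use std::thread;

fn main() {
    let mut data = vec![0u8; 10];
    // split_at_mut yields two non-overlapping &mut slices,
    // so each thread gets exclusive access to its own half.
    let (left, right) = data.split_at_mut(5);
    thread::scope(|s| {
        s.spawn(|| { for x in left.iter_mut() { *x = 1; } });
        s.spawn(|| { for x in right.iter_mut() { *x = 2; } });
    });
    println!("{:?}", data); // [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
}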

Thank you for the suggested resources. I am studying them now.

Conceptually, I would like prms to be treated as a global variable which can be accessed at any time by any thread. I want the threads to be able to asynchronously update the memory addresses they compute (which can never collide) in parallel, so that whenever I start a thread it puts a '1' into the addresses it computes as soon as it can.

Questions:

  1. Can I create/treat prms as a global variable in Rust?
  2. Can I then use it in any thread without blocking its parallel use in other threads?
  3. Can threads be run asynchronously in Rust in this manner?

From what I have read, Rust makes it very hard for more than one user at a time to use/borrow a resource (and forces synchronized access to it), but that's what I need/want to do to run this algorithm truly in parallel. Is my thinking wrong on this?

That sounds like some kind of god object or the Singleton pattern, which is a very bad idea. You want to minimize global state: the more global state you have (and the more code that accesses it), the harder it gets to debug and to secure access to those resources (you can secure it with locks in the case of threads, but that will kill performance or even deadlock if you are not careful). Imagine that anybody can modify that code: when you hit a bug (and you will), you won't know where to start or what happened, because so many parts of the code could have changed the state of that global object.

Again, Rust is perfectly fine with you having mutable references (&mut var) into different locations of a slice ("array"), as long as each is the only reference to its location for that lifetime, or with having many shared references (&var) into any location, as long as the lifetime rules are followed (for example, the references must not outlive the resource they refer to). I might have skipped something, but this is basically it. In any case, there is the book, which explains references better than I can.

That's why that rayon example of quick_sort can sort a slice in parallel without any locking mechanism...

What you have probably read is that Rust won't allow you to have one &mut reference and any other reference to the same location at the same time, because this is risky and one of the main sources of bugs.

As long as you have only one &mut reference into a given location in prms, Rust will be perfectly fine with that.
Only if you try to create another reference to that location while the &mut is live will Rust refuse, to save you from horrible bugs.
If you want to have multiple &mut references into the same location, then there is no way around using a locking mechanism.
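
A tiny example of that rule (the E0499 line is commented out so this compiles):

fn main() {
    let mut v = vec![0u8; 4];
    let a = &mut v[0];
    // let b = &mut v[1]; // error[E0499]: cannot borrow `v` as mutable more than once at a time
    *a = 1;
    let (left, right) = v.split_at_mut(2); // fine: the two halves are provably disjoint
    left[0] = 1;
    right[1] = 2;
    println!("{:?}", v);
}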

let something: &'static [type_here] = global_stuff;
Or you could use const or static.
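
For the read-only tables (like residues) that is straightforward; a tiny illustration with a shortened table:

// A `static` lives for the whole program and can be read from any thread.
static RESIDUES: [usize; 4] = [1, 11, 13, 17];

fn main() {
    println!("{}", RESIDUES[2]);
}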

Thanks again for your ideas.

From reading the online Rust Book, can I use a raw pointer to prms and share that across asynchronous threads to mimic using it as a global resource, or is that no better than sharing it as a &mut var?

I'm still looking at the Rayon docs, but haven't had the quiet time yet to really study and play with it. I'll start coding implementations with it this week after I do.

Try to avoid unsafe code if you don't really need it.
A raw pointer is definitely worse than &mut var: with unsafe code you are basically disabling what's good about Rust for no good reason. You should use unsafe only if there is no other way and you know very well what you are doing (i.e., you can reason about the issues that can appear).

Well, I finally got a version using threads, etc. to work (it gives the correct answers), but it's 2.5 times slower than the version using crossbeam. I'm resigning myself to the idea that Rust currently may not be able to make a parallel version faster than the serial/sequential version, unless someone can show me otherwise.

use std::thread;
use std::sync::{Arc, Mutex};

fn main() {
    let residues = [1, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67,
                   71, 73, 79, 83, 89, 97,101,103,107,109,113,121,127,131,137,139,
                  143,149,151,157,163,167,169,173,179,181,187,191,193,197,199,209,211];

    let val= 1_000_000_000;
    let md = 210;
    let rescnt = 48;

    println!("val = {}, mod = {}, rescnt = {}",val,md,rescnt);

    let mut posn = [0; 210];
    for i in 0..rescnt {posn[residues[i]] = i-1;};

    let mut modk; let mut r; let mut k;

    let num = val-1 | 1;           // if value even number then subtract 1
    k = num/md; modk = md*k; r=1;  // kth residue group, base num value
    while num >= modk+residues[r] {r += 1;} // find last pc position <= num
    let maxpcs  = k*rescnt + r-1; // maximum number of prime candidates
    //let prms: Vec<u8> = vec![0; maxpcs];
    let prms = Arc::new(Mutex::new(vec![0; maxpcs]));  // I can't figure out how to use the byte vector here

    println!("num = {}, k = {}, modk = {}, maxpcs = {}",num,k,modk,maxpcs);

    let sqrt_n = (num as f32).sqrt() as usize;
    modk=0; r=0; k=0;

    // sieve to eliminate nonprimes from small primes prms array
    for i in 0..maxpcs {
      r += 1; if r > rescnt {r = 1; modk += md; k += 1;};
      { //I have to do this in order to index into prms
        let prms = prms.lock().unwrap();
        if  prms[i] == 1 {continue;}
      }
      let prm_r = residues[r];
      let prime = modk + prm_r;
      if  prime > sqrt_n {break;}
      let prmstep = prime * rescnt;
      let mut handles = vec![];
      for i in 1..rescnt+1 {
          let ri = residues[i];
          let prms = prms.clone();
          handles.push(thread::spawn(move || {
            let mut prms = prms.lock().unwrap();
            let prod = prm_r * ri;
            let mut nonprm = (k*(prime + ri) + prod/md)*rescnt + posn[prod % md];
            while nonprm < maxpcs {prms[nonprm] = 1; nonprm += prmstep;};
          }));
      }
      for handle in handles {let _ = handle.join();}
      //for handle in handles {handle.join().map_err(|_| "Could not join a thread!").unwrap();}
    }
    // the prms array now has all the positions for primes r1..N
    // extract prime numbers and count from prms into prims array
    let mut prmcnt = 4;

    for i in 0..maxpcs {
      {
        let prms = prms.lock().unwrap();
        if prms[i] == 0 {prmcnt += 1;}
      }
    }
    println!("{}",prmcnt);
}

It's not that "Rust may not be able to perform a parallel version faster"; the code uses locks excessively.
Also, vec![0; maxpcs] is equivalent to vec![0i32; maxpcs] (unconstrained integer literals default to i32).
You want to write vec![0u8; maxpcs].
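
In the Arc/Mutex version, that would be:

    let prms = Arc::new(Mutex::new(vec![0u8; maxpcs]));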

Also, you were saying before that there will always be only one reference to a single element in the array.

The inner while loop disproves that assumption: even though there are no references involved, nonprm still goes over most of the slice...

This version isn't parallel either: each thread immediately grabs the lock upon spawning and keeps it locked until it's done.

In any case, using locks here will be the slowest version of all, since the threads don't actually do much that doesn't require the lock held.

You misunderstood what I said. All the addresses computed and written to are distinct/unique, so there can never be a write collision on the same address.

I think you could better parallelize this by splitting the writable parts into chunks. Compute all the small primes first, in serial if need be, since there are far fewer of them (you only need primes up to sqrt(n)). Then chunk up the large part of prms and sieve each chunk in parallel. Roughly:

use rayon::prelude::*;
/* first compute the small primes, then: */
{
    let (small_prms, large_prms) = prms.split_at_mut(/* index of small/large boundary */);
    let small_prms = &*small_prms; /* drop to immutable `&` so it may be shared */
    large_prms.par_chunks_mut(/* tunable chunk size */)
        .weight_max() /* ensure every single chunk is a separate rayon job */
        .enumerate() /* so you can figure out where in prms this chunk came from */
        .for_each(|(i, chunk)| {
            /* sieve this chunk with small_prms. (this runs in parallel - no locking required!) */
        });
}
/* now bask in your completed prms! */
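
For reference, a runnable toy version of the chunking idea (this targets current rayon, where weight_max no longer exists and splitting is automatic; the prime marking is replaced by a trivial "multiples of 3" stand-in):

use rayon::prelude::*;

fn main() {
    let mut flags = vec![0u8; 20];
    let chunk_size = 8;
    flags.par_chunks_mut(chunk_size)
        .enumerate()
        .for_each(|(i, chunk)| {
            let base = i * chunk_size; // global index of chunk[0] within flags
            for (j, f) in chunk.iter_mut().enumerate() {
                if (base + j) % 3 == 0 { *f = 1; } // stand-in for crossing off multiples
            }
        });
    println!("{:?}", flags);
}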

edit: I think this is basically what @matklad hinted at, using "disjoint sets of indices".


More or less so. There are different ways to split an array into disjoint parts. The only way naturally expressible in Rust is (items with index less than x, items with index greater than x). Things like (odd-numbered elements, even-numbered elements) are, to my understanding, impossible. You can, of course, use unsafe code for such cases (but I was unable to come up with a correct combination of UnsafeCells, *mut T and unsafe impl Send for this particular use case; I would really appreciate an example of mutably sharing an array between two threads, while writing to odd locations from one thread and to even locations from another).

But I think that in general it's much better to restructure a problem to fit a divide-and-conquer framework, rather than inventing some mathematically-proven-correct unsafe indexing scheme.

Even if you could safely make parallel interleaved mutations, that's a pretty nasty scenario for the CPU caches -- likely to trigger false sharing.
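
For what it's worth, a sketch of what such interleaved sharing could look like with a raw-pointer wrapper (SharedPtr is a made-up name, the soundness of this pattern is exactly what's up for debate, and the adjacent writes illustrate the false-sharing concern rather than avoid it):

use std::thread;

// Made-up wrapper: we promise the compiler the raw pointer may be shared,
// on the grounds that the two threads only ever write disjoint (even vs. odd) indices.
struct SharedPtr(*mut u8);
unsafe impl Sync for SharedPtr {}

fn main() {
    let mut data = vec![0u8; 16];
    let len = data.len();
    let ptr = SharedPtr(data.as_mut_ptr());

    thread::scope(|s| {
        s.spawn(|| {
            let mut i = 0; // even indices only
            while i < len {
                unsafe { *ptr.0.add(i) = 2; }
                i += 2;
            }
        });
        s.spawn(|| {
            let mut i = 1; // odd indices only
            while i < len {
                unsafe { *ptr.0.add(i) = 1; }
                i += 2;
            }
        });
    });

    println!("{:?}", data); // [2, 1, 2, 1, ...]
}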


Looking at how you're using it, rather than Mutex<Vec<u8>> you could use Vec<Mutex<u8>> or even Vec<AtomicBool> or something... though any of those options will increase your memory usage quite substantially. If that's a problem you could try some kind of atomic-ish bitvec that makes better use of every bit of the atomic variable, maybe.

use std::mem::size_of;
use std::sync::atomic::{AtomicUsize, Ordering};

pub struct BitVec {
    storage: Vec<AtomicUsize>
}
impl BitVec {
    pub fn with_bits(bits: usize, init: bool) -> BitVec {
        let size = div_round_up(bits, val_size());
        let mut vec = Vec::with_capacity(size);
        for _ in 0..size {
            vec.push(AtomicUsize::new(if init { !0 } else { 0 }));
        }
        BitVec {
            storage: vec
        }
    }
    pub fn set(&self, bit: usize, to: bool) {
        let idx = bit / val_size();
        let val = &self.storage[idx];
        let bit_mask = 1 << (bit % val_size());
        if to {
            val.fetch_or(bit_mask, Ordering::Relaxed);
        } else {
            val.fetch_and(!bit_mask, Ordering::Relaxed);
        }
    }
    pub fn get(&self, bit: usize) -> bool {
        let idx = bit / val_size();
        let val = &self.storage[idx];
        let bit_mask = 1 << (bit % val_size());
        val.load(Ordering::Relaxed) & bit_mask == bit_mask
    }
}
#[inline] fn val_size() -> usize { size_of::<AtomicUsize>() * 8 }
#[inline] fn div_round_up(lhs: usize, rhs: usize) -> usize { (lhs + (rhs - 1)) / rhs }
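
Since set and get take &self and use atomics, such a BitVec is Sync and can be shared by scoped threads (or behind an Arc) with no locking; a hypothetical use would look like:

fn main() {
    let sieve = BitVec::with_bits(100, false);
    sieve.set(42, true); // safe from any thread; colliding sets of the same bit are harmless
    assert!(sieve.get(42));
    assert!(!sieve.get(41));
}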