# How to make this parallel prime sieve faster?

#1

I wrote a working sequential prime sieve, and then (after some struggle) got a version working that does the inner loop in parallel using the crate `crossbeam`. However, the parallel version is slower. So I would like some help on how (if possible) to make the parallel version faster.

Here is its code.

``````
extern crate crossbeam;

fn main() {
    let residues = [  1, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67,
                     71, 73, 79, 83, 89, 97,101,103,107,109,113,121,127,131,137,139,
                    143,149,151,157,163,167,169,173,179,181,187,191,193,197,199,209,211];
    let val = 2_000_000_000;
    let md = 210;
    let rescnt = 48;

    println!("val = {}, mod = {}, rescnt = {}", val, md, rescnt);

    let mut posn = [0; 210];
    for i in 1..rescnt {posn[residues[i]] = i-1;};
    posn[1] = rescnt-1;

    let mut modk; let mut r; let mut k;

    let num = val-1 | 1;            // if value even number then subtract 1
    k = num/md; modk = md*k; r = 1; // kth residue group, base num value
    while num >= modk+residues[r] {r += 1;} // find last pc position <= num
    let maxpcs = k*rescnt + r-1;    // maximum number of prime candidates

    let mut prms: Vec<u8> = vec![0; maxpcs];

    println!("num = {}, k = {}, modk = {}, maxpcs = {}", num, k, modk, maxpcs);

    let sqrt_n = (num as f32).sqrt() as usize;

    modk = 0; r = 0; k = 0;

    // sieve to eliminate nonprimes from prms array
    for i in 0..maxpcs {
        r += 1; if r > rescnt {r = 1; modk += md; k += 1;};
        if prms[i] == 1 {continue;}
        let prm_r = residues[r];
        let prime = modk + prm_r;
        if prime > sqrt_n {break;}
        let prmstep = prime * rescnt;
        for ri in &residues[1..rescnt+1] {
            let prms = &mut prms;
            crossbeam::scope(|scope| {
                scope.spawn(move || {
                    let prod = prm_r * ri;
                    let mut j = (k*(prime + ri) + (prod-2)/md)*rescnt + posn[prod % md];
                    while j < maxpcs {prms[j] = 1; j += prmstep;}
                });
            });
        }
    }
    // the prms array now has all the positions for primes r1..N
    // extract prime numbers and count from prms into prims array
    let mut prmcnt = 4;
    modk = 0; r = 0;
    for i in 0..maxpcs {
        r += 1; if r > rescnt {r = 1; modk += md;};
        if prms[i] == 0 {prmcnt += 1;}
    }
    println!("{}", prmcnt);
}
``````
1 Like

#2

Itâs slower because itâs not parallel. `crossbeam::scope` is used to define the scope within which spawned threads will run. In other words, youâre creating a scope, spawning and thread, and then blocking until that thread finishes.

You need to put `crossbeam::scope` outside of at least one loop for it to do any good.
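As an illustrative sketch of what "scope outside the loop" means (using `std::thread::scope`, which behaves like `crossbeam::scope` but needs no extra crate; `fill_halves` and the slice halves are made-up stand-ins for disjoint parts of `prms`):

```rust
use std::thread;

// Fill two disjoint halves of a buffer from two threads.
// A single scope surrounds *all* the spawns, so the threads actually
// run concurrently; the scope only joins them when it ends.
fn fill_halves(data: &mut [u8]) {
    let mid = data.len() / 2;
    let (left, right) = data.split_at_mut(mid);
    thread::scope(|s| {
        s.spawn(|| left.iter_mut().for_each(|x| *x = 1));
        s.spawn(|| right.iter_mut().for_each(|x| *x = 2));
    }); // both threads are joined here, once, not after every spawn
}

fn main() {
    let mut v = vec![0u8; 6];
    fill_halves(&mut v);
    println!("{:?}", v); // [1, 1, 1, 2, 2, 2]
}
```

Compare with a scope created inside the loop body, which joins each thread before the next one is spawned, serializing everything.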

0 Likes

#3

I donât think thatâs it. What I want is to do the process for each `ri` value inside the `for` loop in parallel in a separate thread. The `while j < maxpcs {...}` loop is inside the thread scope and are being done in parallel. When I run `htop` I can see the threads running in parallel. If you can show me code I can understand the point better and try it.

0 Likes

#4

I agree with @DanielKeep that in your code at most one thread is running and at most two threads exist at any moment. For me, `htop` shows 100% CPU usage, which is spread across different cores.

But I am also very curious about how this can be parallelized in Rust. The difficulty is that you want to access the `prms` vector concurrently from different threads without synchronization. It should be safe if you use disjoint sets of indices, but I think that you can't prove to Rust that they are indeed disjoint, so you'll have to use some unsafe code here.

0 Likes

#5

The whole point of `crossbeam::scope` is that threads started within a scope finish before the scope ends, which means that the scope has to wait for the thread. I'm not sure how exactly you determined it was running in parallel, but it's not going to happen with this example.

0 Likes

#6

@jzakiya Indeed, you block the loop every time you create a `crossbeam::scope`.
The overhead comes from doing the same job as the sequential version while also spawning threads and waiting for them to finish.

That 100% figure also shows that only one thread is working at a time. For example, on a system with 2 cores and hyperthreading, full utilization would read 2 (cores) * 2 (virtual cores) * 100% = 400%; that would mean every core is loaded.

@jzakiya Please consider using something like rayon.
I think you might be interested in `par_iter` and the `weight` method from rayon.

0 Likes

#7

OK, I get the point now about `crossbeam` blocking further thread execution until the current one is finished.

However, I can share `prms` among all threads without ever worrying about the same memory location being written to at the same time. All the write addresses to `prms` are independent within the `for` loop so there is never that type of race condition.

So what I want to do is have each thread use/borrow (?) `prms` without restricting its simultaneous use in other threads. Is this considered `unsafe` in Rust, but can I do this somehow anyway?

0 Likes

#8

This is not considered unsafe in Rust. What you might be interested in is the slice's split_at_mut. Also, you can take `&mut` references into different locations of a slice; Rust is perfectly fine with that.
Again, take a look at rayon on GitHub; there is the quick_sort example, which mutates a slice in parallel. You don't have to do it recursively the way the example does, but it's an option.
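For illustration, a minimal (single-threaded) sketch of `split_at_mut` handing out two disjoint `&mut` views of one vector:

```rust
fn main() {
    let mut prms = vec![0u8; 6];
    // Two non-overlapping mutable views into the same Vec;
    // each could be handed to a different thread without any locks.
    let (lo, hi) = prms.split_at_mut(3);
    lo[0] = 1;
    hi[0] = 2;
    println!("{:?}", prms); // [1, 0, 0, 2, 0, 0]
}
```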

0 Likes

#9

Thank you for the suggested resources. I am studying them now.

Conceptually I would like `prms` to be considered a global variable which can be accessed at any time by any thread automatically. I want the threads to be able to asynchronously update the memory addresses they compute (which can never collide at any time) in parallel. So whenever I start a thread it puts a `1` into those memory addresses it computes, as soon as it can.

Questions:

1. Can I create/treat `prms` as a globalized variable in Rust?
2. Can I then use it in any thread without blocking its parallel use in other threads?
3. Can threads be run asynchronously in Rust in this manner?

From what I've read, Rust makes it very hard to use/borrow a resource by more than one user at a time (and forces synchronized access to it), but that's what I need/want to do to run this algorithm truly in parallel. Is my thinking wrong on this?

0 Likes

#10

That sounds like some kind of god object or the `Singleton` pattern, which is a very bad idea. You want to minimize global state: the more global state you have (and code that accesses that global state), the harder it gets to debug and to secure access to those resources (you can secure it with locks in the case of threads, but that will kill performance, or even deadlock if you are not careful). Imagine that anybody can modify that code: when you hit a bug (and you will), you won't know where to start or what happened, because so many parts of the code could have changed the state of that global object.

Again, Rust is perfectly fine with you having mutable references (`&mut var`) into different locations of a slice ("array"), as long as each is the only reference for that lifetime, or with having many shared references (`&var`) into any location as long as the lifetime rules are followed (for example, the references must not outlive the resource they reference). I might have skipped something, but this is basically it. In any case, there is the book, which can explain references better than me.

Thatâs why that `rayon` example of `quick_sort` can sort a slice in parallel without any locking mechanismâŚ

What you probably read is that Rust won't allow you to have one `&mut` reference and any other reference to that same location, because this is risky and one of the main sources of bugs.

As long as you have only one `&mut` reference into a location in `prms`, Rust will be perfectly fine with that.
If you try to create another reference after that, only then will Rust not allow it, to save you from horrible bugs.
If you want to have multiple `&mut` references into the same location, then there is no other way than using a locking mechanism.

`let something: &'static [type_here] = global_stuff;`
Or you could use a `const` or a `static`.

0 Likes

#11

Thanks again for your ideas.

From reading the online Rust Book, can I use a `raw pointer` to `prms` and share that across asynchronous threads to mimic using it as a global resource, or is that no better than sharing it as a `&mut var`?

Iâm still looking at the `Rayon` docs, but havenât had the quiet time yet to really study and play with it. Iâll start coding implementations with it this week after I do.

0 Likes

#12

Try to avoid `unsafe` code if you don't really need it.
It is definitely worse than `&mut var`. In this case, with `unsafe` code you are basically disabling what's good about Rust for no good reason. You should use `unsafe` only if there is no other way and you know very well what you are doing (i.e., you can reason about the issues that can appear).

0 Likes

#13

Well, I finally got a version using threads, etc. to work (give the correct answers), but it's 2.5 times slower than the version using `crossbeam`. I'm resigning myself to the possibility that Rust currently may not be able to make a parallel version faster than the serial/sequential version, unless someone can show me otherwise.

``````
use std::thread;
use std::sync::{Arc, Mutex};

fn main() {
    let residues = [  1, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67,
                     71, 73, 79, 83, 89, 97,101,103,107,109,113,121,127,131,137,139,
                    143,149,151,157,163,167,169,173,179,181,187,191,193,197,199,209,211];

    let val = 1_000_000_000;
    let md = 210;
    let rescnt = 48;

    println!("val = {}, mod = {}, rescnt = {}", val, md, rescnt);

    let mut posn = [0; 210];
    // note: at i == 0 this wraps posn[1] around (relies on release-mode wrapping arithmetic)
    for i in 0..rescnt {posn[residues[i]] = i-1;};

    let mut modk; let mut r; let mut k;

    let num = val-1 | 1;            // if value even number then subtract 1
    k = num/md; modk = md*k; r = 1; // kth residue group, base num value
    while num >= modk+residues[r] {r += 1;} // find last pc position <= num
    let maxpcs = k*rescnt + r-1;    // maximum number of prime candidates
    //let prms: Vec<u8> = vec![0; maxpcs];
    let prms = Arc::new(Mutex::new(vec![0; maxpcs])); // I can't figure out how to use the byte vector here

    println!("num = {}, k = {}, modk = {}, maxpcs = {}", num, k, modk, maxpcs);

    let sqrt_n = (num as f32).sqrt() as usize;
    modk = 0; r = 0; k = 0;

    // sieve to eliminate nonprimes from small primes prms array
    for i in 0..maxpcs {
        r += 1; if r > rescnt {r = 1; modk += md; k += 1;};
        { // I have to do this in order to index into prms
            let prms = prms.lock().unwrap();
            if prms[i] == 1 {continue;}
        }
        let prm_r = residues[r];
        let prime = modk + prm_r;
        if prime > sqrt_n {break;}
        let prmstep = prime * rescnt;
        let mut handles = vec![];
        for i in 1..rescnt+1 {
            let ri = residues[i];
            let prms = prms.clone();
            handles.push(thread::spawn(move || {
                let mut prms = prms.lock().unwrap();
                let prod = prm_r * ri;
                let mut nonprm = (k*(prime + ri) + prod/md)*rescnt + posn[prod % md];
                while nonprm < maxpcs {prms[nonprm] = 1; nonprm += prmstep;};
            }));
        }
        for handle in handles {let _ = handle.join();}
        //for handle in handles {handle.join().map_err(|_| "Could not join a thread!").unwrap();}
    }
    // the prms array now has all the positions for primes r1..N
    // extract prime numbers and count from prms into prims array
    let mut prmcnt = 4;

    for i in 0..maxpcs {
        let prms = prms.lock().unwrap();
        if prms[i] == 0 {prmcnt += 1;}
    }
    println!("{}", prmcnt);
}
``````
0 Likes

#14

Not that âRust may not be able to perform a parallel version fasterâ, the code uses locks excessively.
`vec![0; maxpcs]` is equivalent to `vec![0i32; maxpcs]`(I think the default was i32, donât remember for sure).
You want to write `vec![0u8; maxpcs]`.

Also, you were saying before that there will always be only one reference to a single element in the array:

This disproves that assumption. Even though there are no references involved, `nonprm` will still range over most of the slice.

0 Likes

#15

This version isnât parallel either: each thread immediately grabs the lock upon spawning and keeps it locked until itâs done.

In any case, using locks here will be the slowest version of all, since the threads donât actually do much that doesnât require the lock held.

0 Likes

#16

You misunderstood what I said. I said all the addresses computed and written to are distinct/unique, thus there can never be a write collision at the same address.

0 Likes

#17

I think you could better parallelize this by splitting the writable parts into chunks. Compute all the small primes first, serially if need be, as they're much fewer (you only need those up to sqrt(n)). Then chunk up the large primes and sieve each chunk in parallel. Roughly:

``````
use rayon::prelude::*;

/* first compute the small primes, then: */
{
    let (small_prms, large_prms) = prms.split_at_mut(/* index of small/large boundary */);
    let small_prms = &*small_prms; /* drop to immutable `&` so it may be shared */
    large_prms.par_chunks_mut(/* tunable chunk size */)
        .weight_max() /* ensure every single chunk is a separate rayon job */
        .enumerate()  /* so you can figure out where in prms this chunk came from */
        .for_each(|(i, chunk)| {
            /* sieve this chunk with small_prms. (this runs in parallel - no locking required!) */
        });
}
/* now bask in your completed prms! */
``````

edit: I think this is basically what @matklad hinted at, using "disjoint sets of indices".
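To make the chunking idea concrete without pulling in rayon, here is a sketch of the same strategy for a plain (non-wheel) sieve using std scoped threads; `primes_up_to`, `sieve_chunk`, and the chunk size are all made up for the example, not part of the code above:

```rust
use std::thread;

// Mark composites in `chunk`, which represents the numbers
// `offset .. offset + chunk.len()`, using the given small primes.
// Each chunk is a disjoint `&mut` slice, so no locking is needed.
fn sieve_chunk(chunk: &mut [bool], offset: usize, small_primes: &[usize]) {
    for &p in small_primes {
        // first multiple of p inside this chunk (but at least p*p)
        let mut m = std::cmp::max(p * p, (offset + p - 1) / p * p);
        while m < offset + chunk.len() {
            chunk[m - offset] = true; // true = composite
            m += p;
        }
    }
}

fn primes_up_to(n: usize) -> usize {
    // step 1: compute the small primes (up to sqrt(n)) serially
    let limit = (n as f64).sqrt() as usize + 1;
    let mut small = Vec::new();
    let mut is_comp = vec![false; limit + 1];
    for i in 2..=limit {
        if !is_comp[i] {
            small.push(i);
            let mut m = i * i;
            while m <= limit { is_comp[m] = true; m += i; }
        }
    }
    // step 2: sieve the full range in parallel, one thread per chunk
    let mut flags = vec![false; n + 1]; // index = number, false = prime candidate
    let chunk_size = 4096;
    thread::scope(|s| {
        for (ci, chunk) in flags.chunks_mut(chunk_size).enumerate() {
            let small = &small;
            s.spawn(move || sieve_chunk(chunk, ci * chunk_size, small));
        }
    });
    (2..=n).filter(|&i| !flags[i]).count()
}

fn main() {
    println!("{}", primes_up_to(100_000)); // 9592 primes below 100_000
}
```

`chunks_mut` plays the role of `par_chunks_mut` here: it hands each thread a disjoint mutable slice, which is exactly the "disjoint sets of indices" argument made safe by the type system.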

1 Like

#18

More or less so. There are different ways to split an array into disjoint parts. The only way naturally expressible by Rust is (items with index less than x, items with index greater than x). Things like (odd-numbered elements, even-numbered elements) are, to my understanding, impossible. You can, of course, use unsafe code for such cases (but I was unable to come up with a correct combination of `UnsafeCell`s, `*mut T` and `unsafe impl Send` for this particular use case; I would really appreciate an example of mutably sharing an array between two threads, while writing to odd locations from one thread and to even locations from another).

But I think that in general it's much better to restructure a problem for a divide-and-conquer framework, rather than inventing some mathematically-proven-to-be-correct unsafe indexing.
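For the record, one possible unsafe sketch of the odd/even split described above, using a raw-pointer wrapper (the `Racy` name is made up); the disjointness argument lives entirely in our comments, not in the type system:

```rust
use std::thread;

// Wrapper so the raw pointer can cross thread boundaries.
// SAFETY contract (upheld by us, not checked by the compiler):
// every index is written by exactly one thread, so writes never overlap.
struct Racy(*mut u8);
unsafe impl Send for Racy {}
unsafe impl Sync for Racy {}

fn fill_odd_even(data: &mut [u8]) {
    let len = data.len();
    let ptr = Racy(data.as_mut_ptr());
    let ptr = &ptr;
    thread::scope(|s| {
        s.spawn(move || {
            for i in (0..len).step_by(2) {
                // SAFETY: only this thread touches even indices
                unsafe { *ptr.0.add(i) = 2 };
            }
        });
        s.spawn(move || {
            for i in (1..len).step_by(2) {
                // SAFETY: only this thread touches odd indices
                unsafe { *ptr.0.add(i) = 1 };
            }
        });
    });
}

fn main() {
    let mut v = vec![0u8; 6];
    fill_odd_even(&mut v);
    println!("{:?}", v); // [2, 1, 2, 1, 2, 1]
}
```

There is no data race because the two threads write disjoint memory locations, but nothing stops a refactor from breaking that invariant silently, which is why the divide-and-conquer split is preferable.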

0 Likes

#19

Even if you could safely make parallel interleaved mutations, that's a pretty nasty scenario for the CPU caches: it's likely to trigger false sharing.

2 Likes

#20

Looking at how you're using it, rather than `Mutex<Vec<u8>>` you could use `Vec<Mutex<u8>>` or even `Vec<AtomicBool>` or something similar, though any of those options will increase your memory usage quite substantially. If that's a problem you could try some kind of atomic-ish bitvec that makes better use of every bit of the atomic variable, maybe.
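For instance, a minimal sketch of the `Vec<AtomicBool>` variant; the sizes and the stride here are hypothetical stand-ins for `maxpcs` and `prmstep`:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

// Each thread stores `true` along its own arithmetic progression of
// indices; AtomicBool makes the concurrent stores safe without a Mutex.
fn mark(flags: &[AtomicBool], start: usize, step: usize) {
    let mut j = start;
    while j < flags.len() {
        flags[j].store(true, Ordering::Relaxed);
        j += step;
    }
}

fn main() {
    let flags: Vec<AtomicBool> = (0..16).map(|_| AtomicBool::new(false)).collect();
    thread::scope(|s| {
        for start in [0, 1] {
            let flags = &flags;
            s.spawn(move || mark(flags, start, 3));
        }
    });
    let marked = flags.iter().filter(|f| f.load(Ordering::Relaxed)).count();
    println!("{}", marked); // 11
}
```

The trade-off the post mentions is visible here: one `AtomicBool` per candidate costs a full byte each, which is what motivates the packed bitvec below.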

``````
use std::mem::size_of;
use std::sync::atomic::{AtomicUsize, Ordering};

pub struct BitVec {
    storage: Vec<AtomicUsize>,
}

impl BitVec {
    pub fn with_bits(bits: usize, init: bool) -> BitVec {
        let size = div_round_up(bits, val_size());
        let mut vec = Vec::with_capacity(size);
        for _ in 0..size {
            vec.push(AtomicUsize::new(if init { !0 } else { 0 }));
        }
        BitVec { storage: vec }
    }

    pub fn set(&self, bit: usize, to: bool) {
        let idx = bit / val_size();
        let val = &self.storage[idx];
        let bit_mask = 1 << (bit % val_size());
        if to {
            val.fetch_or(bit_mask, Ordering::Relaxed);
        } else {
            val.fetch_and(!bit_mask, Ordering::Relaxed);
        }
    }

    pub fn get(&self, bit: usize) -> bool {
        let idx = bit / val_size();
        let val = &self.storage[idx];
        let bit_mask = 1 << (bit % val_size());
        val.load(Ordering::Relaxed) & bit_mask == bit_mask
    }
}

#[inline] fn val_size() -> usize { size_of::<AtomicUsize>() * 8 }
#[inline] fn div_round_up(lhs: usize, rhs: usize) -> usize { (lhs + (rhs - 1)) / rhs }
``````
0 Likes