Run the Sieve of Eratosthenes algorithm in parallel in Rust


#1

Hi,

I’m relatively new to Rust. What really caught my attention was the “threads without data races” feature, so I thought I’d give this new beast a try :slight_smile:.

Recently I saw the “Sieve of Eratosthenes” algorithm, which computes a bunch of prime numbers, written in C++ with TBB, and I thought it would be good to compare Rust against it.

I have written the sequential version of the algorithm in Rust, but I don’t know how to parallelize it.

Sequential version:

fn sieve_of_eratosthenes(n: i32) {
    let mut vec = vec![1; (n + 1) as usize];

    let mut i: i32 = 2;
    while i * i <= n
    {
        if vec[i as usize] == 1
        {
            // parallelize this while loop
            let mut j = 2 * i;
            while j <= n {
                vec[j as usize] = 0;
                j += i;
            }
        }

        i += 1;
    }
}

To stay close to the C++ TBB library, I tried the simple_parallel crate.

Example from simple_parallel docs:

simple_parallel::for_(data.iter_mut().enumerate(), |(i, elem)| {
    *elem = i as i32;
});

But I could not find any way to set the start value and the step size in the parallel for loop, which I need for this problem. :confused:

Can anyone help me?


#2

simple_parallel::for_ takes an arbitrary iterator, so you’ll need to find some slice iterator that allows “striding”, i.e. a step size larger than 1. One example is itertools::StrideMut.

Setting the start is easy with slicing, e.g. data[2 * i..].iter_mut() will start iterating from data[2 * i].

Putting them together, something like simple_parallel::for_(StrideMut::from_slice(&mut data[2 * i..], i), |entry| { ... }) may work.
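For what it’s worth, the standard library’s Iterator::step_by (stabilized well after this thread) gives the same “start + stride” access pattern, so a sequential sketch of the inner loop looks like:

```rust
fn main() {
    let mut data = vec![true; 20];
    let i = 3;

    // Start at data[2 * i] and visit every i-th element from there,
    // i.e. exactly the indices 2*i, 3*i, 4*i, ...
    for flag in data[2 * i..].iter_mut().step_by(i) {
        *flag = false;
    }

    let cleared: Vec<usize> = (0..data.len()).filter(|&k| !data[k]).collect();
    println!("{:?}", cleared); // prints [6, 9, 12, 15, 18]
}
```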

That said, such a tiny unit of concurrency is highly unlikely to be faster, especially with simple_parallel::for_, since the overhead of spawning a thread for each item will be much larger than just running it sequentially. Using the for_ method on a simple_parallel::Pool will probably be better, but it will still have far more overhead than a pointer offset and write, which is all the sequential code is doing. It may work better to split the array into longer chunks: e.g. data[2 * i..].chunks_mut(1000 * i) will yield subslices each containing up to 1000 things to zero, which can then all be zeroed as a single unit of concurrency, with the plain while loop or even StrideMut.
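A rough sketch of that chunking idea, using std::thread::scope as a stand-in for a simple_parallel pool (the 1000 * i chunk size and the helper names are just illustrative choices, not anything from simple_parallel):

```rust
use std::thread;

// Strike out the multiples of `i`, splitting the tail of the sieve into
// large chunks so each spawned thread zeroes up to 1000 entries rather
// than a single one.
fn mark_multiples(flags: &mut [bool], i: usize) {
    let start = i * i; // first composite not handled by smaller primes
    let chunk = 1000 * i; // each chunk holds up to 1000 multiples of i

    thread::scope(|s| {
        for (c, slice) in flags[start..].chunks_mut(chunk).enumerate() {
            // `base` is the absolute index of slice[0]; it is a multiple
            // of i because both `start` and `chunk` are.
            let base = start + c * chunk;
            s.spawn(move || {
                let mut j = base;
                while j < base + slice.len() {
                    slice[j - base] = false;
                    j += i;
                }
            });
        }
    });
}

// Assumes n >= 2 so that flags[0] and flags[1] exist.
fn sieve(n: usize) -> Vec<bool> {
    let mut flags = vec![true; n + 1];
    flags[0] = false;
    flags[1] = false;

    let mut i = 2;
    while i * i <= n {
        if flags[i] {
            mark_multiples(&mut flags, i);
        }
        i += 1;
    }
    flags
}

fn main() {
    let flags = sieve(50);
    let primes: Vec<usize> = (0..flags.len()).filter(|&k| flags[k]).collect();
    println!("{:?}", primes); // the 15 primes up to 50
}
```

Because the subslices from chunks_mut are disjoint `&mut` borrows, no locking is needed: each thread owns its piece of the vector for the duration of the scope.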

(Incidentally, you can start from j = i * i, not just j = 2 * i. :smile: And Rust also has a proper bool type with true/false values… although I guess if you want to match the C++ exactly, you can’t change either of these.)


#3

Not sure you picked a good example to play with multi-threading. The problem is that your “vec” wants to be accessed by multiple threads. The only way that can “safely” happen is if they only have immutable access, but they need to change the values in the vector, so that won’t work…

One solution I came up with is to use a “collector” thread that collects the non-prime values and updates the array, while each value of “i” gets its own thread that calculates the non-primes:

use std::thread::{self, JoinHandle};
use std::sync::mpsc;

fn sieve_of_eratosthenes(n: usize) -> JoinHandle<()> {
    // Create the channel to communicate with the collector thread on.
    let (collector_tx, collector_rx) = mpsc::channel();

    // Create the collector thread that receives the non-prime values.
    let wait = thread::spawn(move || {
        let mut vec = vec![true; n + 1];

        // While any sender is still alive, clear the non-primes.
        while let Ok(o) = collector_rx.recv() {
            vec[o] = false;
        }

        // All senders have shut down; print out the primes.
        for i in 2..=n {
            if vec[i] {
                println!("Prime: {}", i);
            }
        }
    });

    let mut i = 2;

    while i * i <= n {
        // Clone the transmitter for the worker thread.
        let worker_tx = collector_tx.clone();

        // Spawn a worker thread to clear the non-primes for i.
        thread::spawn(move || {
            let mut j = 2 * i;

            while j <= n {
                // The collector outlives every sender, so send cannot fail.
                worker_tx.send(j).unwrap();
                j += i;
            }
        });

        i += 1;
    }

    // Return the collector's handle so the caller can wait for it to finish.
    wait
}

fn main() {
    sieve_of_eratosthenes(1024 * 1024).join().unwrap();
}

Not sure how efficient this is compared to the single-threaded version. One downside is that in the main loop I’m not able to check whether vec[i] is non-prime, because if a thread has mutable access to the vector, then only that thread can access the vector.


#4

Just because I don’t want to do what I should be doing: this is probably some more “normal” multi-threaded code, using a mutex so that multiple threads can update the vector. However, the loops are pretty tight, so I would imagine that the threads spend too much time waiting for access.

use std::thread;
use std::sync::{mpsc, Arc, Mutex};
    
struct Data {
    data : Mutex<Vec<bool>>
}        

impl Data {
    pub fn init(&self, n : usize) {
        let mut vec = self.data.lock().unwrap();
        
        for _ in 0..n+1 {
            vec.push(true);
        }
    }
    
    pub fn is_set(&self, i : usize) -> bool{
        let vec = self.data.lock().unwrap();

        vec[i]
    }
    
    pub fn clear(&self, i : usize) {
        let mut vec = self.data.lock().unwrap();

        vec[i] = false;
    }
    
    pub fn report(&self) {
        let vec = self.data.lock().unwrap();

        for i in 2..vec.len() {
            if vec[i] {
                println!("Prime: {}", i);
            }
        }
    }
}

fn sieve_of_eratosthenes(n : usize) {
    
    // Create the vector to hold the results
    let vec = Arc::new(Data { data : Mutex::new(Vec::with_capacity(n + 1)) });
    
    vec.init(n);
    
    // Create 8 worker threads to use.
    let mut channels = Vec::with_capacity(8);
    let mut threads  = Vec::with_capacity(8);
    
    for _ in 0..8 {
        let (tx, rx) = mpsc::channel::<(usize, usize)>();
        let mine     = vec.clone();
        
        let t = thread::spawn(move || {
                                
            while let Ok((i, n)) = rx.recv() {
                let mut j = 2 * i;
                
                while j < n {
                    
                    mine.clear(j);
                                                
                    j += i;
                }                            
            }         
        });
        
        channels.push(tx);
        threads.push(t);
    }
    

    // Main loop.
    let mut t = 0;                
    let mut i = 2;
    
    while i * i <= n {
        
        // It is possible a worker thread hasn't cleared this offset yet, so this
        // check is not 100% reliable, which means we may waste some work on non-primes.
        if vec.is_set(i) {
            channels[t].send((i, n)).unwrap();
            t = (t + 1) & 0x7;
        }
                    
        i += 1;
    }
        
    // Destroy all the channels, so the worker threads stop then wait for them.
    channels.clear();
    
    while let Some(t) = threads.pop() {
        t.join().unwrap();
    }
    
    // Print out the results
    vec.report();
}

fn main() {
    sieve_of_eratosthenes(1024 * 1024)
}

#5

I think what this really means is that the data-access restrictions Rust imposes on your design are such that your code does not have data races, or at least that it is more difficult to write code that does. And this is the last post… I promise.


#6

If you’re interested, using Vec&lt;std::sync::atomic::AtomicBool&gt; instead of Mutex&lt;Vec&lt;bool&gt;&gt; will likely be faster. This allows mutations to different parts of the vector to happen concurrently, without having to lock, and without having to use iterators to ensure things don’t alias.
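A minimal sketch of that idea; std::thread::scope stands in for a real thread pool here, and the worker count and the way multiples are striped across workers are illustrative assumptions:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

// Each AtomicBool can be cleared through a shared reference, so worker
// threads can all write into the same Vec with no Mutex. Relaxed ordering
// is enough here: the scope's implicit join orders the stores before the
// loads that follow it.
fn sieve_atomic(n: usize) -> Vec<usize> {
    let flags: Vec<AtomicBool> = (0..=n).map(|_| AtomicBool::new(true)).collect();

    let mut i = 2;
    while i * i <= n {
        if flags[i].load(Ordering::Relaxed) {
            thread::scope(|s| {
                const WORKERS: usize = 4; // arbitrary worker count
                for w in 0..WORKERS {
                    let flags = &flags;
                    s.spawn(move || {
                        // Worker w handles the multiples
                        // i*(i+w), i*(i+w+WORKERS), i*(i+w+2*WORKERS), ...
                        let mut j = i * (i + w);
                        while j <= n {
                            flags[j].store(false, Ordering::Relaxed);
                            j += i * WORKERS;
                        }
                    });
                }
            });
        }
        i += 1;
    }

    (2..=n).filter(|&k| flags[k].load(Ordering::Relaxed)).collect()
}

fn main() {
    println!("{:?}", sieve_atomic(30)); // prints [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
}
```

Spawning a fresh scope per prime still costs far more than the sequential inner loop for small n; the point is only that AtomicBool removes the locking, not the thread overhead.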


#7

Crap, good point, I forgot about AtomicBool; that would probably be better.


#8

This algorithm is not a really good example (in C++ I don’t get a big speedup with multiple cores), but I thought it was simple enough to try multi-threading in Rust. :sweat_smile:

Thank you for your answers! I think I will try both versions (simple_parallel and plain threads) to see how well they work.


#9

We’d be able to be of more help if you posted the original C++ code to compare.


#10

No problem:

Sequential version:

void sieve_of_eratosthenes(int n) {

    vector<bool> a(n + 1, true);

    int i = 2;
    while (i*i <= n)
    {
        if (a[i] == true)
        {
            for (int j = 2*i; j <= n; j+=i)
            {
                a[j] = false;
            }
        }
        i++;
    }
}

And the parallel-version:

void sieve_of_eratosthenes(int n) {

    concurrent_vector<bool> a(n,true);

    int i = 2;
    while (i*i <= n)
    {
        if (a[i] == true)
        {
            parallel_for(2*i, n, i, [&](int j)
            {
                a[j] = false;
            });
        }
        i++;
    }
}

with the TBB headers included: #include "tbb/tbb.h"


#11

Ah, OK. I believe TBB will process multiple values in a single thread to ensure there’s not too much overhead, i.e. along the lines of my suggestion above: