Is rayon's parallel iterator slower than Java?

I tested the following Rust and Java programs for estimating π in parallel:

use rand::{thread_rng, Rng};
use rayon::iter::{IntoParallelIterator, ParallelIterator};
const N: i32 = i32::MAX;
fn main() {
    let n: i32 = (0..N).into_par_iter().map(|_| {
        let mut rng = thread_rng();
        let x = rng.gen_range(0.0..1.0) - 0.5f64;
        let y = rng.gen_range(0.0..1.0) - 0.5f64;
        if x.powf(2.0) + y.powf(2.0) < 0.25 { 1 } else { 0 }
    }).sum();
    let pi = n as f64 / N as f64 * 4.0;
    println!("π ≈ {pi}");
}
import java.util.concurrent.ThreadLocalRandom;
import java.util.stream.IntStream;
public class Main {
    static final int N = Integer.MAX_VALUE;
    public static void main(String[] args) {
        int count = IntStream.range(0, N).parallel().map(i -> {
            double x = ThreadLocalRandom.current().nextDouble(0.0, 1.0) - 0.5;
            double y = ThreadLocalRandom.current().nextDouble(0.0, 1.0) - 0.5;
            return (x * x + y * y < 0.25) ? 1 : 0;
        }).sum();
        double pi = (double) count / N * 4.0;
        System.out.println("π ≈ " + pi);
    }
}

The Rust version took 7 seconds to execute on my computer, while the Java version took only 3 seconds.
I don't understand why that is. Is Rayon slower than the Java standard library?

One difference is that rand::thread_rng returns a cryptographically secure PRNG, while java.util.concurrent.ThreadLocalRandom uses a much cheaper, non-cryptographic algorithm.

4 Likes

Indeed, looking at the output from the samply profiler, I saw that most of the time was spent in random number generation. Swapping out rng.gen_range for fastrand::f64 (from the fastrand crate) reduced the runtime from ~9s to ~4s on my system.

Even when using the rand crate with its default settings, rng.gen::<f64>() was faster than gen_range while producing the same distribution of results.

4 Likes

Your Rust version, when run with cargo build --release && time cargo run --release, takes 2.4 seconds on my system. Following Parallel RNGs - The Rust Rand Book to get faster randomness in parallel gives me the following code, which saves about 10%, bringing it to roughly 2.2 seconds:

use rand::{
    distributions::{Distribution as _, Uniform},
    thread_rng,
};
use rayon::iter::{IntoParallelIterator, ParallelIterator};

const N: i32 = i32::MAX;

fn main() {
    let range = Uniform::new(-0.5f64, 0.5);

    let n: i32 = (0..N)
        .into_par_iter()
        .map_init(thread_rng, |rng, _| {
            let x = range.sample(rng);
            let y = range.sample(rng);
            if x.powf(2.0) + y.powf(2.0) < 0.25 {
                1
            } else {
                0
            }
        })
        .sum();
    let pi = n as f64 / N as f64 * 4.0;
    println!("π ≈ {pi}");
}

That's a small speed-up from map_init, which fetches the RNG once per batch of work instead of calling thread_rng for every sample.

I can get it down to 0.6 seconds by using SmallRng (a fast, non-cryptographic RNG) instead of the default CSPRNG:

use rand::{
    distributions::{Distribution as _, Uniform},
    rngs::SmallRng,
    thread_rng, SeedableRng,
};
use rayon::iter::{IntoParallelIterator, ParallelIterator};

const N: i32 = i32::MAX;

fn main() {
    let range = Uniform::new(-0.5f64, 0.5);

    let n: i32 = (0..N)
        .into_par_iter()
        .map_init(
            || SmallRng::from_rng(thread_rng()).expect("Could not create a fast RNG"),
            |rng, _| {
                let x = range.sample(rng);
                let y = range.sample(rng);
                if x.powf(2.0) + y.powf(2.0) < 0.25 {
                    1
                } else {
                    0
                }
            },
        )
        .sum();
    let pi = n as f64 / N as f64 * 4.0;
    println!("π ≈ {pi}");
}

If you get the same scaling as I did, that'll reduce your Rust run to under a second.

5 Likes

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.