I benchmarked equivalent Rust and Java programs that estimate π in parallel:
use rand::{thread_rng, Rng};
use rayon::iter::{IntoParallelIterator, ParallelIterator};

const N: i32 = i32::MAX;

fn main() {
    let n: i32 = (0..N)
        .into_par_iter()
        .map(|_| {
            let mut rng = thread_rng();
            let x = rng.gen_range(0.0..1.0) - 0.5f64;
            let y = rng.gen_range(0.0..1.0) - 0.5f64;
            if x.powf(2.0) + y.powf(2.0) < 0.25 { 1 } else { 0 }
        })
        .sum();
    let pi = n as f64 / N as f64 * 4.0;
    println!("π ≈ {pi}");
}
import java.util.concurrent.ThreadLocalRandom;
import java.util.stream.IntStream;

public class Main {
    static final int N = Integer.MAX_VALUE;

    public static void main(String[] args) {
        int count = IntStream.range(0, N).parallel().map(i -> {
            double x = ThreadLocalRandom.current().nextDouble(0.0, 1.0) - 0.5;
            double y = ThreadLocalRandom.current().nextDouble(0.0, 1.0) - 0.5;
            return (x * x + y * y < 0.25) ? 1 : 0;
        }).sum();
        double pi = (double) count / N * 4.0;
        System.out.println("π ≈ " + pi);
    }
}
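For reference, both programs use the same Monte Carlo estimator: sample points uniformly in the unit square centered at the origin and count how many land inside the inscribed circle of radius 1/2. The probability of a hit equals the ratio of the two areas:

\[
P\left(x^2 + y^2 < \tfrac{1}{4}\right) = \frac{\pi \left(\tfrac{1}{2}\right)^2}{1^2} = \frac{\pi}{4}
\quad\Longrightarrow\quad
\pi \approx 4 \cdot \frac{\text{count}}{N}
\]

which is why both versions multiply the hit fraction by 4.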
The Rust version took 7 seconds to execute on my computer, while the Java version took only 3 seconds.
I don't understand why that is. Is Rayon slower than the Java standard library?
Indeed, looking at the output of the samply profiler, I saw that most of the time was spent in random number generation. Swapping rng.gen_range out for fastrand::f64 (from the fastrand crate) reduced the runtime from ~9 s to ~4 s on my system.
Even when staying with the rand crate and its default settings, rng.gen::<f64>() was faster than gen_range and produced statistically equivalent results.
Your Rust version, when run with cargo build --release && time cargo run --release, takes 2.4 seconds on my system. Following "Parallel RNGs" from The Rust Rand Book to get faster randomness in parallel leads to the following code, which saves about 10% and brings it down to roughly 2.2 seconds:
use rand::{
    distributions::{Distribution as _, Uniform},
    thread_rng,
};
use rayon::iter::{IntoParallelIterator, ParallelIterator};

const N: i32 = i32::MAX;

fn main() {
    let range = Uniform::new(-0.5f64, 0.5);
    let n: i32 = (0..N)
        .into_par_iter()
        .map_init(thread_rng, |rng, _| {
            let x = range.sample(rng);
            let y = range.sample(rng);
            if x.powf(2.0) + y.powf(2.0) < 0.25 {
                1
            } else {
                0
            }
        })
        .sum();
    let pi = n as f64 / N as f64 * 4.0;
    println!("π ≈ {pi}");
}
That's a small speed-up: map_init initializes one RNG handle per Rayon work unit, instead of performing a thread-local lookup via thread_rng() on every iteration.
I can get it down to 0.6 seconds by using SmallRng (a fast RNG) instead of the default CSPRNG:
use rand::{
    distributions::{Distribution as _, Uniform},
    rngs::SmallRng,
    thread_rng, SeedableRng,
};
use rayon::iter::{IntoParallelIterator, ParallelIterator};

const N: i32 = i32::MAX;

fn main() {
    let range = Uniform::new(-0.5f64, 0.5);
    let n: i32 = (0..N)
        .into_par_iter()
        .map_init(
            || SmallRng::from_rng(thread_rng()).expect("Could not create a fast RNG"),
            |rng, _| {
                let x = range.sample(rng);
                let y = range.sample(rng);
                if x.powf(2.0) + y.powf(2.0) < 0.25 {
                    1
                } else {
                    0
                }
            },
        )
        .sum();
    let pi = n as f64 / N as f64 * 4.0;
    println!("π ≈ {pi}");
}
If you get the same scaling as I did, that'll reduce your Rust run to under a second.
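If you want to compare variants without shell timing, you can also measure inside the program using only the standard library (a minimal sketch; the work closure is a placeholder standing in for whichever π estimator you are benchmarking):

```rust
use std::time::Instant;

fn main() {
    // Placeholder workload; substitute any of the estimators above.
    let work = || (0..1_000_000u64).sum::<u64>();

    let start = Instant::now();
    let result = work();
    let elapsed = start.elapsed();
    println!("result = {result}, took {elapsed:?}");
}
```

Timing inside the process excludes cargo and process-startup overhead, which time measured from the shell includes.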