Parallelizing a Random Simulation

So all of this is a challenge from a YouTube video about speeding up this little experiment:

Let's say you roll a 4-sided die 231 times and count every time it lands on 1.
If you run this 1_000_000_000 times, can you get a result of 177 or more in any single one of those runs? (Very unlikely, and analyzing the distribution would probably be a better indicator, but as I said, I wanted to speed up the Python script the YouTuber used, and he didn't analyze the distribution.)
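(For context: the count of ones in a single run is binomial with n = 231 and p = 1/4, so the expected count is 57.75, and the chance of any single run reaching 177 or more is the tail sum

P(X \ge 177) = \sum_{k=177}^{231} \binom{231}{k} \left(\tfrac{1}{4}\right)^{k} \left(\tfrac{3}{4}\right)^{231-k}

which is astronomically small.)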

My first naive approach was implementing this in single-threaded Rust:

use std::iter;

use rand::{seq::IteratorRandom, thread_rng};

fn main() {
    let mut rng = thread_rng();

    let rolls: Vec<_> = iter::repeat_with(|| {
        (0..231)
            .map(|_| (1..=4).choose(&mut rng).unwrap()) // Using a range instead of an array
            .filter(|&x| x == 1)
            .count()
    })
    .take(1_000_000_000)
    .collect();

    let max_ones = rolls.iter().max().unwrap_or(&0);

    println!("Highest Ones Roll: {}", max_ones);
}

I then sped this up multiple times using multithreading, which I iterated on; my current solution is the following. Note that the threads_with_extra handling is there in case the number of rolls isn't cleanly divisible by the number of threads.

use std::{
    iter,
    sync::mpsc::{self, Receiver, Sender},
    thread,
};

use rand::{seq::IteratorRandom, thread_rng};

const ROLLS: usize = 1_000_000_000;

fn main() {

    let cpus = num_cpus::get();
    let rolls_per_thread = ROLLS / cpus;
    let remaining = ROLLS.rem_euclid(cpus);

    let num_threads_with_extra = remaining;
    let num_threads_without_extra = cpus - remaining;

    let mut handles = Vec::with_capacity(cpus);

    let (tx, rx): (Sender<Vec<usize>>, Receiver<Vec<usize>>) = mpsc::channel();

    for _ in 0..num_threads_with_extra {
        let handle = thread::spawn(thread_function(
            rolls_per_thread + 1,
            tx.clone(),
        ));
        handles.push(handle);
    }

    for _ in 0..num_threads_without_extra {
        let handle = thread::spawn(thread_function(
            rolls_per_thread,
            tx.clone(),
        ));
        handles.push(handle);
    }

    let mut result = Vec::with_capacity(ROLLS);
    for _ in 0..handles.len() {
        let part = rx.recv().unwrap();
        result.extend_from_slice(&part);
    }

    handles
        .into_iter()
        .for_each(|handle| handle.join().unwrap());

    let max_ones = result.iter().max().unwrap_or(&0);
    let rolls = result.len();

    println!("Highest Ones Roll: {max_ones}");
    println!("Number of Roll Sessions: {rolls}");
}

#[inline]
fn thread_function(
    num_rolls: usize,
    result_sender: Sender<Vec<usize>>,
) -> impl FnOnce() {
    move || {
        let mut rng = thread_rng();

        let rolls: Vec<_> = iter::repeat_with(|| {
            (0..231)
                .map(|_| (1..=4).choose(&mut rng).unwrap()) // Using a range instead of an array
                .filter(|&x| x == 1)
                .count()
        })
        .take(num_rolls)
        .collect();

        result_sender.send(rolls).unwrap();
    }
}

One surprising speedup was declaring ROLLS as a const. I thought the optimizer would do that, but I guess I was wrong. Another huge speedup was adding #[inline] to thread_function.
I also tried, instead of allocating Vecs and sending them through the channel, just sending the max of each thread through the channel and comparing the max values in the main thread, but surprisingly that was slower.
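For reference, that variant looked roughly like this (a sketch, not my exact code; spawn_max_worker is just an illustrative name):

use std::{sync::mpsc::Sender, thread};

use rand::{seq::IteratorRandom, thread_rng};

// Sketch of the "send only the max" variant: each thread reduces its
// sessions to a single local maximum and sends one usize back instead
// of a whole Vec.
fn spawn_max_worker(num_rolls: usize, tx: Sender<usize>) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        let mut rng = thread_rng();
        let local_max = (0..num_rolls)
            .map(|_| {
                (0..231)
                    .map(|_| (1..=4).choose(&mut rng).unwrap())
                    .filter(|&x| x == 1)
                    .count()
            })
            .max()
            .unwrap_or(0);
        tx.send(local_max).unwrap();
    })
}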

Now to my question: does anyone have any more suggestions on how I could optimize this further, to the limit? Since it's such a small snippet I don't care about readability :stuck_out_tongue:

Another HUGE speedup I just found was switching from rand to fastrand. Are there any significant downsides to fastrand for this kind of thing?
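For reference, the inner loop with fastrand looks something like this (a sketch; count_ones is just an illustrative name, and fastrand::u8 samples uniformly from the given range):

use std::iter;

// One 231-roll session using fastrand: no RNG handle to thread
// through, just a free function per roll.
fn count_ones() -> usize {
    iter::repeat_with(|| fastrand::u8(1..=4))
        .take(231)
        .filter(|&x| x == 1)
        .count()
}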

This depends on the context of your simulation. Faster random number generators sometimes have statistical properties in the produced numbers that skew the simulation result.

You can use the rayon crate to greatly condense your code: Rust Playground

(I hope I represented the task you want to achieve correctly there)


I tried rayon once, but it was slower than my attempt. I'll try again to see if your implementation is faster; it's at least more condensed.

Yeah, yours took almost half the time. Mine was this:

use rayon::iter::{repeatn, ParallelIterator};

use rand::{seq::IteratorRandom, thread_rng};

fn main() {
    let rolls = 1_000_000_000;
    let result: Vec<_> = repeatn((), rolls)
        .map(|_| {
            let mut rng = thread_rng();
            std::iter::repeat(())
                .take(231)
                .map(|_| (1..=4).choose(&mut rng).unwrap())
                .filter(|&x| x == 1)
                .count()
        })
        .collect();

    let max_ones = result.iter().max().unwrap_or(&0);

    println!("Highest Ones Roll: {}", max_ones);
}

The biggest difference is that my solution does not collect all results. It just looks for the maximum.

And choose is suboptimal if all you need is a uniform distribution of integers.
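Roughly, the shape is this (a sketch reconstructed from the description, not the exact Playground code; dice here is a Uniform distribution over the four faces):

use rand::{distributions::Uniform, thread_rng, Rng};
use rayon::iter::{repeatn, ParallelIterator};

fn main() {
    let n_throws = 231;
    // Uniform over the four faces; sampling this is cheaper than
    // `choose` on a range.
    let dice = Uniform::new_inclusive(1u8, 4);

    // Never collect the per-session counts: reduce straight to the max.
    let max_ones = repeatn((), 1_000_000_000)
        .map(|_| {
            let mut rng = thread_rng();
            rng.sample_iter(&dice)
                .take(n_throws)
                .filter(|&x| x == 1)
                .count()
        })
        .max()
        .unwrap_or(0);

    println!("Highest Ones Roll: {max_ones}");
}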

Here's my initial attempt. Haven't compared the runtime against any of the others.

Primarily, I tried to minimize the number of allocations and the amount of thread coordination, and to maximize the number of bits used per random number generated.

use rand;

const ROLLS_PER_TRIAL: usize = 231;
const TRIALS: usize = 10_000_000;
const N_THREADS: usize = 4;

// Yields d4 rolls (1..=4), consuming 2 bits per roll from a buffered
// 64-bit random word, so each call into the RNG produces 32 rolls.
struct D4<R> {
    rng: R,
    buf: u64,
    cached_bits: u8,
}

impl<R: rand::Rng> Iterator for D4<R> {
    type Item = u8;

    #[inline]
    fn next(&mut self) -> Option<u8> {
        if self.cached_bits == 0 {
            self.buf = self.rng.gen();
            self.cached_bits = 64;
        }

        let out = self.buf & 0x3;
        self.buf >>= 2;
        self.cached_bits -= 2;
        Some(1 + out as u8)
    }
}

impl<R: rand::Rng> From<R> for D4<R> {
    fn from(rng: R) -> Self {
        D4 { rng, buf: 0, cached_bits: 0 }
    }
}

// Count the ones in a single 231-roll session.
#[inline]
fn trial(d4: &mut D4<impl rand::Rng>) -> usize {
    d4.take(ROLLS_PER_TRIAL).filter(|&x| x == 1).count()
}

// Run n sessions and histogram the per-session counts.
#[inline]
fn histo(d4: &mut D4<impl rand::Rng>, n: usize) -> [u32; 1 + ROLLS_PER_TRIAL] {
    let mut out = [0; 1 + ROLLS_PER_TRIAL];
    for _ in 0..n {
        out[trial(d4)] += 1;
    }
    out
}

fn main() {
    assert_eq!(TRIALS % N_THREADS, 0);
    let handles = [(); N_THREADS].map(|_| {
        std::thread::spawn(|| {
            let mut d4: D4<_> = rand::thread_rng().into();
            histo(&mut d4, TRIALS / N_THREADS)
        })
    });

    // Merge the per-thread histograms into one.
    let mut result = [0; 1 + ROLLS_PER_TRIAL];
    for h in handles {
        for (tot, local) in result.iter_mut().zip(h.join().unwrap()) {
            *tot += local;
        }
    }

    dbg!(result);
}

Oh, I hadn't even thought about buffering bits; I'll try mixing that with fastrand and rayon later when I get time. Another speedup I found was using the jemalloc allocator.
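In case anyone wants to try it, swapping the allocator is only a couple of lines (assuming the tikv-jemallocator crate):

use tikv_jemallocator::Jemalloc;

// Route all heap allocations through jemalloc instead of the system
// allocator; this mainly helps the Vec-collecting variants.
#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;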

The rand::StdRng random generator is "cryptographically strong", which means by definition that its outputs are indistinguishable from true randomness (as far as we know). This means that one of two things will happen when you use it:

  • you will get statistically correct results, or
  • you will have inadvertently broken cryptographic security of the generator (very unlikely but it'd be cool)

fastrand and some other generators in rand don't have this property, which means you never know: you might get the correct results or you might not. To me that defeats the purpose of running a simulation, so I'd never do that.

My policy is to always use cryptographically secure random generators if I need randomness. I wrote a blog post about this:
Need a PRNG? Use a CSPRNG


So what use is fastrand then? When would I use an RNG that might be predictable?

For me the answer is: virtually never.

I'll very occasionally use an LCG, but only for toys that have nothing riding on the outcome. Even then, I'll only go there if I have special requirements, like implementing the RNG myself or minimizing the code/binary size.
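For illustration, such an LCG fits in a few lines (the constants are Knuth's MMIX parameters):

// A minimal 64-bit LCG using Knuth's MMIX multiplier and increment.
// Fine for toys: fast and tiny, but statistically weak in the low
// bits and trivially predictable.
struct Lcg(u64);

impl Lcg {
    fn next_u64(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0
    }
}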

These (the const and the #[inline]) shouldn't make a difference at all. It could be an issue with how you're timing this. Also, are you running in release mode?

You can probably retool to not need much allocation at all if you do some intermediate processing on the threads. Either calculate a histogram of results like my example does or just keep track of the max seen and throw away the individual run results.

You can improve this by using map_init (just like the example in the docs):

.map_init(thread_rng, |rng, _| {
    rng.sample_iter(&dice)
        .take(n_throws)
        .filter(|x| *x == 1)
        .count()
})

It should improve speed, but there's enough happening in each iteration that it won't be very much.


I am using hyperfine for timing, and I am running in release mode.