So, all of this comes from a challenge in a YouTube video about speeding up this little experiment:
Let's say you roll a 4-sided die 231 times and count how often it lands on 1.
If you repeat this 1_000_000_000 times, does any single run produce 177 or more ones? (Very unlikely: with p = 1/4 the expected count is 57.75 with a standard deviation of about 6.6, so 177 is roughly 18 standard deviations out, and analyzing the full distribution would probably be a better indicator. But as I said, I just wanted to speed up the Python script the YouTuber used, and he didn't analyze the distribution either.)
My first naive approach was implementing this in single-threaded Rust:
use std::iter;

use rand::{seq::IteratorRandom, thread_rng};

fn main() {
    let mut rng = thread_rng();
    let rolls: Vec<_> = iter::repeat_with(|| {
        (0..231)
            .map(|_| (1..=4).choose(&mut rng).unwrap()) // Using a range instead of an array
            .filter(|&x| x == 1)
            .count()
    })
    .take(1_000_000_000)
    .collect();
    let max_ones = rolls.iter().max().unwrap_or(&0);
    println!("Highest Ones Roll: {}", max_ones);
}
I then sped this up several times over by using multithreading, which I've iterated on; my current solution is below. Note that the threads_with_extra logic handles the case where the number of rolls isn't cleanly divisible by the number of threads.
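Just to illustrate that split in isolation (toy numbers, not part of the actual program):

fn main() {
    // Toy example: 10 rolls split across 4 threads.
    let (rolls, cpus) = (10_usize, 4_usize);
    let rolls_per_thread = rolls / cpus; // 2
    let remaining = rolls % cpus; // 2 (same as rem_euclid for usize), so 2 threads get one extra roll
    // Two threads do 3 rolls each, two do 2 each: 2*3 + 2*2 = 10.
    assert_eq!(
        remaining * (rolls_per_thread + 1) + (cpus - remaining) * rolls_per_thread,
        rolls
    );
}

The actual solution: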
use std::{
    iter,
    sync::mpsc::{self, Receiver, Sender},
    thread,
};

use rand::{seq::IteratorRandom, thread_rng};

const ROLLS: usize = 1_000_000_000;

fn main() {
    let cpus = num_cpus::get();
    let rolls_per_thread = ROLLS / cpus;
    let remaining = ROLLS.rem_euclid(cpus);
    let num_threads_with_extra = remaining;
    let num_threads_without_extra = cpus - remaining;
    let mut handles = Vec::with_capacity(cpus);
    let (tx, rx): (Sender<Vec<usize>>, Receiver<Vec<usize>>) = mpsc::channel();
    for _ in 0..num_threads_with_extra {
        let handle = thread::spawn(thread_function(rolls_per_thread + 1, tx.clone()));
        handles.push(handle);
    }
    for _ in 0..num_threads_without_extra {
        let handle = thread::spawn(thread_function(rolls_per_thread, tx.clone()));
        handles.push(handle);
    }
    let mut result = Vec::with_capacity(ROLLS);
    for _ in 0..handles.len() {
        let part = rx.recv().unwrap();
        result.extend_from_slice(&part);
    }
    handles
        .into_iter()
        .for_each(|handle| handle.join().unwrap());
    let max_ones = result.iter().max().unwrap_or(&0);
    let rolls = result.len();
    println!("Highest Ones Roll: {max_ones}");
    println!("Number of Roll Sessions: {rolls}");
}

#[inline]
fn thread_function(num_rolls: usize, result_sender: Sender<Vec<usize>>) -> impl FnOnce() {
    move || {
        let mut rng = thread_rng();
        let rolls: Vec<_> = iter::repeat_with(|| {
            (0..231)
                .map(|_| (1..=4).choose(&mut rng).unwrap()) // Using a range instead of an array
                .filter(|&x| x == 1)
                .count()
        })
        .take(num_rolls)
        .collect();
        result_sender.send(rolls).unwrap();
    }
}
One surprising speed-up was declaring the number of rolls as a const. I thought the optimizer would do that for me, but I guess I was wrong. Another huge speed-up was adding #[inline] to thread_function.
I also tried sending just each thread's maximum through the channel (instead of allocating Vecs and sending those) and comparing the maxima in the main thread, but surprisingly that was slower.
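Roughly, the thread function in that variant looked like the following sketch (reconstructed, not the exact code I benchmarked; max_thread_function is just a name for this sketch):

use std::sync::mpsc::Sender;

use rand::{seq::IteratorRandom, thread_rng};

// Each thread keeps only a running maximum instead of collecting every count
// into a Vec; the main thread then takes the max of the per-thread maxima.
#[inline]
fn max_thread_function(num_rolls: usize, result_sender: Sender<usize>) -> impl FnOnce() {
    move || {
        let mut rng = thread_rng();
        let max_ones = (0..num_rolls)
            .map(|_| {
                (0..231)
                    .filter(|_| (1..=4).choose(&mut rng).unwrap() == 1)
                    .count()
            })
            .max()
            .unwrap_or(0);
        result_sender.send(max_ones).unwrap();
    }
}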
Now to my question: does anyone have more suggestions for how I could push this further, to the limit? Since it's such a small snippet, I don't care about readability.