Hi all,
This isn't really a rust-specific question, but this is my best place to look for help.
I'm wanting to estimate the median from a stream of numbers generated via a Monte Carlo simulation. I don't know in advance how many numbers will be provided before I need the median, but it could be up to 1e13 in practice, so storing all the numbers is not an option. The numbers are correlated (we don't know how much), so we don't want to just pick the first N to store. I'd like the algorithm to be fast, and accurate, and require little memory. The memory is somewhat significant because we regularly checkpoint the state of the simulation, so e.g. if we stored 1e9 numbers we'd be writing out 8 gigabytes to disk on each checkpoint, which would be far from ideal.
My current code (see https://github.com/droundy/sad-monte-carlo/blob/master/src/mc/energy_replicas.rs#L41) stores at most N numbers (currently 32). When it fills up, it keeps the middle half of the numbers, and doubles the "stride" at which it stores numbers. It just uses a Vec
, so I'm sorting the numbers each time I need to prune.
The algorithm doesn't need to be very accurate. What is most important is that I never end up with a "median" that is e.g. in the 90th or 10th percentile. The accuracy of the simulation isn't affected by the accuracy of the median, only its efficiency, but that can be drastically affected if I'm at 99% (as I have been recently). The current bug I'm tracking down (causing it to be at 99%) probably isn't in the median estimator, but looking for it reminded me that my algorithm was a bit hastily thrown together, and there might be nicer algorithms out there! (edit: Most likely the issue is that I'm computing the median before I've collected enough data to get beyond the correlation time.)
Here is a slightly simplified view of my current algorithm:
pub struct MedianEstimator {
energies: Vec<f64>,
step: usize,
skip: usize,
}
const ESTIMATOR_SIZE: usize = 32;
impl MedianEstimator {
fn new(e: f64) -> Self {
let mut energies = Vec::with_capacity(ESTIMATOR_SIZE);
energies.push(e);
MedianEstimator {
energies,
step: 0,
skip: 1,
}
}
fn add_energy(&mut self, e: f64) {
if self.step == 0 {
if self.energies.len() < ESTIMATOR_SIZE {
self.energies.push(e);
} else {
// Keep the middle 1/2 of our energies.
self.energies.sort_by(|a, b| a.partial_cmp(b).unwrap());
{
// Move the third quarter of the energies back into the
// first quarter, to the middle half of energies will be
// in the first half of the vec.
let (left, right) = self.energies.split_at_mut(ESTIMATOR_SIZE / 2);
left[0..ESTIMATOR_SIZE / 4].copy_from_slice(&right[0..ESTIMATOR_SIZE / 4]);
}
self.energies.truncate(ESTIMATOR_SIZE / 2);
self.skip *= 2;
}
}
self.step = (self.step + 1) % self.skip;
}
fn median(&mut self) -> f64 {
self.energies.sort_by(|a, b| a.partial_cmp(b).unwrap());
if self.energies.len() & 1 == 0 {
self.energies[self.energies.len() / 2]
} else {
0.5 * (self.energies[self.energies.len() / 2]
+ self.energies[self.energies.len() / 2 + 1])
}
}
}