Algorithm to find median from stream of correlated numbers?

Hi all,

This isn't really a Rust-specific question, but this is my best place to look for help. :slight_smile:

I want to estimate the median of a stream of numbers generated by a Monte Carlo simulation. I don't know in advance how many numbers will be provided before I need the median, but it could be up to 1e13 in practice, so storing all the numbers is not an option. The numbers are correlated (we don't know how much), so we don't want to just pick the first N to store. I'd like the algorithm to be fast and accurate, and to require little memory. Memory matters because we regularly checkpoint the state of the simulation, so e.g. if we stored 1e9 numbers we'd be writing out 8 gigabytes to disk on each checkpoint, which would be far from ideal.

My current code (see https://github.com/droundy/sad-monte-carlo/blob/master/src/mc/energy_replicas.rs#L41) stores at most N numbers (currently 32). When it fills up, it keeps the middle half of the numbers, and doubles the "stride" at which it stores numbers. It just uses a Vec, so I'm sorting the numbers each time I need to prune.

The algorithm doesn't need to be very accurate. What is most important is that I never end up with a "median" that is e.g. in the 90th or 10th percentile. The accuracy of the simulation isn't affected by the accuracy of the median, only its efficiency, but that can be drastically affected if I'm at 99% (as I have been recently). The current bug I'm tracking down (causing it to be at 99%) probably isn't in the median estimator, but looking for it reminded me that my algorithm was a bit hastily thrown together, and there might be nicer algorithms out there! (edit: Most likely the issue is that I'm computing the median before I've collected enough data to get beyond the correlation time.)

Here is a slightly simplified view of my current algorithm:

pub struct MedianEstimator {
    energies: Vec<f64>,
    step: usize,
    skip: usize,
}
const ESTIMATOR_SIZE: usize = 32;
impl MedianEstimator {
    fn new(e: f64) -> Self {
        let mut energies = Vec::with_capacity(ESTIMATOR_SIZE);
        energies.push(e);
        MedianEstimator {
            energies,
            step: 0,
            skip: 1,
        }
    }
    fn add_energy(&mut self, e: f64) {
        if self.step == 0 {
            if self.energies.len() < ESTIMATOR_SIZE {
                self.energies.push(e);
            } else {
                // Keep the middle 1/2 of our energies.
                self.energies.sort_by(|a, b| a.partial_cmp(b).unwrap());
                {
                    // Move the third quarter of the energies back into the
                    // first quarter, so that the middle half of the energies will be
                    // in the first half of the vec.
                    let (left, right) = self.energies.split_at_mut(ESTIMATOR_SIZE / 2);
                    left[0..ESTIMATOR_SIZE / 4].copy_from_slice(&right[0..ESTIMATOR_SIZE / 4]);
                }
                self.energies.truncate(ESTIMATOR_SIZE / 2);
                self.skip *= 2;
            }
        }
        self.step = (self.step + 1) % self.skip;
    }
    fn median(&mut self) -> f64 {
        self.energies.sort_by(|a, b| a.partial_cmp(b).unwrap());
        let n = self.energies.len();
        if n & 1 == 1 {
            // Odd number of samples: take the middle one.
            self.energies[n / 2]
        } else {
            // Even number of samples: average the two middle ones.
            0.5 * (self.energies[n / 2 - 1] + self.energies[n / 2])
        }
    }
}
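
For reference, a hypothetical driver for this estimator might look like the following (the sine stream is just a stand-in for the real correlated energies, not part of the actual simulation code):

fn main() {
    // Feed a stream of values and then ask for the running estimate.
    let mut est = MedianEstimator::new(0.0);
    for i in 1..1_000_000u64 {
        // Stand-in data; the real values come from the Monte Carlo simulation.
        let e = (i as f64 * 0.001).sin();
        est.add_energy(e);
    }
    println!("estimated median: {}", est.median());
}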

The median is defined as:

The median is the middle number in a data set.

The way to calculate it for a given set of numbers is:

To find the median, list your data points in ascending order and then find the middle number.

Seems to me that if one has an infinite sequence to deal with one could proceed as follows:

  1. Read the first number. Set the median to that value.
  2. For every subsequent number:
    IF number > median THEN increment the median value by 1.
    IF number < median THEN decrement the median value by 1.

That process ensures one's running median value always has as many previously read inputs greater than it as less than it. Which is to say, it is actually the median of the past data at all times.

Or is my logic going wrong somewhere?
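
In code, that proposal might look something like the following sketch (it assumes integer samples, since it moves the estimate by a fixed step of 1):

pub struct RunningMedian {
    estimate: i64,
    started: bool,
}

impl RunningMedian {
    pub fn new() -> Self {
        RunningMedian { estimate: 0, started: false }
    }
    pub fn add(&mut self, x: i64) {
        if !self.started {
            // Step 1: the first number becomes the initial estimate.
            self.estimate = x;
            self.started = true;
        } else if x > self.estimate {
            // Step 2: nudge the estimate up...
            self.estimate += 1;
        } else if x < self.estimate {
            // ...or down.
            self.estimate -= 1;
        }
    }
    pub fn median(&self) -> i64 {
        self.estimate
    }
}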

This would work great if your numbers were integers (and is quite clever). My numbers are floating point numbers, so there is no value equivalent to "1" by which I can increment or decrement the estimate.

Actually I think my logic is all wrong.

Given the sequence, as in the link I gave, [23, 24, 26, 26, 28, 29, 30, 31, 33, 34], my algorithm would just keep ramping upward, ending at 32 when the true median is 28.5.

Another counterexample is [0, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000]. Here it sets the "median" to 10, when the true median is 1000.

Use rand::seq::sample_iter to randomly sample some of your data as it comes in, and then compute the median of the sample at the end.

Check out chapters 9 and 10 in these lecture notes. They give an algorithm (Munro and Paterson 1980) for finding the median exactly in a streaming fashion, and another one (Greenwald-Khanna) for approximating any quantile.

(Kudos to my old CS professor for including this stuff in the algorithms course I took.)

Edit: the presentation of the Munro-Paterson algorithm is phrased in terms of integers; not sure whether that's essential or whether you can just plug in floating-point values. But it looks like Greenwald-Khanna is closer to what you're looking for anyway?

That would work if my data were coming from an iterator and there weren't a lot of other things going on in the program. I could look up the algorithm it uses and implement that.

But I'm not sure it's optimal for a given amount of storage, since you're evicting randomly chosen data from your cache, and by evicting outliers you could converge more closely to the true median. (Of course, I didn't ask for an optimal algorithm...)

Back in 1979 we had the task of calculating, on the fly, the standard deviation of the weight of product as a machine produced 10,000 pieces per minute and weighed them. This was impossible using the 8-bit 8080 processors we had, but we managed it with an algorithm like the one below and some help from hardware counters (it was previously done without a processor at all, just comparator and counter logic):

  1. Assume the samples have a Normal Distribution.

  2. Maintain a value Sh. Set Sh equal to the value of the first sample.

  3. For every new sample:
    IF the sample is greater than Sh THEN increment Sh by 1.
    IF there have been 13 samples (not necessarily consecutive) less than Sh THEN decrement Sh by 1 (and reset that 13-count back to 0).

Now, given that the samples are kind of random, not some degenerate sequence, Sh will wander around. It will mark a point on the high tail of the normal curve where the area under the curve is 13 times greater on the low side than on the high side.

Now, do all that the other way around for a value Sl. Sl will then float around and mark a point on the low tail of the curve.

Given the fact that this is a normal distribution, the point midway between Sh and Sl will be the mean.

Given that we know the area under those tails is 1/13 of the whole area, we can arrive at the value of the standard deviation from the spread between Sh and Sl (just a multiply by a constant, the value of which is an exercise for the reader).

Perhaps, maybe, the median can be discerned from all this. No idea.

The downside is that if there is a sudden rise in Standard Deviation in the sequence the calculated value ramps up very quickly, whereas if there is a sudden drop in Standard Deviation it ramps down 13 times more slowly.
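
A rough sketch of that Sh/Sl tracking, with the same assumptions as above (integer samples, the 13-count threshold, and an underlying normal distribution):

pub struct TailTracker {
    sh: i64,       // marks a point on the high tail
    sl: i64,       // marks a point on the low tail
    below_sh: u32, // samples below Sh since Sh was last decremented
    above_sl: u32, // samples above Sl since Sl was last incremented
    started: bool,
}

impl TailTracker {
    pub fn new() -> Self {
        TailTracker { sh: 0, sl: 0, below_sh: 0, above_sl: 0, started: false }
    }
    pub fn add(&mut self, x: i64) {
        if !self.started {
            self.sh = x;
            self.sl = x;
            self.started = true;
            return;
        }
        // High marker: up by 1 for every sample above it, down by 1 for
        // every 13 samples below it.
        if x > self.sh {
            self.sh += 1;
        } else if x < self.sh {
            self.below_sh += 1;
            if self.below_sh == 13 {
                self.sh -= 1;
                self.below_sh = 0;
            }
        }
        // Low marker: the mirror image.
        if x < self.sl {
            self.sl -= 1;
        } else if x > self.sl {
            self.above_sl += 1;
            if self.above_sl == 13 {
                self.sl += 1;
                self.above_sl = 0;
            }
        }
    }
    // For a normal distribution, Sh - Sl is proportional to the standard
    // deviation, and the midpoint of Sh and Sl approximates the mean.
    pub fn spread(&self) -> i64 {
        self.sh - self.sl
    }
    pub fn midpoint(&self) -> f64 {
        (self.sh + self.sl) as f64 / 2.0
    }
}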

Would the mean, which is easier to calculate from a stream, be a suitable proxy for the median here? They're reasonably close to each other for a lot of distributions.
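
For what it's worth, the running mean needs only constant memory and one update per sample; a minimal sketch:

pub struct RunningMean {
    count: u64,
    mean: f64,
}

impl RunningMean {
    pub fn new() -> Self {
        RunningMean { count: 0, mean: 0.0 }
    }
    pub fn add(&mut self, x: f64) {
        // Incremental update: mean += (x - mean) / n, avoiding a huge running sum.
        self.count += 1;
        self.mean += (x - self.mean) / self.count as f64;
    }
    pub fn value(&self) -> f64 {
        self.mean
    }
}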

Alternatively, if you have some prior idea of the distribution, you could build a histogram and return an arbitrary representative from the median bucket.

Not really, unless your data is already independent and random.

Those outliers may turn out to be the place where the median actually is, and if so, you will be missing all the data that would allow you to get anywhere close to the median.

Say your data looks like this:
10, 10, 10, ..., 10, (million times)
5, 5, 5, ..., 5 (million times)
1, 1, 1, ..., 1 (million times)
// now you evict outliers 1 and 10
20, 20, 20, ..., 20
30, 30, 30, ..., 30

The median is 10, and there are a lot of 10s in the data, but you have thrown them all away.

If you sample randomly, you will get very close with high probability. Random sampling can be done like this: say you have space for n elements. First, collect the first n elements. Then, when element number m > n comes in, with probability n/m replace a random element of the sample with it. This will give a uniformly random sample at the end.
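
That scheme is reservoir sampling; here is a minimal sketch (the rand calls assume the 0.8-style API):

use rand::Rng;

pub struct Reservoir {
    sample: Vec<f64>,
    capacity: usize,
    seen: usize, // how many elements have arrived so far
}

impl Reservoir {
    pub fn new(capacity: usize) -> Self {
        Reservoir { sample: Vec::with_capacity(capacity), capacity, seen: 0 }
    }
    pub fn add(&mut self, x: f64) {
        self.seen += 1;
        if self.sample.len() < self.capacity {
            // The first n elements are kept unconditionally.
            self.sample.push(x);
        } else {
            // Element number m replaces a random slot with probability n/m.
            let j = rand::thread_rng().gen_range(0..self.seen);
            if j < self.capacity {
                self.sample[j] = x;
            }
        }
    }
    pub fn median(&mut self) -> Option<f64> {
        if self.sample.is_empty() {
            return None;
        }
        self.sample.sort_by(|a, b| a.partial_cmp(b).unwrap());
        Some(self.sample[self.sample.len() / 2])
    }
}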

If you want to be more accurate, you can do a more complicated thing: whenever you run out of memory, throw out every other element in sorted order. Don't throw away outliers: throw away evenly throughout your data, so that the missing gaps are always relatively short sequences. Just remember how many elements were originally in each gap. At the end, you can find the gap in which the true median lies, and if you keep the gaps evenly short, you will be guaranteed to be close.
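
One way to illustrate that gap idea (just a sketch, not a tuned implementation): keep a bounded, sorted list of representative values, each remembering how many original samples it stands for, and merge adjacent pairs whenever memory runs out. The weighted median of the representatives then stays close to the true median.

pub struct GappedMedian {
    // (representative value, number of original samples it stands for)
    reps: Vec<(f64, u64)>,
    cap: usize,
}

impl GappedMedian {
    pub fn new(cap: usize) -> Self {
        GappedMedian { reps: Vec::with_capacity(cap + 1), cap }
    }
    pub fn add(&mut self, x: f64) {
        // Insert keeping the list sorted by value.
        let i = self.reps.partition_point(|&(v, _)| v < x);
        self.reps.insert(i, (x, 1));
        if self.reps.len() > self.cap {
            // Throw away every other element, but remember how many
            // original samples each surviving representative covers.
            let mut merged = Vec::with_capacity(self.reps.len() / 2 + 1);
            for pair in self.reps.chunks(2) {
                let count: u64 = pair.iter().map(|&(_, c)| c).sum();
                merged.push((pair[0].0, count));
            }
            self.reps = merged;
        }
    }
    pub fn median(&self) -> f64 {
        // Weighted median: walk the sorted representatives until half of
        // the total count has been passed.
        let total: u64 = self.reps.iter().map(|&(_, c)| c).sum();
        let mut seen = 0u64;
        for &(v, c) in &self.reps {
            seen += c;
            if seen * 2 >= total {
                return v;
            }
        }
        f64::NAN // no samples yet
    }
}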

Pretty sure in your example if you random sample you’ll end up with 30 as the median.

This problem is more complicated than it seems at first glance. And even though my eyes glaze over every time I read mathy papers, I'm sure @cole-miller's link provides an elegant solution.

No, you will end up with approximately equal numbers of 1s, 5s, 10s, 20s, and 30s in the sample, of which 10 will be the median with high probability.

That of course assumes that you sample properly with equal probabilities for each element, which can be done as I described above.

Sorry, I missed the n/m probability part.

If your floating point samples are in some known range, say 0.0 to 1.0, then why not divide that range into a number of "buckets". Say 1024. Each bucket will have a counter.

Now for each incoming sample find out what bucket it belongs to and increment that bucket's counter.

Over time the bucket counts will be an approximation of the distribution of the samples. Whatever shape it is. From which it's an easy task to find the median bucket.

For more accuracy use more buckets. Balance memory use against accuracy.
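
A minimal sketch of that bucket scheme, assuming samples are known to lie in [0.0, 1.0):

pub struct BucketMedian {
    counts: Vec<u64>,
}

impl BucketMedian {
    pub fn new(nbuckets: usize) -> Self {
        BucketMedian { counts: vec![0; nbuckets] }
    }
    pub fn add(&mut self, x: f64) {
        // Map the sample to a bucket, clamping values at the top of the range.
        let n = self.counts.len();
        let i = ((x * n as f64) as usize).min(n - 1);
        self.counts[i] += 1;
    }
    pub fn median(&self) -> f64 {
        // Walk the buckets until half the total count has been passed, and
        // return that bucket's midpoint as the approximate median.
        let total: u64 = self.counts.iter().sum();
        if total == 0 {
            return f64::NAN;
        }
        let mut seen = 0u64;
        for (i, &c) in self.counts.iter().enumerate() {
            seen += c;
            if seen * 2 >= total {
                return (i as f64 + 0.5) / self.counts.len() as f64;
            }
        }
        unreachable!()
    }
}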

When it comes to checkpointing, do you really need the median of the whole sequence since forever? Or can you just checkpoint the calculated median at regular intervals?

That bucket-count idea is called a histogram, which I agree is probably a reasonable way to approximate this.

Ha! So it is. Such a long time since I needed that word...

Actually, tracking the mean is what I started out doing, and I rejected it in favor of estimating the median because the mean tends to have a systematic bias. But that was way earlier in the algorithm development, and maybe it would be a good move now. It's certainly way simpler, and I track the mean anyhow. It's frequently at the 60th or even 70th percentile, but that might be just fine for my uses.

The histogram idea fails because I don't have an idea of the energy range. In a sense, I'm using my estimate of the median to find this out.

I'm actually dividing the energy range up into buckets, and the median I'm computing is that of the states that are below my minimum bucket!

Ah ha.

Sounds like you are going to need a quantum computer :slight_smile: