Algorithm to find median from stream of correlated numbers?

There are at least two algorithms for this, IIRC. The critical thing you'll need to pick is how you want to phrase the guarantee -- do you want it to be within ε percent of the real value, or within ε percentiles of the real value?

Here's one algorithm (I first saw it here, which indicates it does the latter bound), although it looks like it might be worse at medians than at any other percentile: https://github.com/tdunning/t-digest/blob/main/docs/t-digest-paper/histo.pdf

Definitely percentiles.

Whenever you're calculating the mean, I'd strongly suggest calculating the standard deviation too (if you're not already), since there's an easy, parallelizable, and numerically-stable algorithm to do so that's still O(1) space (just one more double): Algorithms for calculating variance - Wikipedia

It might not be directly useful, but I find it handy as a starting point for deciding how much I should bother looking for something better. Like 10 s ± 0.1 Np means that just saying "about 10" is probably good enough, but I also often get things like 2 s ± 1.6 Np, and need to look at the numbers in a more nuanced way. (I'm usually looking at one-sided data, so take logs to get geometric means and stdevs.)
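
In case it's useful, here's a minimal sketch of that online algorithm (Welford's method) in Rust; the struct and field names are just illustrative:

```rust
/// Online mean and variance via Welford's algorithm: O(1) space, numerically stable.
struct RunningStats {
    count: u64,
    mean: f64,
    m2: f64, // sum of squared deviations from the running mean
}

impl RunningStats {
    fn new() -> Self {
        RunningStats { count: 0, mean: 0.0, m2: 0.0 }
    }

    fn push(&mut self, x: f64) {
        self.count += 1;
        let delta = x - self.mean;
        self.mean += delta / self.count as f64;
        self.m2 += delta * (x - self.mean);
    }

    /// Sample standard deviation; NaN until at least two values have been seen.
    fn std_dev(&self) -> f64 {
        if self.count < 2 {
            f64::NAN
        } else {
            (self.m2 / (self.count - 1) as f64).sqrt()
        }
    }
}

fn main() {
    let mut stats = RunningStats::new();
    for &x in &[2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0] {
        stats.push(x);
    }
    println!("mean = {}, stddev = {}", stats.mean, stats.std_dev());
}
```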

In my context the energy is always below a given value (a cutoff energy), and the distance of the mean from the cutoff and the standard deviation will always be close. (Unless something seriously crazy happens, like an energy gap, which is theoretically possible but not likely for the kinds of systems I'm likely to be looking at.)


It's wonderful to calculate the median, mean, and standard deviation, but the results may mislead you if you're not sampling from something approximating a Gaussian distribution. The skewness and kurtosis can give you hints about whether that is the case. I have no idea how to calculate them from a stream.

From xkcd today, another alternative:

More seriously, I'm looking at a distribution that is not remotely Gaussian. It is often approximately exponential, but not so consistently that I'd want to assume that.


Have you considered using variational inference for this problem?

Can you explain further? A quick googling doesn't help me understand your suggestion.

VI is a machine-learning technique for updating and selecting (varying) probability distributions in a data-streaming setting. You would get a PDF from which you could reasonably derive the median. Since you start by establishing prior expectations, the results can be approximated (inferred) from relatively sparse data.

Take a look at the quantiles crate. It implements interesting algorithms to keep the amount of memory stored small while maintaining an upper bound on error. I think this is what you’d want.

If the memory requirements are still too large, try increasing the error tolerance.

There’s also the average crate which implements the Quantile type for keeping track of an arbitrary quantile with constant memory but unbounded error. If I were you, I would try to find a way to make the quantiles crate work before trying this one. Unbounded error sounds like it could be tricky.
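
For illustration, here's roughly how I'd expect the CKMS summary from the quantiles crate to be used for a streaming median; the `new`/`insert`/`query` calls are from memory, so double-check them against the crate docs:

```rust
use quantiles::ckms::CKMS;

fn main() {
    // ε = 0.001: roughly, the reported value's rank is within ε·n of the requested quantile.
    let mut summary = CKMS::<f64>::new(0.001);

    // Feed the stream; memory is bounded by the error tolerance, not the stream length.
    for i in 0..1_000_000 {
        summary.insert(i as f64);
    }

    // query(0.5) asks for the approximate median (rank and value), if any data was seen.
    if let Some((_rank, median)) = summary.query(0.5) {
        println!("approximate median: {}", median);
    }
}
```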


You could also implement the median-of-medians algorithm. In brief, the median of some disjoint and covering subsets of a sample is an estimator of the sample median itself. It's not exact and I don't have an instinct as to how good it actually is, but at least it allows you to define a clear memory-accuracy trade-off (the more subsets you generate and the more intermediate medians you store, the closer the estimation will be to the true value).
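
A rough sketch of that idea, assuming we simply split the stream into fixed-size chunks and keep one median per chunk (all names here are illustrative):

```rust
/// Streaming median-of-medians estimator: buffer `chunk_size` values at a time,
/// keep only each chunk's median, and report the median of those medians.
struct MedianOfMedians {
    chunk_size: usize,
    buffer: Vec<f64>,
    chunk_medians: Vec<f64>,
}

impl MedianOfMedians {
    fn new(chunk_size: usize) -> Self {
        MedianOfMedians { chunk_size, buffer: Vec::new(), chunk_medians: Vec::new() }
    }

    fn push(&mut self, x: f64) {
        self.buffer.push(x);
        if self.buffer.len() == self.chunk_size {
            self.chunk_medians.push(median(&mut self.buffer));
            self.buffer.clear();
        }
    }

    fn estimate(&self) -> Option<f64> {
        if self.chunk_medians.is_empty() {
            return None;
        }
        let mut medians = self.chunk_medians.clone();
        Some(median(&mut medians))
    }
}

/// Exact median of a small in-memory slice (sorts in place).
fn median(values: &mut [f64]) -> f64 {
    values.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let mid = values.len() / 2;
    if values.len() % 2 == 0 {
        (values[mid - 1] + values[mid]) / 2.0
    } else {
        values[mid]
    }
}

fn main() {
    let mut est = MedianOfMedians::new(100);
    for i in 0..10_000 {
        est.push(i as f64);
    }
    println!("estimated median: {:?}", est.estimate());
}
```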

Probably pretty good, if you can reliably obtain the covering subsets for your sample space; in practice that relies on ergodic averaging, and finding them is a related problem in and of itself.

That is the larger challenge. I do have a measure of how many guaranteed-uncorrelated visits there have been to the subspace in question, which I use to decide when I probably have enough information, but that can't tell me whether I've adequately sampled the subspace.

Generally you would run multiple discrete samples with identical characteristics and test for convergence using R-hat (and similar diagnostics), and possibly try to generate a trace plot. I don't think there is any good way to do this sort of adequacy testing from a single sample run.

Any approach involving an assumed form for the probability distribution seems like a non-starter. I'm not willing to bias my computation in that way, even if it is for a computation that is easily checked afterwards (and is actually checked, and can in principle be automatically corrected).

If it's any consolation, enough data will eventually nullify your bias.

A correct and reasonably fast solution (as in, the best result on Continuous Median – Kattis) can be implemented by keeping a "current median", and a BinaryHeap each for the values below and above it.

The size of the heaps will grow with the number of values, but it's not possible to get anything close to a median any other way unless you know before the start how many values there will be, and then you still need to store half the values.
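
A minimal sketch of that two-heap scheme (a max-heap for the half below the current median, and a min-heap, via `Reverse`, for the half above):

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Exact running median: a max-heap for the lower half, a min-heap for the upper half.
/// Memory grows linearly with the number of values seen.
struct RunningMedian {
    lower: BinaryHeap<i64>,          // max-heap of the smaller half
    upper: BinaryHeap<Reverse<i64>>, // min-heap of the larger half
}

impl RunningMedian {
    fn new() -> Self {
        RunningMedian { lower: BinaryHeap::new(), upper: BinaryHeap::new() }
    }

    fn push(&mut self, x: i64) {
        // Route the value to the correct half, then rebalance so the sizes differ by at most 1.
        if self.lower.peek().map_or(true, |&top| x <= top) {
            self.lower.push(x);
        } else {
            self.upper.push(Reverse(x));
        }
        if self.lower.len() > self.upper.len() + 1 {
            let top = self.lower.pop().unwrap();
            self.upper.push(Reverse(top));
        } else if self.upper.len() > self.lower.len() {
            let Reverse(top) = self.upper.pop().unwrap();
            self.lower.push(top);
        }
    }

    fn median(&self) -> Option<f64> {
        match (self.lower.peek(), self.upper.peek()) {
            (None, _) => None,
            (Some(&lo), None) => Some(lo as f64),
            (Some(&lo), Some(&Reverse(hi))) => {
                if self.lower.len() > self.upper.len() {
                    Some(lo as f64)
                } else {
                    Some((lo as f64 + hi as f64) / 2.0)
                }
            }
        }
    }
}

fn main() {
    let mut rm = RunningMedian::new();
    for x in [5, 1, 9, 2, 7] {
        rm.push(x);
        println!("median so far: {:?}", rm.median());
    }
}
```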


Having storage proportional to the number of samples is a non-starter: I don't have that much RAM; my biggest machine has 128 GB. And it's also false that you can't reasonably approximate the median with smaller storage; this thread lists numerous algorithms that do just that, the simplest (and what I'm currently experimenting with) being to find the median of a random subset of the data so far:

I expect the distinction in terms of "close" is whether we'd count a number with, say, 49% of the distribution above it. I would call that (for my purposes) close to the median, regardless of how much it might differ numerically. Even 32 random samples give pretty good error, and if you increase the number it gets better as the square root of the number of samples stored, which isn't too bad.
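
For concreteness, the kind of thing I mean: reservoir sampling to keep a fixed-size uniform random subset of the stream, then taking that subset's median. This is just a sketch; the `rand` calls (`thread_rng`, `gen_range`) are the usual rand 0.8 API, written from memory:

```rust
use rand::Rng;

/// Keep a uniform random sample of fixed size from a stream (reservoir sampling),
/// then report the sample's median as an estimate of the stream's median.
struct ReservoirMedian {
    capacity: usize,
    seen: u64,
    sample: Vec<f64>,
}

impl ReservoirMedian {
    fn new(capacity: usize) -> Self {
        ReservoirMedian { capacity, seen: 0, sample: Vec::with_capacity(capacity) }
    }

    fn push(&mut self, x: f64, rng: &mut impl Rng) {
        self.seen += 1;
        if self.sample.len() < self.capacity {
            self.sample.push(x);
        } else {
            // Replace a random element, so each value survives with probability capacity / seen.
            let j = rng.gen_range(0..self.seen) as usize;
            if j < self.capacity {
                self.sample[j] = x;
            }
        }
    }

    fn estimate(&self) -> Option<f64> {
        if self.sample.is_empty() {
            return None;
        }
        let mut s = self.sample.clone();
        s.sort_by(|a, b| a.partial_cmp(b).unwrap());
        let mid = s.len() / 2;
        Some(if s.len() % 2 == 0 { (s[mid - 1] + s[mid]) / 2.0 } else { s[mid] })
    }
}

fn main() {
    let mut rng = rand::thread_rng();
    let mut est = ReservoirMedian::new(1024);
    for i in 0..1_000_000 {
        est.push(i as f64, &mut rng);
    }
    println!("estimated median: {:?}", est.estimate());
}
```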

Then you will have to do without the median. In practice, that "median of a random sample" may or may not be good enough, but beware that it will not give you a number that is close to the median; it will give you a number that is likely to be close to the median. Taking the median of a random 1% sample of a large set of numbers will often get you close to the real median, but with bad luck (or a systematic error in choosing samples) you may get a value that's 99% off.

But if a number that is probably, in most cases, close to the median is good enough, then that random sample sounds about as good as it gets.


I agree, one should be more than a little wary of schemes relying on the 'average of averages' or a 'random sample from a random sample'.