Observed CDF of a vector

The empirical distribution function, or observed CDF, comes up a lot in my field of study. Here I have implemented a simple/naive version that performs the calculation, and it works fine on reasonably small sample sizes. But now that I'm working with ratios derived from river discharge data, I have a million observations, and my cdf function has started to slow down.

Is there some low-hanging fruit to be plucked here to improve throughput? I appreciate your help.

use rayon::prelude::*; // par_iter comes from the rayon crate

fn cdf(x: &[f64]) -> Vec<(f64, f64)> {
    let ln = x.len() as f64;
    let mut x_ord = x.to_vec();
    x_ord.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let mut cdf = Vec::new();
    for i in x_ord {
        // one O(n) scan per element makes the whole function O(n^2)
        let num = x.par_iter().filter(|x| **x <= i).count() as f64 / ln;
        cdf.push((i, num));
    }
    cdf
}


Since you're probably going to be sampling randomly from this using a binary search on the second f64, and that search already factors in duplicates, there's no need to dedup or count: just sort once and assign an incremental percent to each element.
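A minimal sketch of that idea, assuming one initial sort is still needed so the binary search works (the `sample` helper is my illustration, not from the thread):

```rust
fn cdf(x: &[f64]) -> Vec<(f64, f64)> {
    let n = x.len() as f64;
    let mut x_ord = x.to_vec();
    x_ord.sort_by(|a, b| a.partial_cmp(b).unwrap());
    // each element gets its rank as a percent; duplicates simply
    // occupy adjacent slots, which the binary search handles for free
    x_ord
        .into_iter()
        .enumerate()
        .map(|(i, v)| (v, (i + 1) as f64 / n))
        .collect()
}

// sampling: draw p uniformly in (0, 1], then binary-search the second f64
fn sample(cdf: &[(f64, f64)], p: f64) -> f64 {
    let idx = cdf.partition_point(|&(_, q)| q < p);
    cdf[idx.min(cdf.len() - 1)].0
}
```

This turns the O(n²) counting loop into a single O(n log n) sort plus an O(n) pass.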

So I can skip the initial sort and dedup, as shown below. Is this what you meant? How do I make the count go away?

pub fn cdf(x: &[f64]) -> Vec<f64> {
    let ln = x.len() as f64;
    let mut y = Vec::new();
    for i in x {
        y.push(x.iter().filter(|x| *x <= i).count() as f64 / ln);
    }
    y
}

Once you sort the inputs, the count of elements <= any unique element is one greater than its index; for duplicate elements, the same is true of the one at the greatest index. So you could just reduce the sorted Vec.
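A sketch of that reduction (the shape of the loop is my reading of the suggestion): walk the sorted values once and, whenever a value repeats, overwrite the running percent so each unique value keeps the percent belonging to its greatest index.

```rust
fn cdf(x: &[f64]) -> Vec<(f64, f64)> {
    let n = x.len() as f64;
    let mut x_ord = x.to_vec();
    x_ord.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let mut cdf: Vec<(f64, f64)> = Vec::with_capacity(x_ord.len());
    for (i, v) in x_ord.into_iter().enumerate() {
        let p = (i + 1) as f64 / n;
        match cdf.last_mut() {
            // duplicate of the previous value: bump its percent up
            // to the one at the greatest index
            Some(last) if last.0 == v => last.1 = p,
            _ => cdf.push((v, p)),
        }
    }
    cdf
}
```

No counting pass remains, so the whole thing is dominated by the sort.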

This could maybe be replaced by dedup_by, if it has enough guarantees, in which case you could map the input to tuples and sort those to avoid one of the two allocations. Didn't seem worth attempting as a first go though.
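For what it's worth, `dedup_by` does seem to have the guarantees needed: it keeps the first element of each run and passes the pair as (later element, retained earlier element), so the larger percent can be copied onto the retained tuple before the duplicate is dropped. A sketch of the single-allocation variant, under that assumption:

```rust
fn cdf(x: &[f64]) -> Vec<(f64, f64)> {
    let n = x.len() as f64;
    // one allocation: build the tuples up front, then sort by value
    let mut cdf: Vec<(f64, f64)> = x.iter().map(|&v| (v, 0.0)).collect();
    cdf.sort_by(|a, b| a.0.partial_cmp(&b.0).unwrap());
    for (i, e) in cdf.iter_mut().enumerate() {
        e.1 = (i + 1) as f64 / n;
    }
    // dedup_by passes (later, earlier) and keeps the earlier element,
    // so copy the later (larger) percent onto it before removal
    cdf.dedup_by(|a, b| {
        if a.0 == b.0 {
            b.1 = a.1;
            true
        } else {
            false
        }
    });
    cdf
}
```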


The code example really helps, I see what you both mean now.