The empirical distribution function, or observed CDF, comes up a lot in my field of study. Here I have implemented a simple/naive version that performs the calculation, and it works fine on reasonably small sample sizes. But when I work with ratios derived from river discharge data, I have a million observations, and my cdf function has started to slow down.
Is there some low-hanging fruit to be plucked here to improve throughput? I appreciate your help.
```rust
use rayon::prelude::*; // par_iter comes from the rayon crate

fn cdf(x: &[f64]) -> Vec<(f64, f64)> {
    let ln = x.len() as f64;
    let mut x_ord = x.to_vec();
    x_ord.sort_by(|a, b| a.partial_cmp(b).unwrap());
    x_ord.dedup();
    let mut cdf = Vec::new();
    for i in x_ord {
        // O(n) scan per unique value, so O(n^2) overall
        let num = x.par_iter().filter(|x| **x <= i).count() as f64 / ln;
        cdf.push((i, num));
    }
    cdf
}
```
Since you're probably going to sample randomly from this using a binary search on the second f64, and duplicates are already factored in at that point, there's no need to dedup or count: just assign an incremental percent to each element of the sorted input.
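If I follow the suggestion, a minimal sketch might look like this (the name `cdf_sorted` is mine): sort once, then the empirical CDF at the element with index i is simply (i + 1) / n, duplicates included.

```rust
fn cdf_sorted(x: &[f64]) -> Vec<(f64, f64)> {
    let n = x.len() as f64;
    let mut x_ord = x.to_vec();
    // total_cmp gives a total order over f64, so no unwrap needed
    x_ord.sort_by(|a, b| a.total_cmp(b));
    x_ord
        .iter()
        .enumerate()
        .map(|(i, &v)| (v, (i as f64 + 1.0) / n))
        .collect()
}
```

Each duplicate gets its own incrementally larger percent, which is harmless for binary search on the second field. One sort at O(n log n) replaces the O(n^2) counting loop.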
So I can skip the initial sort and dedup, as shown below. Is this what you meant? How do I make the count go away?
```rust
pub fn cdf(x: &[f64]) -> Vec<f64> {
    let ln = x.len() as f64;
    let mut y = Vec::new();
    for i in x {
        // still an O(n) scan per element, so O(n^2) overall
        y.push(x.iter().filter(|x| *x <= i).count() as f64 / ln);
    }
    y
}
```
Once you sort the inputs, the count of elements <= any unique element is one greater than its index; for duplicate elements, the same is true of the occurrence at the greatest index. So you could just reduce the sorted Vec.
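That reduction over the sorted Vec could be sketched like so (`cdf_dedup` is my name for it): a single pass that, for each distinct value, keeps the percent at its greatest index.

```rust
fn cdf_dedup(x: &[f64]) -> Vec<(f64, f64)> {
    let n = x.len() as f64;
    let mut x_ord = x.to_vec();
    x_ord.sort_by(|a, b| a.total_cmp(b));
    let mut out: Vec<(f64, f64)> = Vec::new();
    for (i, &v) in x_ord.iter().enumerate() {
        let p = (i as f64 + 1.0) / n;
        match out.last_mut() {
            // duplicate: overwrite with the percent at the greatest index
            Some(last) if last.0 == v => last.1 = p,
            _ => out.push((v, p)),
        }
    }
    out
}
```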
This could maybe be replaced by dedup_by, if it has enough guarantees, in which case you could map the input to tuples and sort those to avoid one of the two allocations. Didn't seem worth attempting as a first go though.
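For what it's worth, a sketch of that `dedup_by` variant (`cdf_dedup_by` is a hypothetical name): tuples are allocated up front and the cdf is filled in place after sorting, so there's only the one allocation. It leans on `dedup_by` passing the later duplicate as the first closure argument, so its larger percent can be copied into the kept tuple before it's dropped.

```rust
fn cdf_dedup_by(x: &[f64]) -> Vec<(f64, f64)> {
    let n = x.len() as f64;
    // single allocation: tuples up front, sort by value, fill cdf in place
    let mut pairs: Vec<(f64, f64)> = x.iter().map(|&v| (v, 0.0)).collect();
    pairs.sort_by(|a, b| a.0.total_cmp(&b.0));
    for (i, p) in pairs.iter_mut().enumerate() {
        p.1 = (i as f64 + 1.0) / n;
    }
    pairs.dedup_by(|a, b| {
        if a.0 == b.0 {
            b.1 = a.1; // keep the later, larger percent on the retained tuple
            true
        } else {
            false
        }
    });
    pairs
}
```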