Summary: I need to maintain a large number of things in a sorted structure with iteration, ith insertion and lookup. I'm presently using a Vec but the performance is abysmal. I could really use recommendations on alternative structures to use, fancy tricks.
Hidy folks,
I have an implementation of "Effective Computation of Biased Quantiles Over Data Streams" in my quantiles library that works pretty well if the number of samples you have is low-ish, say a handful of thousands. The implementation is here. CKMS is a data structure that allows you to answer approximate quantile queries on a stream of data in bounded space. The underlying algorithm has this neat trick where the samples stored into the structure are compressed in such a way that any quantile queries post-compression have a guaranteed error bound on then. You save space and get answers with a guaranteed accuracy. Tidy.
Now, the problem I've got is I'm finding a need for CKMS structures with significantly more samples and the insert performance with my implementation is just killing me. I've been fiddling with the implementation, have hit a brick wall and could use some advice on where to go next. I've put together an in-progress branch that has criterion benchmarks and some fiddling micro-optimizations. You can find that branch on Github. I run the benchmarks like cargo bench -- --test --nocapture
.
The basic problem I have with my implementation is this: quantiles/CKMS stores its samples, sorted, in a Vec. The algorithm needs access to i-1
, i
and i+1
elements in the underlying store – which is why I use a Vec – and must be able to perform sorted iteration on the elements.
Here, let's have a look at insertion. The whole function starts here but the important part is:
} else {
// inserting somewhere in the middle
let mut idx = 0;
let mut r = 0;
for smpl in self.samples.iter() {
match smpl.v.partial_cmp(&v).unwrap() {
cmp::Ordering::Less => {
idx += 1;
r += smpl.g
}
_ => break,
}
}
// To insert in the middle we have to know two things:
//
// * the correct point of insertion, 'idx'
// * the sum of all Entry.g prior to 'idx', called 'r'
//
// We could avoid the 'idx' computation if we could get at 'r'
// some other way.
(idx, invariant(r as f64, self.error) - 1)
};
self.samples.insert(
idx,
Entry {
v: v,
g: 1,
delta: delta,
},
);
In the paper v
is the value being inserted, g
is the difference in ranks between samples i
and i-1
and delta
is the difference between the lowest and greatest possible ranks of i
. Just, you know, incidentally. The important part here is that to insert we have to do a linear trawl through the Vec to compute r
so that we can compute delta
for the newly inserted item, as the comment points out, and then when we insert all the existing items get scooted down one in the Vec.
Compression – defined here – is much smaller than insertion.
fn compress(&mut self) {
if self.samples.len() < 3 {
return;
}
let mut s_mx = self.samples.len() - 1;
let mut idx = 0;
let mut r: u32 = 1;
while idx < s_mx {
let cur_g = self.samples[idx].g;
let nxt_v = self.samples[idx + 1].v;
let nxt_g = self.samples[idx + 1].g;
let nxt_delta = self.samples[idx + 1].delta;
if cur_g + nxt_g + nxt_delta <= invariant(r as f64, self.error) {
self.samples[idx].v = nxt_v;
self.samples[idx].g += nxt_g;
self.samples[idx].delta = nxt_delta;
self.samples.remove(idx + 1);
s_mx -= 1;
} else {
idx += 1;
}
r += 1;
}
}
We perform a pairwise step through the samples Vec, check against the invariant property the paper defines and, when possible, compress two samples into one (which turns out to be, modify one in-place and remove another from the Vec, incurring more scooting.)
My challenge is this: I can't use BinaryHeap or LinkedList. To the first, I'm maintaining a sorted store, right, so BinaryHeap immediately comes to mind. The tricky bit is on insertion I have to be able to query the structure for all Entry
with v
less than the to-be-inserted v
so that r
can be computed. Also, BinaryHeap does not allow sorted iteration – I think – without conversion into a Vec first, trouble for compressing and querying the structure. The paper authors discuss storing their samples in a linkedlist but Rust stdlib LinkedList does not allow insertion at arbitrary points.
I have been looking at BinaryHeap's implementation – specifically its Hole technique – with some interest but I don't know if I'd be barking up the wrong tree there or not. If anyone knows of any better structure than Vec that I could use or any wacky unsafe tricks I could make use of I'd greatly appreciate it.
Boy, if you read this all the way to the end thank you very much.