How can I optimize this data-structure?

Summary: I need to maintain a large number of things in a sorted structure with iteration, ith insertion and lookup. I'm presently using a Vec but the performance is abysmal. I could really use recommendations on alternative structures to use, or fancy tricks to apply.


Hidy folks,

I have an implementation of "Effective Computation of Biased Quantiles Over Data Streams" in my quantiles library that works pretty well if the number of samples you have is low-ish, say a handful of thousands. The implementation is here. CKMS is a data structure that allows you to answer approximate quantile queries on a stream of data in bounded space. The underlying algorithm has this neat trick where the samples stored in the structure are compressed in such a way that any quantile queries post-compression have a guaranteed error bound on them. You save space and get answers with guaranteed accuracy. Tidy.

Now, the problem I've got is I'm finding a need for CKMS structures with significantly more samples, and the insert performance with my implementation is just killing me. I've been fiddling with the implementation, have hit a brick wall and could use some advice on where to go next. I've put together an in-progress branch that has criterion benchmarks and some fiddling micro-optimizations. You can find that branch on Github. I run the benchmarks with `cargo bench -- --test --nocapture`.

The basic problem I have with my implementation is this: quantiles/CKMS stores its samples, sorted, in a Vec. The algorithm needs access to the i-1, i and i+1 elements in the underlying store – which is why I use a Vec – and must be able to perform sorted iteration on the elements.

Here, let's have a look at insertion. The whole function starts here but the important part is:

} else {
    // inserting somewhere in the middle
    let mut idx = 0;
    let mut r = 0;
    for smpl in self.samples.iter() {
        match smpl.v.partial_cmp(&v).unwrap() {
            cmp::Ordering::Less => {
                idx += 1;
                r += smpl.g
            }
            _ => break,
        }
    }
    // To insert in the middle we have to know two things:
    //
    //  * the correct point of insertion, 'idx'
    //  * the sum of all Entry.g prior to 'idx', called 'r'
    //
    // We could avoid the 'idx' computation if we could get at 'r'
    // some other way.
    (idx, invariant(r as f64, self.error) - 1)
};

self.samples.insert(
    idx,
    Entry {
        v: v,
        g: 1,
        delta: delta,
    },
);

In the paper, v is the value being inserted, g is the difference in ranks between samples i and i-1, and delta is the difference between the lowest and greatest possible ranks of i. Just, you know, incidentally. The important part here is that to insert we have to do a linear trawl through the Vec to compute r so that we can compute delta for the newly inserted item, as the comment points out, and then, when we insert, all the existing items after the insertion point get scooted down one in the Vec.
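To make that cost concrete, here's a rough sketch (not the library's code, with v pinned to f64 and a stripped-down Entry for simplicity) of the same computation with the search pulled out: idx on its own is a binary search away, but r is a prefix sum over g and still wants a walk from the front.

// Sketch only, not the library's code; a minimal Entry with just the fields
// the scan touches, and `v` pinned to f64 for simplicity.
struct Entry {
    v: f64,
    g: u32,
}

fn insertion_point_and_rank(samples: &[Entry], v: f64) -> (usize, u32) {
    // The insertion point alone is a binary search away...
    let idx = samples.partition_point(|smpl| smpl.v < v);
    // ...but `r` is a prefix sum over `g`, so everything before `idx` still has
    // to be walked, and the eventual Vec::insert shifts everything after it.
    let r: u32 = samples[..idx].iter().map(|smpl| smpl.g).sum();
    (idx, r)
}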

Compression – defined here – is a much smaller affair than insertion:

fn compress(&mut self) {
    if self.samples.len() < 3 {
        return;
    }

    let mut s_mx = self.samples.len() - 1;
    let mut idx = 0;
    let mut r: u32 = 1;

    while idx < s_mx {
        let cur_g = self.samples[idx].g;
        let nxt_v = self.samples[idx + 1].v;
        let nxt_g = self.samples[idx + 1].g;
        let nxt_delta = self.samples[idx + 1].delta;

        if cur_g + nxt_g + nxt_delta <= invariant(r as f64, self.error) {
            self.samples[idx].v = nxt_v;
            self.samples[idx].g += nxt_g;
            self.samples[idx].delta = nxt_delta;
            self.samples.remove(idx + 1);
            s_mx -= 1;
        } else {
            idx += 1;
        }
        r += 1;
    }
}

We perform a pairwise step through the samples Vec, check against the invariant property the paper defines and, when possible, compress two samples into one (which turns out to mean: modify one in place and remove the other from the Vec, incurring more scooting).

My challenge is this: I can't use BinaryHeap or LinkedList. To the first, I'm maintaining a sorted store, right, so BinaryHeap immediately comes to mind. The tricky bit is that on insertion I have to be able to query the structure for all Entry with v less than the to-be-inserted v so that r can be computed. Also, BinaryHeap does not allow sorted iteration – I think – without conversion into a Vec first, which is trouble for compressing and querying the structure. The paper's authors discuss storing their samples in a linked list, but the Rust stdlib LinkedList does not allow insertion at arbitrary points.

I have been looking at BinaryHeap's implementation – specifically its Hole technique – with some interest but I don't know if I'd be barking up the wrong tree there or not. If anyone knows of any better structure than Vec that I could use or any wacky unsafe tricks I could make use of I'd greatly appreciate it.

Boy, if you read this all the way to the end thank you very much.

(I have not actually read this thoroughly, so I might say something stupid. I do not understand the requirements fully, so I might say even more stupid things.)

I need to maintain a large number of things in a sorted structure with iteration, ith insertion and lookup.

I think you can use a search tree (AVL or B-tree) which additionally stores the number of elements below each node. A B-tree will give you sortedness and fast insertion/lookup. Storing the number of elements at each node allows you to quickly find the nth element.
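A rough sketch of the augmentation (hypothetical names, and not a full balanced tree; rebalancing would have to keep subtree_len up to date as well):

// Sketch of the idea only: each node carries the size of its subtree, and for
// this problem it could also carry the sum of `g` below it.
struct Node<T> {
    value: T,
    subtree_len: usize, // number of elements in this subtree
    left: Option<Box<Node<T>>>,
    right: Option<Box<Node<T>>>,
}

// Finding the n-th element (0-indexed) walks down, comparing against subtree sizes.
fn nth<T>(mut node: &Node<T>, mut n: usize) -> Option<&T> {
    loop {
        let left_len = node.left.as_deref().map_or(0, |l| l.subtree_len);
        if n < left_len {
            node = node.left.as_deref()?;
        } else if n == left_len {
            return Some(&node.value);
        } else {
            n -= left_len + 1;
            node = node.right.as_deref()?;
        }
    }
}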

3 Likes

I may be missing a step but I don’t think I can use a search tree (from the standard library), considering the way compression and query have to work. Compression does an ordered iteration though the samples, comparing the current element, i, to the next, i+1 and potentially merges these items together while continuing the iteration. Query does much the same but with i-1 and i.

Am I off the mark of what you are getting at?

Sounds like you need a custom binary search tree or skiplist.
Based on a quick read of the paper you need a custom one (which means using a SortedSet would not help you).

Let's say you go with a skiplist (using notation from the paper and Skip list - Wikipedia).
Finding v[i] < v is quite trivial in a skiplist, in O(lg n) time.
Summing all g[j] from 0 to i-1 will still be slow in a naive implementation.
That's where a skiplist will shine, because you can keep the sums in the higher layers as well and thus can sum things in O(lg n) time.
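Roughly, the per-node bookkeeping could look like this (a sketch of the idea only, not a working skiplist):

// Sketch of the bookkeeping only. Each forward link at a given level also
// records the total `g` of everything that link skips over, so a search for v
// can accumulate r on the way down, the same way an indexable skiplist
// accumulates "widths".
struct SkipNode {
    v: f64,
    g: u32,
    delta: u32,
    // One entry per level this node appears on: the index (or pointer) of the
    // next node at that level, plus the sum of `g` over the span that link skips.
    forward: Vec<(Option<usize>, u32)>,
}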

But compress will still be slow since it is a traversal through the whole structure. But you do not need to run it often (as the paper says).

Also read section 5 of the paper; it says many things about implementation issues.

4 Likes

Aha, so you want to modify the data structure during the iteration? That's definitely the piece I was missing!

Yeah, that might be tricky to implement in a search tree, but the big-O complexity would be right. To merge elements, you can first remove both of them and then insert the merged result. This will change the structure of the tree, so you won't be able to continue the iteration (and the Rust borrow checker will yell at you), so you'll need to remember the position in the tree (like, "elements >= 92.0"), stop iteration, do the remove-remove-insert trick, find the position again and resume iteration. Another approach would be to postpone actual modifications: during iteration, collect a sequence of operations like "merge these two, then these two, etc.", and then apply them once traversal is done.
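Here is a rough sketch of that second, postponed approach against a std BTreeMap; it is not your library's code. It assumes the sample value is something Ord (u64 here) and a simplified invariant(r), and it conservatively skips back-to-back merges rather than re-deciding them against updated weights:

use std::collections::BTreeMap;

// `Weights` stands in for the g/delta bookkeeping attached to each sample.
struct Weights {
    g: u32,
    delta: u32,
}

fn compress(samples: &mut BTreeMap<u64, Weights>, invariant: impl Fn(u32) -> u32) {
    // Pass 1: walk the tree in order and record which keys should fold into
    // their successor, without modifying the tree mid-iteration.
    let keys: Vec<u64> = samples.keys().copied().collect();
    let mut merges: Vec<(u64, u64)> = Vec::new(); // (key to remove, key it folds into)
    let mut r = 0;
    for pair in keys.windows(2) {
        let (k, next_k) = (pair[0], pair[1]);
        let cur = &samples[&k];
        let next = &samples[&next_k];
        // Skip a merge whose left-hand side was itself a merge target, so we
        // never decide against a stale `g`.
        let chained = merges.last().map_or(false, |&(_, into)| into == k);
        if !chained && cur.g + next.g + next.delta <= invariant(r) {
            merges.push((k, next_k));
        }
        r += cur.g;
    }
    // Pass 2: apply the recorded merges now that iteration is over.
    for (k, next_k) in merges {
        if let Some(folded) = samples.remove(&k) {
            if let Some(next) = samples.get_mut(&next_k) {
                next.g += folded.g;
            }
        }
    }
}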

Also, an important thing about running compress comes from section 5 (e is epsilon):

Instead of periodically running a full Compress, we amortize this work by scanning a fraction of the data structure for every insertion: we process `s/2e` tuples per item. The goal is to lower the worst-case processing at any time step and thus keep up with the data stream.

Also, we should mention the poor man's implementation of a sorted set with insertions, with time complexity around O(sqrt(n)) – which is imho easier than any skiplist, faster than just a Vec, and sometimes faster than skiplists due to caching and other effects.

You will maintain something like Vec<Vec<_>>, which resembles a two-layer B-tree.
You will have an upper bound U on the size of each inner Vec.
At the start you will have just one inner Vec.

Whenever you insert an item you just find the correct position (remember you can also keep auxiliary data about each inner Vec, like its size, the sum of its g[i], ...) by a mostly linear scan over the outer Vec and a linear scan in one inner Vec, and insert it there (in time linear in the size of that inner Vec).
Whenever some inner Vec gets too big, you split it into two.
3 Likes

Okay, so progress. I've gone and done the poor-man's approach first on account of I figured it'd be doable in a day. This branch has my changes. You can find the poor-man's approach in store.rs.

I did two benchmarking runs, one using Store and one using Vec. This is the Store run:

Benchmarking bench_insert_65535
> Warming up for 3.0000 s
> Collecting 100 samples in estimated 2788.049398057143 s
test ckms::u16::bench_insert_65535 has been running for over 60 seconds
> Found 10 outliers among 99 measurements (10.10%)
  > 1 (1.01%) low mild
  > 3 (3.03%) high mild
  > 6 (6.06%) high severe
> Performing linear regression
  >  slope [546.92 ms 549.23 ms]
  >    R^2  0.9867048 0.9866113
> Estimating the statistics of the sample
  >   mean [549.12 ms 553.25 ms]
  > median [548.18 ms 550.14 ms]
  >    MAD [3.3156 ms 7.2013 ms]
  >     SD [6.7260 ms 13.839 ms]

This is the Vec run:

Benchmarking bench_insert_65535
> Warming up for 3.0000 s
> Collecting 100 samples in estimated 2793.8877824142855 s
test ckms::u16::bench_insert_65535 has been running for over 60 seconds
> Found 17 outliers among 99 measurements (17.17%)
  > 6 (6.06%) low mild
  > 5 (5.05%) high mild
  > 6 (6.06%) high severe
> Performing linear regression
  >  slope [495.93 ms 522.53 ms]
  >    R^2  0.3730326 0.3617355
> Estimating the statistics of the sample
  >   mean [493.12 ms 507.94 ms]
  > median [490.67 ms 492.75 ms]
  >    MAD [2.8913 ms 8.1515 ms]
  >     SD [11.446 ms 63.188 ms]
bench_insert_65535: Comparing with previous sample
> Performing a two-sample t-test
  > H0: Both samples have the same mean
  > p = 0
  > Strong evidence to reject the null hypothesis
> Estimating relative change of statistics
  >   mean [-10.433% -7.7098%]
  > median [-10.682% -10.217%]
  > mean has improved by 9.43%
  > median has improved by 10.47%

Not exactly a rousing success, I'm afraid. I guess my next tack – before going all-in on a skiplist implementation – will be to shave a little time by storing a sorted insertion buffer and doing some pre-computation on each node. Still very amenable to suggestions, or pointers to where I may have goofed in making the poor-man's structure.

1 Like

Oh boy, this is what I call a miscommunication (I should not write suggestions while eating breakfast :confused: ).

The problem is here (https://github.com/postmates/quantiles/blob/8cdd9af6c6a4cee895b36a32228b06c7995344a4/src/ckms.rs#L303):

for smpl in self.samples.iter() {
   ...
}

No matter how much you optimize the insert procedure, if you keep iterating over all the samples before each insert it will be slow.

I will express the poor man's implementation a little more clearly now:

You will have:

type Store = Vec<InnerData<_>>;

struct InnerData {
  data: Vec<_>,
  sum_of_g: _,
  // maybe other metadata
}

And you need to use that sum_of_g to iterate over the array in a much faster way (you mention it here).
And also, instead of comparing to each element in the inner data, you can just compare to the last element to check whether it is worth looking into or whether to move on.
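A rough sketch of how that metadata gets used (hypothetical names, and a stripped-down Entry):

// Sketch only: add whole-block g-sums until we reach the block that will hold
// v, then finish with a short scan inside that single block.
struct Entry {
    v: f64,
    g: u32,
}

struct Inner {
    data: Vec<Entry>, // sorted by v
    sum_of_g: u32,    // precomputed: data.iter().map(|e| e.g).sum()
}

fn rank_before(store: &[Inner], v: f64) -> u32 {
    let mut r = 0;
    for block in store {
        // Peek at the block's last element to decide whether it is worth descending.
        match block.data.last() {
            Some(last) if last.v < v => r += block.sum_of_g, // skip the whole block
            _ => {
                // v falls in (or before) this block: scan only its entries.
                r += block
                    .data
                    .iter()
                    .take_while(|e| e.v < v)
                    .map(|e| e.g)
                    .sum::<u32>();
                break;
            }
        }
    }
    r
}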

3 Likes

Well hey, I've appreciated the help. I figured today I'd start mushing pre-computed data into the Store and am glad that's in line with your thinking as well. Hoping to have results to share here in a bit.

2 Likes

Hmm, middle insertion and subsequence folds? You might have good luck with finger trees, if you can fit the sum into the Measurement they support.

(And if you only need it ephemerally, there's probably a bunch of simplifications you can make to the implementation.)

1 Like

Aaaaand results. I've got to the end of the poor-man's approach and have pushed computation into the nodes – called Inner in my branch. Compression is a touch more aggressive than quantiles/master, which nets a win as well. The updated branch is at bc906d9f8581c74fae55481c4e6b8f90d1bcadeb.

Benchmarking bench_insert_65535
> Warming up for 3.0000 s
> Collecting 100 samples in estimated 461.24 s
test ckms::u16::bench_insert_65535 has been running for over 60 seconds
> Found 4 outliers among 99 measurements (4.04%)
  > 1 (1.01%) low mild
  > 3 (3.03%) high mild
> Performing linear regression
  >  slope [91.072 ms 91.250 ms]
  >    R^2  0.9972804 0.9972740
> Estimating the statistics of the sample
  >   mean [91.094 ms 91.341 ms]
  > median [91.054 ms 91.295 ms]
  >    MAD [380.69 us 593.37 us]
  >     SD [508.63 us 747.06 us]
bench_insert_65535: Comparing with previous sample
> Performing a two-sample t-test
  > H0: Both samples have the same mean
  > p = 0
  > Strong evidence to reject the null hypothesis
> Estimating relative change of statistics
  >   mean [-82.042% -81.499%]
  > median [-81.492% -81.412%]
  > mean has improved by 81.72%
  > median has improved by 81.45%

!!!

2 Likes

Neat, thanks! I've never properly read up on finger trees and I'll study them some.