BTree-like data structure that allows repeated values

Hello guys,

I'm looking for an implementation of a data structure that allows for fast:

  • find insertion point that preserves ordering
  • get element after that point
  • insertion of a value at that point
  • sorted iteration

I thought about the BTreeSet, however the catch here is that my data can have repeated values (as in Ordering::Equal) and I want to keep all occurences. So maybe I'm looking for BTree?

Searching around, I've found this question that asks for BTree but no mention of allowing for repeated values.

Sorry if the question is dumb in its essence. I'll add more context that may help understanding:

I'm implementing a space-efficient quantile algorithm that allows to get an approximation of a desired quantile (like median) of a huge Iterator (inspired by Spark's approx_percentile). The core of the algorithm is to keep an ordered list of some sampled values (around with some metadata) and update it as we go: adding new samples (insert_one), dropping old samples (compress).

Here is some pseudo-rust code that highlights the uses of the data structure I'm searching for (SomeNiceDataStructure):

// Relevant structs

struct Sample<T: Ord> {
    value: T,
    g: u64,
    delta: u64
}

impl<T: Ord> Ord for Sample<T> { /* delegates to self.value */ }

pub struct Summary<T: Ord> {
    samples: SomeNiceDataStructure<Sample<T>>
}

// Relevant methods

impl<T: Ord> Summary<T> {
    pub fn insert_one(&mut self, value: T) {
        let insert_before = self.samples.insertion_point(value);
        if self.should_insert(value, insert_before) {
            self.samples.insert_before(value, insert_before);
        }
    }

    pub fn compress(&mut self) {
        let mut new_samples = SomeNiceDataStructure::new();
        for sample in self.samples {
            if self.should_keep(sample) {
                new_samples.push(sample); // we know insertion is ordered here
            }
        }
        self.samples = new_samples;
    }
}

Do you know what I should use for this? Either inside the std lib or some external crate. Maybe my requirements are too specific and I may have to roll my own? (I'd strongly prefer not to, I'm not versed well enough in Rust for that :stuck_out_tongue:)

Any suggestion is welcomed!
Thanks a lot

You can emulate MultiSet<T> as Set<(T, usize)> where the second elements are unique numbers such as indices of the data array. And searching range [x..y) corresponds to [(x, MIN), (y, MIN)) etc.
However, I'm not sure BST is needed for your problem. Spark's implementation maintains arrays for samples and recent elements, then periodically merging and compressing two arrays, so the amortized time complexity is only O(1) + (sorting of size K) / K where K is the size of sample buffer. Typically, sorting is much faster than BST operations while having the same log n term.
If you don't want to tune speed/space/accuracy and just need a very simple algorithm, you could just use random sampling and call it a day.

Thanks for the nice advice @pcpthm!

The Set<(T, usize)> seems a nice solution with an acceptable overhead I believe. I'll just have to pay attention to the (not previously explained, sorry!) merge() operation that takes two sampled structures created from different iterators. If the usize starts from 0 every time, there may still be some colliding keys.

However, I tend to agree with you that this is a non-issue from the start. I plan to implement a kind of BufferWriter that will collect incoming values into a temporary vector and when full, sort and merge back into the main structure (like in Sparks' code). In fact, this will be the main use-case of my structure. I was just looking for ways to improve the insert_one()'s perf, but without affecting the batch insertion.

I'll probably just use a Vec<T> and do some binary search in the (rare) calls of insert_one().

About the choice of the algorithm, this is in fact a toy implementation to get going with Rust :slight_smile: (very pleasing experience for now hahha). I've just chosen a nice problem with an interesting solution

(side note: I believe to had found a bug in Spark's merge() implementation. It seems the desired minimum accuracy is not respected. I have to confirm it yet...)

1 Like

Couldn't you also represent a multiset as a Map<T,usize> where the usize represents the number of copies of the element that are present?

2 Likes

I though doing something like you've suggested @droundy, however I could not see how to keep my meta-data around in a nice way. See, I need to wrap each sampled value in Sample { value: T, g: u64, delta: u64 }.

The straight-forward answer is Map<T, Entry> with struct Entry { gs_and_deltas: Vec<(u64, u64)> }. However the Vec does not inspire me much...

Am I missing some better way to express it?

Yeah, that's definitely a problem with my idea.