How to deserialize BTreeSet without re-constructing?

I'm working on a tagger tool, which helps sorting data by interacting with human. So I need making comparisons as few as possible. And BTreeSet from std is suitable (or manual-building matrix?).

#[derive(...)]
struct Data { ... }

impl Ord for Data {
    fn cmp(&self, other: &Self) -> Ordering {
        // manually compare by user
    }
}

datas.into_iter().enumerate().for_each(|(i, d)| {
        btree.insert(d);
        if i & 0b111 == 0 { bincode::serialize (&btree) }
        // Cache it every 8 datas inserted
    }
)

So, on the disk is always correct BTree.


Then I try loading it back

bincode::deserialize...(File::open(...))...

Unluckily, It seems that serde is trying constructing it from scratch again (i.e. Ord::cmp is called just like BTreeSet::from([T;N]) ).

To avoid this, the only way I know is to treat it as a Memory Layout with std::mem::transmuteing it to BTreeSet.

Here's an example for deserializing `BinaryHeap<T>` which is in the same Layout of `Vec<T>`:
let mut btree: BinaryHeap<OrdPath> = unsafe {
     std::mem::transmute(bincode_from::<Vec<OrdPath>>(&cache).unwrap_or_default())
};

However, BTreeSet's inner fields are too completed to manage to do so :sob: .

How can I solve this or other data structure or algorithm to use?

Maybe also caching the human input and replaying the comparison can be one solution...

// manually compare by user

This is likely going to violate the invariants of Ord, since a human could easily say things like:

  • red is larger than green
  • green is larger than blue
  • blue is larger than red
1 Like

Yes.

But with Btree(X binary tree), the data can be divide again and again.

if user want to say:

  • a > b
  • b > c
  • c > a

The fact is, after the first two comparison:

   b
c     a

we can know c < a, user has no chance to say c > a. (i.e c will never be compared with a)

It works the same like building a matrix.

After inserting a and b:

    a  b  c
a   0  1
b  -1  0  1
c     -1  0

when we insert b, we can simply guess the relation between a and c. And inserting c is simply skipped.

Nah, wrong. That's absolutely not how you should do anything, ever. And I mean ever.

Your BinaryHeap example is horrendously wrong. I can't even express how bad it is.

No, it doesn't. It may break in the future. Default-repr structs have no layout guarantees. They never had. Don't transmute them. At all. Not now, and not in the future. And certainly don't pretend that it's safe or correct or sound to do so. Because it is not.

Well, BTreeSet::from() uses sort() internally, and then does a single-pass insertion, so that's not a whole lot of overhead (sorting an already sorted array is very fast). Why don't you just use it? Did you measure that it represents a meaningful bottleneck in your application? (Don't even answer this one. I know you didn't.)

2 Likes

You can't transmute other collections to BTreeSet, but you can transmute BTreeSet<T> to BTreeSet<U> if T and U are transmutable. This will let you replace the Ord impl with something else.

Unfortunately, what you replace it with is difficult to determine. Building a BTreeSet requires all elements are inequal, and it's not a stable guarantee which element is compared to which when constructing the set.

It looks like always returning Greater does what it's supposed to do right now. This is moderately fragile. If you need serious guarantees, make your own set type.

#[derive(Debug, Clone, Serialize, PartialEq, Eq, PartialOrd, Ord, Default, Hash)]
#[serde(bound = "T: Ord + Serialize")]
#[repr(transparent)]
pub struct BTreeSetNoCompare<T>(pub BTreeSet<T>);

#[derive(Debug, Serialize, Deserialize)]
#[repr(transparent)]
struct NoCompare<T>(T);

impl<T> PartialEq for NoCompare<T> {
    fn eq(&self, _other: &Self) -> bool {
        false
    }
}

impl<T> Eq for NoCompare<T> {}

impl<T> Ord for NoCompare<T> {
    fn cmp(&self, _other: &Self) -> std::cmp::Ordering {
        std::cmp::Ordering::Greater
    }
}

impl<T> PartialOrd for NoCompare<T> {
    fn partial_cmp(&self, other: &Self) -> Option<std::cmp::Ordering> {
        Some(self.cmp(other))
    }
}

impl<'de, T> Deserialize<'de> for BTreeSetNoCompare<T>
where
    T: Deserialize<'de>,
{
    fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
    where
        D: serde::Deserializer<'de>,
    {
        let vec = Vec::deserialize(deserializer)?;
        let set: BTreeSet<NoCompare<T>> = BTreeSet::from_iter(vec);
        // SAFETY: `NoCompare<T>` is `repr(transparet)` for `T`
        let set: BTreeSet<T> = unsafe { core::mem::transmute(set) };
        Ok(Self(set))
    }
}

#[cfg(test)]
#[test]
fn check_correctness() {
    let soruce_set: BTreeSet<i32> = (0..100).collect();
    let data = serde_json::to_string(&soruce_set).unwrap();
    let std_set: BTreeSet<i32> = serde_json::from_str(&data).unwrap();
    let no_compare_set: BTreeSetNoCompare<i32> = serde_json::from_str(&data).unwrap();
    assert_eq!(soruce_set, std_set);
    assert_eq!(soruce_set, no_compare_set.0);
}

Note: this goes through Vec because it's faster. Serde's impl for BTreeSet should be going through Vec but doesn't. It also shouldn't require Ord for Serialize.

1 Like

BTreeSet is not good at minimizing the number of comparisons. The whole point of BTreeSet is to be faster than more narrowly branching tree structures, by means of doing linear scans through relatively large nodes! [It looks like the current implementation branches on each node by searching through a list of 6 to 11 elements.] It balances a higher number of assumed-to-be-cheap comparisons against saving in number of allocations, branching, and cache misses.

If your comparison operation is “ask a human”, a BTreeSet is not what you want, at all!

3 Likes

No, there's no guarantee that the compiler gives those the same layout either.

7 Likes

On the scale of “I have so few items that a user can manually do all the comparisons”, you can just use a sorted Vec<T>, and insert stuff with a binary search. That should successfully minimize the number of comparisons.

(If you aren’t doing inserts & lookups, but sort a heap of data at bulk, then there’s even less point in using a map structure. I’m not sure which sorting algorithms miminize number of comparisons off the top of my head.)

4 Likes

Thanks a lot.

Now I know #[repr(transparent)] is needed :face_holding_back_tears:

I'm working on a tagger tool, whose sorting or comparing ops are all manual :sob: i.e (human input). And I want to show user the number of the remained comparisons and the sorted data. Finally, the sort() in From may lead user violate total order...

Now my solution is to cache and re-play the user input instead of caching the BTreeSet.

Yes, this is my first attempt.

But due to human-input, this way may violate total order. (i.e a > b, b > c, c > a)

And sort_by(closure) is hard to work with channel (needed to dispatch Event, or too many loops) and multi-threads: The main TUI loop may exit, and I cannot handle it gracefully in a closure which is FnXX() -> Ordering.

How are you using BTreeSet that those two issues aren’t causing the exact same problems?

3 Likes

I just switched from BinaryHeap to BTreeSet, and brain in a mess...
I mixed Btree and BinaryTree :sob:

I'll switch to binary tree...

A, yes, B-tree:

From Wikipedia, the free encyclopedia

        Not to be confused with Binary tree or B+ tree.

1 Like

To draw a conclusion,

  • BinaryHeap (X): the std impl can only pop one by one, which performs comparing to keep its character.
  • B-Tree/B+Tree (X?): total order cannot be guaranteed in each node (Maybe. I'll check. Edit: This line is not True. Total order would be guaranteed if insert-sort used in each node. std code
  • Binary Tree (X): may unbalanced

The final solution is AVL Tree or Red-Black Tree (Maybe total order can still not able to be guaranteed for keeping balance, I'll check their details. Edit: it’s guaranteed). However, there is not popular crates about them, nor about Binary Tree.

And another is to build comparing matrix, which is quite complex for me.

Editing: After reviewing those trees, I think that B+Tree is exactly what I need, and B-Tree is also Ok.


The number of comparison is O(nlogn).

And I'd better to cache and re-play the user-input instead of caching the tree itself.

Even further, maybe I can infer relation by the user-input, and prevent total-order-invalid input.


Thank you everyone! :pray:

No, a B-Tree or B+Tree guarantees total order in each node. If anything, the user cannot guarantee a total order.

The problem you might have with a B-Tree or B+Tree is that they perform a linear search over the internal nodes' children when performing lookup or insertions. The reason being that for small lengths, a linear scan is more efficient due to how CPUs work. However since you want to actually minimize comparison (and not optimize for CPU speed), this won't be ideal for you.

Btw, the stdlib implements a B-Tree. A B+Tree is a slightly different data structure.

2 Likes

If one node already has abc, and x comes to it.

x is compared with abc one by one, and it seems true that total order guaranteed.

Maybe it's right... I'll check.

Edit: if it is insert-sort used, it’s right. And it is