I'm working on a tagger tool, which helps sorting data by interacting with human. So I need making comparisons as few as possible. And BTreeSet from std is suitable (or manual-building matrix?).
#[derive(...)]
struct Data { ... }
impl Ord for Data {
fn cmp(&self, other: &Self) -> Ordering {
// manually compare by user
}
}
datas.into_iter().enumerate().for_each(|(i, d)| {
btree.insert(d);
if i & 0b111 == 0 { bincode::serialize (&btree) }
// Cache it every 8 datas inserted
}
)
So, on the disk is always correct BTree.
Then I try loading it back
bincode::deserialize...(File::open(...))...
Unluckily, It seems that serde is trying constructing it from scratch again (i.e. Ord::cmp is called just like BTreeSet::from([T;N]) ).
To avoid this, the only way I know is to treat it as a Memory Layout with std::mem::transmuteing it to BTreeSet.
Here's an example for deserializing `BinaryHeap<T>` which is in the same Layout of `Vec<T>`:
let mut btree: BinaryHeap<OrdPath> = unsafe {
std::mem::transmute(bincode_from::<Vec<OrdPath>>(&cache).unwrap_or_default())
};
However, BTreeSet's inner fields are too completed to manage to do so .
How can I solve this or other data structure or algorithm to use?
Nah, wrong. That's absolutely not how you should do anything, ever. And I mean ever.
Your BinaryHeap example is horrendously wrong. I can't even express how bad it is.
No, it doesn't. It may break in the future. Default-repr structs have no layout guarantees. They never had. Don't transmute them. At all. Not now, and not in the future. And certainly don't pretend that it's safe or correct or sound to do so. Because it is not.
Well, BTreeSet::from() uses sort() internally, and then does a single-pass insertion, so that's not a whole lot of overhead (sorting an already sorted array is very fast). Why don't you just use it? Did you measure that it represents a meaningful bottleneck in your application? (Don't even answer this one. I know you didn't.)
You can't transmute other collections to BTreeSet, but you can transmute BTreeSet<T> to BTreeSet<U> if T and U are transmutable. This will let you replace the Ord impl with something else.
Unfortunately, what you replace it with is difficult to determine. Building a BTreeSet requires all elements are inequal, and it's not a stable guarantee which element is compared to which when constructing the set.
It looks like always returning Greater does what it's supposed to do right now. This is moderately fragile. If you need serious guarantees, make your own set type.
#[derive(Debug, Clone, Serialize, PartialEq, Eq, PartialOrd, Ord, Default, Hash)]
#[serde(bound = "T: Ord + Serialize")]
#[repr(transparent)]
pub struct BTreeSetNoCompare<T>(pub BTreeSet<T>);
#[derive(Debug, Serialize, Deserialize)]
#[repr(transparent)]
struct NoCompare<T>(T);
impl<T> PartialEq for NoCompare<T> {
fn eq(&self, _other: &Self) -> bool {
false
}
}
impl<T> Eq for NoCompare<T> {}
impl<T> Ord for NoCompare<T> {
fn cmp(&self, _other: &Self) -> std::cmp::Ordering {
std::cmp::Ordering::Greater
}
}
impl<T> PartialOrd for NoCompare<T> {
fn partial_cmp(&self, other: &Self) -> Option<std::cmp::Ordering> {
Some(self.cmp(other))
}
}
impl<'de, T> Deserialize<'de> for BTreeSetNoCompare<T>
where
T: Deserialize<'de>,
{
fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
where
D: serde::Deserializer<'de>,
{
let vec = Vec::deserialize(deserializer)?;
let set: BTreeSet<NoCompare<T>> = BTreeSet::from_iter(vec);
// SAFETY: `NoCompare<T>` is `repr(transparet)` for `T`
let set: BTreeSet<T> = unsafe { core::mem::transmute(set) };
Ok(Self(set))
}
}
#[cfg(test)]
#[test]
fn check_correctness() {
let soruce_set: BTreeSet<i32> = (0..100).collect();
let data = serde_json::to_string(&soruce_set).unwrap();
let std_set: BTreeSet<i32> = serde_json::from_str(&data).unwrap();
let no_compare_set: BTreeSetNoCompare<i32> = serde_json::from_str(&data).unwrap();
assert_eq!(soruce_set, std_set);
assert_eq!(soruce_set, no_compare_set.0);
}
Note: this goes through Vec because it's faster. Serde's impl for BTreeSet should be going through Vec but doesn't. It also shouldn't require Ord for Serialize.
BTreeSet is not good at minimizing the number of comparisons. The whole point of BTreeSet is to be faster than more narrowly branching tree structures, by means of doing linear scans through relatively large nodes! [It looks like the current implementation branches on each node by searching through a list of 6 to 11 elements.] It balances a higher number of assumed-to-be-cheap comparisons against saving in number of allocations, branching, and cache misses.
If your comparison operation is “ask a human”, a BTreeSet is not what you want, at all!
On the scale of “I have so few items that a user can manually do all the comparisons”, you can just use a sorted Vec<T>, and insert stuff with a binary search. That should successfully minimize the number of comparisons.
(If you aren’t doing inserts & lookups, but sort a heap of data at bulk, then there’s even less point in using a map structure. I’m not sure which sorting algorithms miminize number of comparisons off the top of my head.)
I'm working on a tagger tool, whose sorting or comparing ops are all manual i.e (human input). And I want to show user the number of the remained comparisons and the sorted data. Finally, the sort() in From may lead user violate total order...
Now my solution is to cache and re-play the user input instead of caching the BTreeSet.
But due to human-input, this way may violate total order. (i.e a > b, b > c, c > a)
And sort_by(closure) is hard to work with channel (needed to dispatch Event, or too many loops) and multi-threads: The main TUI loop may exit, and I cannot handle it gracefully in a closure which is FnXX() -> Ordering.
BinaryHeap (X): the std impl can only pop one by one, which performs comparing to keep its character.
B-Tree/B+Tree (X?): total order cannot be guaranteed in each node (Maybe. I'll check. Edit: This line is not True. Total order would be guaranteed if insert-sort used in each node. std code
Binary Tree (X): may unbalanced
The final solution is AVL Tree or Red-Black Tree (Maybe total order can still not able to be guaranteed for keeping balance, I'll check their details. Edit: it’s guaranteed). However, there is not popular crates about them, nor about Binary Tree.
And another is to build comparing matrix, which is quite complex for me.
Editing: After reviewing those trees, I think that B+Tree is exactly what I need, and B-Tree is also Ok.
The number of comparison is O(nlogn).
And I'd better to cache and re-play the user-input instead of caching the tree itself.
Even further, maybe I can infer relation by the user-input, and prevent total-order-invalid input.
No, a B-Tree or B+Tree guarantees total order in each node. If anything, the user cannot guarantee a total order.
The problem you might have with a B-Tree or B+Tree is that they perform a linear search over the internal nodes' children when performing lookup or insertions. The reason being that for small lengths, a linear scan is more efficient due to how CPUs work. However since you want to actually minimize comparison (and not optimize for CPU speed), this won't be ideal for you.
Btw, the stdlib implements a B-Tree. A B+Tree is a slightly different data structure.