In the Nomicon, I see the following paragraph. Can anyone explain, by reference to the std library source, how BTreeMap is written to be robust against Ord?
BTreeMap doesn't really make sense for partially-ordered types, and so it requires that its keys implement Ord. However, BTreeMap has Unsafe Rust code inside of its implementation. Because it would be unacceptable for a sloppy Ord implementation (which is Safe to write) to cause Undefined Behavior, the Unsafe code in BTreeMap must be written to be robust against Ord implementations which aren't actually total — even though that's the whole point of requiring Ord.
I find it somewhat surprising that it's even necessary to use unsafe code in implementing BTreeMap, and if it does use unsafe code, this seems a rather obscure detail.
But anyway, I think the point is that unsafe code should not cause UB (Undefined Behaviour) even if a library consumer provides an incorrect implementation of Ord, and the book is explaining this by referencing the implementation of BTreeMap.
Also, importantly, that incorrect but "safe" code shouldn't cause Undefined Behavior in any unsafe blocks.
E.g. a logically bad impl Ord for MyStruct { ... } which just uses rand to return a random Ordering value (involving no unsafe) should not be able to cause UB when the object is inserted into a BTreeMap (or used by any other code).
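To make that concrete, here is a minimal sketch of such a pathological key type. To stay dependency-free it flips an internal flag instead of calling rand, but the effect is the same: the comparisons form no total order. The type and its fields are made up for illustration.

```rust
use std::cell::Cell;
use std::cmp::Ordering;
use std::collections::BTreeMap;

// Hypothetical key whose Ord is deliberately inconsistent: each call to
// cmp() alternates between Less and Greater and never returns Equal,
// ignoring the actual value entirely. All of this is safe code.
struct BadKey {
    n: u32,
    flip: Cell<bool>,
}

impl PartialEq for BadKey {
    fn eq(&self, other: &BadKey) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}
impl Eq for BadKey {}
impl PartialOrd for BadKey {
    fn partial_cmp(&self, other: &BadKey) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl Ord for BadKey {
    fn cmp(&self, _other: &BadKey) -> Ordering {
        let f = self.flip.get();
        self.flip.set(!f);
        if f { Ordering::Less } else { Ordering::Greater }
    }
}

// Insert `n` such keys and return whatever length the map ends up with.
fn demo(n: u32) -> usize {
    let mut map = BTreeMap::new();
    for i in 0..n {
        map.insert(BadKey { n: i, flip: Cell::new(i % 2 == 0) }, i);
    }
    map.len()
}

fn main() {
    // The map's ordering is garbage and lookups may fail, but nothing
    // here can be Undefined Behavior: BadKey contains no unsafe code.
    println!("map reports len = {}", demo(100));
}
```

Running this, the map happily accepts every key (since cmp never returns Equal, no insert ever replaces an existing entry); later lookups would be unreliable, but the program stays memory-safe.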
The word "robust" might not be the right one here - as per the BTreeMap docs,
It is a logic error for a key to be modified in such a way that the key's ordering relative to any other key, as determined by the Ord trait, changes while it is in the map. [...] The behavior resulting from such a logic error is not specified, but will not result in undefined behavior. This could include panics, incorrect results, aborts, memory leaks, and non-termination.
So a bad Ord implementation might cause some very undesirable behavior, which I wouldn't consider "robust" in any way; it just does not cause "Undefined Behavior", which is a fairly specific set of problems.
The BTreeMap type is one of the more complicated data structures in std because it does a lot more than a simple binary search tree.
I'll probably butcher the explanation, so you might want to have a read of Rust Collections Case Study: BTreeMap instead. It's an in-depth discussion of how BTreeMap was implemented from one of the implementors and goes over why they needed unsafe.
I have skim-read that before (without really understanding it). It seems to come down to this:
"However insertion (and removal) requires us to potentially walk back up the search path to modify ancestors. Therefore, we need to maintain an explicit stack of visited nodes."
What I did in this circumstance is use a trick, like this: fn insert_leaf(&self, db: &DB, pnum: u64, r: &dyn Record, pi: Option<&ParentInfo>) -> bool
So I have a temporary linked list built using the call stack and references, which allows me to get information about the parent. My BTree is designed to be stored to backing storage, though, so it's a bit different. I am not claiming it's possible to implement a BTree efficiently in Rust without using unsafe, but I haven't yet understood why it is necessary. Maybe as an exercise I should have a go and see where the problem lies. [Edit: Hmm... I think I see: you would need interior mutability (RefCell), and maybe they wanted to avoid the cost of that.]
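A minimal sketch of that call-stack trick (names are hypothetical, not the actual insert_leaf signature above): each recursive call links a ParentInfo frame that borrows its caller's, so the bottom of the recursion gets a read-only view of all its ancestors with plain shared references, no parent pointers needed.

```rust
// Hypothetical sketch: a temporary linked list of visited "pages" built
// on the call stack. Each frame borrows its parent's frame, so the chain
// is valid exactly as long as the recursion is in flight.
struct ParentInfo<'a> {
    pnum: u64,
    parent: Option<&'a ParentInfo<'a>>,
}

// Walk "down" until pnum reaches 0 (the leaf), then count how many
// ancestor frames are visible by following the borrowed chain back up.
fn walk(pnum: u64, parent: Option<&ParentInfo>) -> usize {
    let info = ParentInfo { pnum, parent };
    if pnum == 0 {
        let mut count = 0;
        let mut cur = Some(&info);
        while let Some(p) = cur {
            count += 1;
            cur = p.parent;
        }
        count
    } else {
        walk(pnum - 1, Some(&info))
    }
}

fn main() {
    // "Pages" 3 -> 2 -> 1 -> leaf 0: the leaf can see 4 frames.
    println!("visible frames: {}", walk(3, None));
}
```

Note this gives the leaf *read* access to ancestors; mutating them after the child call returns is the part discussed further down the thread.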
Looking at the current implementation of BTreeMap, some points of difference are that they store keys and values in an inline array instead of using Vec, presumably to avoid the memory overhead and indirection, and leaf nodes have pointers to their parents so you can efficiently do things like merges.
Your linked list approach would work too, but you run into borrow checker issues when you want to mutate, because each intermediate node owns its child nodes. It would require interior mutability, and unless you want BTreeMap to be !Send, each node would need to be wrapped in a mutex.
There is also the philosophical argument that BTreeMap's &mut self methods already have unique ownership over the entire tree, so why does it still need interior mutability?
Besides the normal desire for std to be fast out of the box, data structures from the standard library are often used as cannon fodder for people wanting to compare language performance. So it probably made sense to use unsafe code if it allows you to write more efficient code.
Well, the first time you have to reallocate the Vec it's O(n) time, and it's not a BTree anymore, is it?
Also, either you never delete tree nodes and aren't O(n) space anymore, or you have to collect garbage at some point, shifting elements in the Vec, and you're back to O(n) time.
As n elements are inserted, the capacities form a geometric progression. Expanding the array by any constant proportion a ensures that inserting n elements takes O(n) time overall, meaning that each insertion takes amortized constant time.
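The quoted amortization is easy to observe directly. The growth factor is an implementation detail of Vec, so this sketch only asserts the rough shape: pushes vastly outnumber reallocations.

```rust
// Count how many times a Vec actually reallocates while pushing n
// elements. With geometric capacity growth the answer is O(log n),
// which is what makes each push amortized O(1) even though a single
// reallocation costs O(n).
fn count_reallocs(n: u32) -> u32 {
    let mut v: Vec<u32> = Vec::new();
    let mut cap = v.capacity();
    let mut reallocs = 0;
    for i in 0..n {
        v.push(i);
        if v.capacity() != cap {
            reallocs += 1;
            cap = v.capacity();
        }
    }
    reallocs
}

fn main() {
    println!("1024 pushes: {} reallocations", count_reallocs(1024));
}
```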
Maybe not, but you also lose some of the nice properties of BTrees: an operation in one part of the tree doesn't have to touch all of the tree's elements. It's a different data structure with different trade-offs. By having to resize the Vec you've blown away whatever CPU caches, or possibly even page cache, you had.
Also, compacting the Vec will shift all indices, necessitating rewriting "pointers" to nodes all throughout the tree.
Nah, you can implement it like a free-list allocator, where you have one massive Vec<Node<K, V>> and each Node either contains your keys and values, or is "unused" and just contains the index of the next unused node. That way you don't have to shuffle things down when elements or nodes get removed.
Insertions and deletions would become O(1), and you'll have a lot fewer reallocations because a Node contains a block of keys and values instead of a single pair.
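A rough sketch of that free-list scheme (toy types, nothing like std's actual layout): indices stay stable across removals, removal is O(1), and freed slots get reused by later insertions instead of being compacted away.

```rust
// One big Vec of slots: each slot is either an occupied value or a link
// to the next free slot, exactly as described above.
enum Slot<T> {
    Occupied(T),
    Free { next_free: Option<usize> },
}

struct Arena<T> {
    slots: Vec<Slot<T>>,
    free_head: Option<usize>,
}

impl<T> Arena<T> {
    fn new() -> Self {
        Arena { slots: Vec::new(), free_head: None }
    }

    // Amortized O(1): reuse the head of the free list if there is one,
    // otherwise push a fresh slot.
    fn insert(&mut self, value: T) -> usize {
        match self.free_head {
            Some(i) => {
                if let Slot::Free { next_free } = self.slots[i] {
                    self.free_head = next_free;
                }
                self.slots[i] = Slot::Occupied(value);
                i
            }
            None => {
                self.slots.push(Slot::Occupied(value));
                self.slots.len() - 1
            }
        }
    }

    // O(1): mark the slot free and thread it onto the free list. No other
    // element moves, so all outstanding indices remain valid.
    fn remove(&mut self, i: usize) -> Option<T> {
        let taken = std::mem::replace(
            &mut self.slots[i],
            Slot::Free { next_free: self.free_head },
        );
        match taken {
            Slot::Occupied(v) => {
                self.free_head = Some(i);
                Some(v)
            }
            free => {
                // Slot was already free: put the original link back.
                self.slots[i] = free;
                None
            }
        }
    }
}

fn main() {
    let mut arena = Arena::new();
    let a = arena.insert("a");
    let _b = arena.insert("b");
    arena.remove(a);
    let c = arena.insert("c"); // reuses the slot "a" vacated
    println!("c landed at index {}, same as a ({})", c, a);
}
```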
One of the main challenges with an index-based BTree is that it is really difficult to shrink its memory usage when the number of items becomes much smaller. It's essentially due to a form of "memory fragmentation" where a few of the large indexes are in use, and shrinking the capacity requires first moving nodes to smaller indexes.
One example of this can be seen in this PR to Tokio, where its DelayQueue type is another index-based data structure. The PR is about making it possible to shrink the queue even when some indexes are large.
(By the way, if you have some spare time, I would appreciate a second set of eyes on that PR — it's relatively large and difficult to review, though I think it's close to done at this point.)
One thing I think to appreciate about BTreeMap is its general-purpose usefulness. I think most use cases are going to be in the "very small" to "medium" number-of-elements range. It probably performs just as well as a sorted Vec (and is much more ergonomic) for the "very small" to "small" cases, but would start to shine in the "medium" case, where insertions will be much faster while keeping cache locality. And then it scales to the rarer "large" case as well without sacrificing time, space, or cacheability.
Thinking about this more, I am not convinced. I think it should be possible for the parent to perform mutations at the parent level after the child call returns, by having the child return appropriate information (about, say, a child page splitting).
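A toy two-level sketch of that idea (made-up types, nothing like std's actual implementation, and a real B-tree would recurse through arbitrary depth): the child call returns split information, and the parent mutates itself only after the call returns, so no back-pointers, node stack, or RefCell are needed.

```rust
// Maximum keys per leaf before it must split.
const MAX_LEAF: usize = 4;

struct Leaf {
    keys: Vec<u32>,
}

// What the child reports back up: either nothing happened, or it split
// and the parent must adopt a new right sibling with its separator key.
enum ChildOutcome {
    Ok,
    Split { sep: u32, right: Leaf },
}

// Insert into a leaf, splitting it if it overflows. The *only* channel
// back to the parent is the return value.
fn leaf_insert(leaf: &mut Leaf, key: u32) -> ChildOutcome {
    let pos = leaf.keys.partition_point(|&k| k < key);
    leaf.keys.insert(pos, key);
    if leaf.keys.len() > MAX_LEAF {
        let right_keys = leaf.keys.split_off(leaf.keys.len() / 2);
        let sep = right_keys[0];
        ChildOutcome::Split { sep, right: Leaf { keys: right_keys } }
    } else {
        ChildOutcome::Ok
    }
}

struct Root {
    seps: Vec<u32>,   // separator keys between adjacent leaves
    leaves: Vec<Leaf>,
}

impl Root {
    fn insert(&mut self, key: u32) {
        // Choose the child by separator, make the "downward" call...
        let i = self.seps.partition_point(|&s| s <= key);
        // ...and perform the parent-level mutation only after it returns.
        if let ChildOutcome::Split { sep, right } = leaf_insert(&mut self.leaves[i], key) {
            self.seps.insert(i, sep);
            self.leaves.insert(i + 1, right);
        }
    }
}

fn main() {
    let mut root = Root { seps: vec![], leaves: vec![Leaf { keys: vec![] }] };
    for k in [5, 1, 9, 3, 7, 2, 8, 4, 6, 0] {
        root.insert(k);
    }
    println!("{} leaves after 10 inserts", root.leaves.len());
}
```

The borrow checker is happy here because the `&mut self.leaves[i]` borrow ends before the parent touches `seps` and `leaves`; the trade-off versus std's parent pointers is that every path-dependent fix-up has to be expressible as a return value.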
It would be unusual for a misbehaving Ord to actually affect a data structure, because if it does, it means you must have made some redundant comparisons that gave conflicting answers. If you never perform any redundant comparisons, you're not going to be able to notice that the key ordering is inconsistent, so it's not going to affect you.
But perhaps in BTreeMap there are some such redundant comparisons, and then it has to make sure it can handle them.
One case where I have seen this being relevant is quicksort combined with an insertion-sort phase at the end. After you partition your array around a quicksort pivot, you can use the pivot as a sentinel for insertion sort: you skip the array bounds checks and do unsafe indexing, knowing that you're going to hit a larger/smaller pivot before you run off the end of the array. This relies on comparisons with the pivot giving the same result as they did the first time; otherwise you're going to do an out-of-bounds array access.
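A simplified sketch of that pattern (not any particular library's code): the partition invariant is what lets the inner loop drop its lower bounds check, and that is exactly the reasoning an inconsistent Ord would invalidate. With u32's consistent Ord and a correctly partitioned input, this is sound.

```rust
/// Insertion-sort `v[pivot_idx + 1..]`, using the pivot as a sentinel.
/// Precondition: the slice is partitioned around v[pivot_idx], i.e. every
/// element at index <= pivot_idx is <= every element after it.
fn sentinel_insertion_sort(v: &mut [u32], pivot_idx: usize) {
    for i in (pivot_idx + 1)..v.len() {
        let mut j = i;
        // SAFETY: the element being sifted down came from right of the
        // pivot, so the comparison below must fail no later than when
        // j - 1 == pivot_idx; hence no `j > 0` check is needed. With an
        // inconsistent Ord that argument collapses, j - 1 could underflow,
        // and the unchecked accesses would go out of bounds: UB.
        unsafe {
            while v.get_unchecked(j - 1) > v.get_unchecked(j) {
                v.swap(j - 1, j);
                j -= 1;
            }
        }
    }
}

fn main() {
    // Already partitioned around index 2 (value 2): left <= 2 <= right.
    let mut v = [0, 1, 2, 5, 3, 4];
    sentinel_insertion_sort(&mut v, 2);
    println!("{:?}", v);
}
```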
It doesn't even need to be "redundant" comparisons.
Imagine a BTreeMap into which you insert pairs whose keys are 1, then 3, then 2. When I insert the 3 I need to check 1 <= 3, and similarly when I insert 2 I need to check 1 <= 2. There are no redundant comparisons here, but you could easily mess things up and get [3, 1, 2] if the cmp() method on integers returned a random value each time it was called.
That wouldn't be "messing things up", that would be the correct behavior if the comparison function on your type said 3 < 1 and then 1 < 2. This is all as expected. It's a correct total order on these three values. No problem, no undefined behavior, everything is fine. You get [3, 1, 2], correctly.
The risk of undefined behavior would appear if now the BTreeMap for some reason made a redundant comparison such as comparing 3 with 2, got an inconsistent result like 3 > 2, and proceeded to somehow mess up its internal pointers or something that would lead to memory corruption and undefined behavior.
But it probably simply doesn't do that, and so it remains safe.