In the Nomicon, I see the following paragraph. Can anyone explain, by reference to the std library source, how BTreeMap is written to be robust against Ord?
BTreeMap doesn't really make sense for partially-ordered types, and so it requires that its keys implement Ord. However, BTreeMap has Unsafe Rust code inside of its implementation. Because it would be unacceptable for a sloppy Ord implementation (which is Safe to write) to cause Undefined Behavior, the Unsafe code in BTreeMap must be written to be robust against Ord implementations which aren't actually total — even though that's the whole point of requiring Ord.
I find it somewhat surprising that it's even necessary to use unsafe code in implementing BTreeMap, and if it does use unsafe code, this seems a rather obscure detail.
But anyway, I think the point is that unsafe code should not cause UB (Undefined Behaviour) even if a library consumer provides an incorrect implementation of Ord, and the book is explaining this by referencing the implementation of BTreeMap.
Also, importantly, that incorrect but "safe" code shouldn't cause Undefined Behavior in any unsafe blocks.
E.g. a logically bad impl Ord for MyStruct { ... } which just uses rand to return a random Ordering value (involving no unsafe) should not be able to cause UB when the object is inserted into a BTreeMap (or used by any other code).
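To make that concrete, here is a minimal sketch of such a pathological key type. To stay dependency-free it flips an internal flag instead of calling rand, but the effect is the same: the comparisons form no total order. The type and its fields are made up for illustration.

```rust
use std::cell::Cell;
use std::cmp::Ordering;
use std::collections::BTreeMap;

// Hypothetical key whose Ord is deliberately inconsistent: each call to
// cmp() alternates between Less and Greater and never returns Equal,
// ignoring the actual value entirely. All of this is safe code.
struct BadKey {
    n: u32,
    flip: Cell<bool>,
}

impl PartialEq for BadKey {
    fn eq(&self, other: &BadKey) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}
impl Eq for BadKey {}
impl PartialOrd for BadKey {
    fn partial_cmp(&self, other: &BadKey) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl Ord for BadKey {
    fn cmp(&self, _other: &BadKey) -> Ordering {
        let f = self.flip.get();
        self.flip.set(!f);
        if f { Ordering::Less } else { Ordering::Greater }
    }
}

// Insert `n` such keys and return whatever length the map ends up with.
fn demo(n: u32) -> usize {
    let mut map = BTreeMap::new();
    for i in 0..n {
        map.insert(BadKey { n: i, flip: Cell::new(i % 2 == 0) }, i);
    }
    map.len()
}

fn main() {
    // The map's ordering is garbage and lookups may fail, but nothing
    // here can be Undefined Behavior: BadKey contains no unsafe code.
    println!("map reports len = {}", demo(100));
}
```

Running this, the map happily accepts every key (since cmp never returns Equal, no insert ever replaces an existing entry); later lookups would be unreliable, but the program stays memory-safe.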
The word "robust" might not be the right one here - as per the BTreeMap docs,
It is a logic error for a key to be modified in such a way that the key's ordering relative to any other key, as determined by the Ord trait, changes while it is in the map. [...] The behavior resulting from such a logic error is not specified, but will not result in undefined behavior. This could include panics, incorrect results, aborts, memory leaks, and non-termination.
So a bad Ord implementation might cause some very undesirable behavior, which I wouldn't consider "robust" in any way; it just does not cause "Undefined Behavior", which is a fairly specific set of problems.
The BTreeMap type is one of the more complicated data structures in std because it does a lot more than a simple binary search tree.
I'll probably butcher the explanation, so you might want to have a read of Rust Collections Case Study: BTreeMap instead. It's an in-depth discussion of how BTreeMap was implemented from one of the implementors and goes over why they needed unsafe.
I have skim-read that before (without really understanding it). It seems to come down to this:
"However insertion (and removal) requires us to potentially walk back up the search path to modify ancestors. Therefore, we need to maintain an explicit stack of visited nodes."
What I did in this circumstance is use a trick, like this: fn insert_leaf(&self, db: &DB, pnum: u64, r: &dyn Record, pi: Option<&ParentInfo>) -> bool
So I have a temporary linked list built using the call stack and references, which allows me to get information about the parent. My BTree is designed to be stored to backing storage, though, so it's a bit different. I am not claiming it's possible to implement a BTree efficiently in Rust without using unsafe, but I haven't yet understood why it is necessary. Maybe as an exercise I should have a go and see where the problem lies. [Edit: Hmm... I think I see: you would need interior mutability (RefCell), and maybe they wanted to avoid the cost of that.]
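A minimal sketch of that call-stack trick (names are hypothetical, not the actual insert_leaf signature above): each recursive call links a ParentInfo frame that borrows its caller's, so the bottom of the recursion gets a read-only view of all its ancestors with plain shared references, no parent pointers needed.

```rust
// Hypothetical sketch: a temporary linked list of visited "pages" built
// on the call stack. Each frame borrows its parent's frame, so the chain
// is valid exactly as long as the recursion is in flight.
struct ParentInfo<'a> {
    pnum: u64,
    parent: Option<&'a ParentInfo<'a>>,
}

// Walk "down" until pnum reaches 0 (the leaf), then count how many
// ancestor frames are visible by following the borrowed chain back up.
fn walk(pnum: u64, parent: Option<&ParentInfo>) -> usize {
    let info = ParentInfo { pnum, parent };
    if pnum == 0 {
        let mut count = 0;
        let mut cur = Some(&info);
        while let Some(p) = cur {
            count += 1;
            cur = p.parent;
        }
        count
    } else {
        walk(pnum - 1, Some(&info))
    }
}

fn main() {
    // "Pages" 3 -> 2 -> 1 -> leaf 0: the leaf can see 4 frames.
    println!("visible frames: {}", walk(3, None));
}
```

Note this gives the leaf *read* access to ancestors; mutating them after the child call returns is the part discussed further down the thread.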
Looking at the current implementation of BTreeMap, some points of difference are that they store keys and values in an inline array instead of using Vec, presumably to avoid the memory overhead and indirection, and leaf nodes have pointers to their parents so you can efficiently do things like merges.
Your linked list approach would work too, but you run into borrow checker issues when you want to mutate, because each intermediate node owns its child nodes. It would require interior mutability, and unless you want BTreeMap to be !Send, each node would need to be wrapped in a mutex.
There is also the philosophical argument that BTreeMap's &mut self methods already have unique ownership over the entire tree, so why does it still need interior mutability?
Besides the normal desire for std to be fast out of the box, data structures from the standard library are often used as cannon fodder for people wanting to compare language performance. So it probably made sense to use unsafe code if it allows you to write more efficient code.
Well, the first time you have to reallocate the Vec it's O(n) time, and it's not a BTree anymore, is it?
Also, either you never delete tree nodes and aren't O(n) space anymore, or you have to collect garbage at some point, shifting elements in the Vec, and you're back to O(n) time.
As n elements are inserted, the capacities form a geometric progression. Expanding the array by any constant proportion a ensures that inserting n elements takes O(n) time overall, meaning that each insertion takes amortized constant time.
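The quoted amortization is easy to observe directly. The growth factor is an implementation detail of Vec, so this sketch only asserts the rough shape: pushes vastly outnumber reallocations.

```rust
// Count how many times a Vec actually reallocates while pushing n
// elements. With geometric capacity growth the answer is O(log n),
// which is what makes each push amortized O(1) even though a single
// reallocation costs O(n).
fn count_reallocs(n: u32) -> u32 {
    let mut v: Vec<u32> = Vec::new();
    let mut cap = v.capacity();
    let mut reallocs = 0;
    for i in 0..n {
        v.push(i);
        if v.capacity() != cap {
            reallocs += 1;
            cap = v.capacity();
        }
    }
    reallocs
}

fn main() {
    println!("1024 pushes: {} reallocations", count_reallocs(1024));
}
```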
Maybe not, but you also lose some of the nice properties of BTrees: an operation in one part of the tree doesn't have to touch all of the tree's elements. It's a different data structure with different trade-offs. By having to resize the Vec you've blown away whatever CPU caches, or possibly even page cache, you had.
Also, compacting the Vec will shift all indices, necessitating rewriting "pointers" to nodes all throughout the tree.
Nah, you can implement it like a free-list allocator, where you have one massive Vec<Node<K, V>> and each Node either contains your keys and values, or is "unused" and just contains the index of the next unused node. That way you don't have to shuffle things down when elements or nodes get removed.
Insertions and deletions would become O(1), and you'll have a lot fewer reallocations because a Node contains a block of keys and values instead of a single pair.
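A rough sketch of that free-list scheme (toy types, nothing like std's actual layout): indices stay stable across removals, removal is O(1), and freed slots get reused by later insertions instead of being compacted away.

```rust
// One big Vec of slots: each slot is either an occupied value or a link
// to the next free slot, exactly as described above.
enum Slot<T> {
    Occupied(T),
    Free { next_free: Option<usize> },
}

struct Arena<T> {
    slots: Vec<Slot<T>>,
    free_head: Option<usize>,
}

impl<T> Arena<T> {
    fn new() -> Self {
        Arena { slots: Vec::new(), free_head: None }
    }

    // Amortized O(1): reuse the head of the free list if there is one,
    // otherwise push a fresh slot.
    fn insert(&mut self, value: T) -> usize {
        match self.free_head {
            Some(i) => {
                if let Slot::Free { next_free } = self.slots[i] {
                    self.free_head = next_free;
                }
                self.slots[i] = Slot::Occupied(value);
                i
            }
            None => {
                self.slots.push(Slot::Occupied(value));
                self.slots.len() - 1
            }
        }
    }

    // O(1): mark the slot free and thread it onto the free list. No other
    // element moves, so all outstanding indices remain valid.
    fn remove(&mut self, i: usize) -> Option<T> {
        let taken = std::mem::replace(
            &mut self.slots[i],
            Slot::Free { next_free: self.free_head },
        );
        match taken {
            Slot::Occupied(v) => {
                self.free_head = Some(i);
                Some(v)
            }
            free => {
                // Slot was already free: put the original link back.
                self.slots[i] = free;
                None
            }
        }
    }
}

fn main() {
    let mut arena = Arena::new();
    let a = arena.insert("a");
    let _b = arena.insert("b");
    arena.remove(a);
    let c = arena.insert("c"); // reuses the slot "a" vacated
    println!("c landed at index {}, same as a ({})", c, a);
}
```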
One of the main challenges with an index-based BTree is that it is really difficult to shrink its memory usage when the number of items becomes much smaller. It's essentially due to a form of "memory fragmentation" where a few of the large indexes are in use, and shrinking the capacity requires first moving nodes to smaller indexes.
One example of this can be seen in this PR to Tokio, where its DelayQueue type is another index-based data structure. The PR is about making it possible to shrink the queue even when some indexes are large.
(By the way, if you have some spare time, I would appreciate a second set of eyes on that PR — it's relatively large and difficult to review, though I think it's close to done at this point.)
One thing I think to appreciate about BTreeMap is its general-purpose usefulness. I think most use cases are going to be in the "very small" to "medium" number-of-elements range. It probably performs just as well as a sorted Vec (and is much more ergonomic) for the "very small" to "small" cases, but would start to shine in the "medium" case, where insertions will be much faster while keeping cache locality. And then it scales to the rarer "large" case as well without sacrificing time, space, or cacheability.
Thinking about this more, I am not convinced. I think it should be possible for the parent to perform mutations at the parent level after the child call returns, by having the child return appropriate information (about, say, a child page splitting).
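A toy two-level sketch of that idea (made-up types, nothing like std's actual implementation, and a real B-tree would recurse through arbitrary depth): the child call returns split information, and the parent mutates itself only after the call returns, so no back-pointers, node stack, or RefCell are needed.

```rust
// Maximum keys per leaf before it must split.
const MAX_LEAF: usize = 4;

struct Leaf {
    keys: Vec<u32>,
}

// What the child reports back up: either nothing happened, or it split
// and the parent must adopt a new right sibling with its separator key.
enum ChildOutcome {
    Ok,
    Split { sep: u32, right: Leaf },
}

// Insert into a leaf, splitting it if it overflows. The *only* channel
// back to the parent is the return value.
fn leaf_insert(leaf: &mut Leaf, key: u32) -> ChildOutcome {
    let pos = leaf.keys.partition_point(|&k| k < key);
    leaf.keys.insert(pos, key);
    if leaf.keys.len() > MAX_LEAF {
        let right_keys = leaf.keys.split_off(leaf.keys.len() / 2);
        let sep = right_keys[0];
        ChildOutcome::Split { sep, right: Leaf { keys: right_keys } }
    } else {
        ChildOutcome::Ok
    }
}

struct Root {
    seps: Vec<u32>,   // separator keys between adjacent leaves
    leaves: Vec<Leaf>,
}

impl Root {
    fn insert(&mut self, key: u32) {
        // Choose the child by separator, make the "downward" call...
        let i = self.seps.partition_point(|&s| s <= key);
        // ...and perform the parent-level mutation only after it returns.
        if let ChildOutcome::Split { sep, right } = leaf_insert(&mut self.leaves[i], key) {
            self.seps.insert(i, sep);
            self.leaves.insert(i + 1, right);
        }
    }
}

fn main() {
    let mut root = Root { seps: vec![], leaves: vec![Leaf { keys: vec![] }] };
    for k in [5, 1, 9, 3, 7, 2, 8, 4, 6, 0] {
        root.insert(k);
    }
    println!("{} leaves after 10 inserts", root.leaves.len());
}
```

The borrow checker is happy here because the `&mut self.leaves[i]` borrow ends before the parent touches `seps` and `leaves`; the trade-off versus std's parent pointers is that every path-dependent fix-up has to be expressible as a return value.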
It would be unusual for a misbehaving Ord to actually affect a data structure, because if it does, it means you must have made some redundant comparisons that gave conflicting answers. If you never perform any redundant comparisons, you're not going to be able to notice that the key ordering is inconsistent, so it's not going to affect you.
But perhaps in BTreeMap there are some such redundant comparisons, and then it has to make sure it can handle them.
One case where I have seen this being relevant is quicksort combined with an insertion-sort phase at the end. After you partition your array around a quicksort pivot, you can use the pivot as a sentinel for insertion sort: you skip the array bounds checks and do unsafe indexing, knowing that you're going to hit a larger/smaller pivot before you run off the end of the array. This relies on comparisons with the pivot giving the same result as they did the first time; otherwise you're going to do an out-of-bounds array access.
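A simplified sketch of that pattern (not any particular library's code): the partition invariant is what lets the inner loop drop its lower bounds check, and that is exactly the reasoning an inconsistent Ord would invalidate. With u32's consistent Ord and a correctly partitioned input, this is sound.

```rust
/// Insertion-sort `v[pivot_idx + 1..]`, using the pivot as a sentinel.
/// Precondition: the slice is partitioned around v[pivot_idx], i.e. every
/// element at index <= pivot_idx is <= every element after it.
fn sentinel_insertion_sort(v: &mut [u32], pivot_idx: usize) {
    for i in (pivot_idx + 1)..v.len() {
        let mut j = i;
        // SAFETY: the element being sifted down came from right of the
        // pivot, so the comparison below must fail no later than when
        // j - 1 == pivot_idx; hence no `j > 0` check is needed. With an
        // inconsistent Ord that argument collapses, j - 1 could underflow,
        // and the unchecked accesses would go out of bounds: UB.
        unsafe {
            while v.get_unchecked(j - 1) > v.get_unchecked(j) {
                v.swap(j - 1, j);
                j -= 1;
            }
        }
    }
}

fn main() {
    // Already partitioned around index 2 (value 2): left <= 2 <= right.
    let mut v = [0, 1, 2, 5, 3, 4];
    sentinel_insertion_sort(&mut v, 2);
    println!("{:?}", v);
}
```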
It doesn't even need to be "redundant" comparisons.
Imagine a BTreeMap into which you insert pairs whose keys are 1, then 3, then 2. When I insert the 3 I need to check 1 <= 3, and similarly when I insert 2 I need to check 1 <= 2. There are no redundant comparisons here, but you could easily mess things up and get [3, 1, 2] if the cmp() method on integers returned a random value each time it was called.
That wouldn't be "messing things up", that would be the correct behavior if the comparison function on your type said 3 < 1 and then 1 < 2. This is all as expected. It's a correct total order on these three values. No problem, no undefined behavior, everything is fine. You get [3, 1, 2], correctly.
The risk of undefined behavior would appear if now the BTreeMap for some reason made a redundant comparison such as comparing 3 with 2, got an inconsistent result like 3 > 2, and proceeded to somehow mess up its internal pointers or something that would lead to memory corruption and undefined behavior.
But it probably simply doesn't do that, and so it remains safe.