How is BTreeMap written to be robust against Ord implementations

Regarding Ord implementations that return a random value, consider the case where the insert method is called twice with the same key. Since the value needs to be updated, then the passed key needs to be compared again.

So if a different random value is returned, then if the implementation doesn’t consider that case you could end up with, for example [1,2,3,2]. If that case is permitted by the API, then any unsafe code cannot rely on the keys being in order, or even unique.

This means that, to maintain soundness, either the invariants any unsafe code relies on must be maintained, even with inconsistent Ord implementations, or, any unsafe code must stop relying on those “invariants”.

3 Likes

It can't be the same actual key because insert consumes the key that is passed in. It must be a different key, or a clone.

Do you mean equal keys? I.e. keys that compare equal when eq or cmp is called on them?

By "not in order" do you mean according to what cmp said?

I would suggest using letters rather than numbers in examples, because numbers suggest these are i32 with the standard Ord implementation which leads to confusion. We're talking about some other type with a different Ord implementation.

If [a, b1, c, b2] order resulted from insertions, BTreeMap either never compared b1 against b2 during insertion using eq or cmp, or if it did then cmp must have said that b1 is smaller than b2. So in either case they cannot be called "not in order" or "same key" according to what Ord / PartialEq has said so far.

Those are in order according to the results of all the comparisons performed so far during insertions.

The only way for cmp to even have a chance to violate the linear order property is to ask it a redundant question, such as to compare the same two elements twice.

But we never do that, at least in the naive implementation of insert / get that I am thinking about.

1 Like

Summary: it's something to be aware of when implementing BTreeMap, but I highly suspect it doesn't actually cause any difficulties in the practical implementation because there is no need to ever call cmp again on elements that have already been inserted, thus there is no risk of any inconsistency.

All the algorithms manipulating B-trees can move elements around maintaining the order without having the call cmp again -- the order is already implicit in the layout of the data structure.

Fair point, I did misspeak there. A more accurate things for me to have said would be "...cannot rely on the keys being in a consistent order, ...".

If the Ord implementation, and associated Eq implementation are sufficiently strange, then there isn't a single ordering for the keys. Instead of the keys being laid out in "the order" they are merely laid out in an order.

The keys can be shared references or Copy types, so multiple copies of the same bits can end up in the data structure. It is true that the phrase "the same key", that I used, is a little imprecise in this context.

I think you are aware of this, but for others reading: Ord requires the type to implement Eq as well.

If the Ord and Eq implementations are both random then there are not fewer weird possibilities than if the Eq implementation consistently reports some keys as equal. But I think it's easier to think about it if the Eq implementation is sensible, so sure let's assume sensible Eq implementations going forward, unless specified otherwise.

For example, here's a playground where I've implemented Ord in such a way that different calls to cmp give different answers. I've used a single const key called K, and made PartialEq/Eq simply always return true.

Given that setup, and he following main:

fn main() {
    let mut map = BTreeMap::new();
    
    map.insert(K, 1);
    map.insert(K, 2);
    map.insert(K, 3);
    
    for (key, value) in map.iter() {
        println!("{:?}: {}", key, value);
    }
}

... we get the following output:

Key: 3
Key: 1

... which is probably unexpected.

By itself, unexpected behavior like this is not quite Undefined Behavior, but if it breaks assumptions that unsafe code is making, it could cause Undefined Behavior.

I think what you are saying here is true, given we only have a single level in the B-Tree. (That is we haven't inserted more than B elements yet) But in general, in order to implement get when there are multiple levels in the tree I think you need to perform at least one comparison per extra level. (I guess you could linearly search all through the nodes in memory order, but that rather defeats the point of using a B-Tree).

Specifically, I imagine that when determining if the key is already there in either insert or get then the key parameter must be compared with at least one element in the current node (initially the root node) if it has children, in order to know whether the left child or right child should be looked at, or if the element must be in the current node.

So, there would be a case where the same key would need to be compared against the same other key, in a different insert call. If we get a different answer from cmp, then we will end up with some unexpected results, most likely inserting a more than one key that are equal to each other.

That said, I am currently unable to think of a case that could cause UB in a reasonable implementation of insert or get, besides indirectly by insert introducing equal keys. So you may be right that an add-only B-Tree doesn't have this issue.

But I think while implementing remove, it would be easy enough to write unsafe code that causes UB by assuming there cannot be equal keys, particularly when a removal means nodes should be rejoined. Although I admit I have not yet worked out the specifics in detail.

1 Like

In an order that is fully consistent with everything BTreeMap has learned from calling cmp -- and there is always exactly one such order.

It doesn't matter if cmp is random or not: every time BTreeMap::insert calls cmp on the new key, it will decide whether to go left or right based on the result of that comparison, and the key will end up in a place -- the only place -- that is consistent with all the results of these cmp evaluations.

So the final ordering after a sequence of insert calls is going to be a total order fully consistent with the results of cmp.

It doesn't matter that additional, hypothetically possible calls to cmp, would be inconsistent with this order: it can't affect BTreeMap because BTreeMap never makes those hypothetical calls.

First of all, you don't have only one key in your code. There are three keys. Your map's type is BTreeMap<&Key, i32>, and there are three objects of type &Key created there. That you're using a const K doesn't matter: every use of that const makes a new object.

Let's go through and see what happens.

Your cmp implementation on your type Key always cycles through Less, Equal, Greater in this order.

You create the first key, let's call it K1, and insert K1 -> 1. This doesn't need any comparisons, so it is inserted.

Then we create a second key, call it K2. The second insert K2 -> 2 calls cmp(K2, K1) which returns Less. So, it means that K2 < K1 and we have { K2->2, K1->1 } in this order.

The third insert K3 -> 3 starts by calling cmp(K3, K2). This returns Equal. So, based on what we have seen: K3 == K2 < K1. K3->3 replaces K2->2 and we end up with { K3->3, K1->1 }.

That's exactly what ends up being printed.

K3 == K2 < K1 is still a valid, linear order, for all that BTreeMap knows about your keys.

No, it's true regardless of depth.

"More than one key that are equal to each other" doesn't make any difference until cmp actually returns Equal when comparing them. If cmp is never called between two elements, it makes no difference whether they are "equal to each other". If cmp is random, it doesn't even make sense to say they are "equal" if you never make the call.

Whether they would theoretically compare equal or not in a hypothetical call makes no difference to the safety of BTreeMap in this scenario because it never makes that comparison.

I suppose it would be possible, but, like I said, only if we make unnecessary, redundant comparisons. But there is no reason to do that.

remove doesn't need to compare already existing keys amongst each other at all, and if we don't, there is no sense in which they could be considered "equal" as far as remove is concerned. There is never a cmp called during remove on two keys already in the map (or at least, there is no reason to do that). Therefore, whether such a hypothetical cmp would return Equal or not doesn't affect anything, because that comparison is never made.

In general: BTreeMap already knows the exact order of all its keys. The order is implicit in the shape of the tree. So it never needs to compare them again in any operations. It only has to compare new keys that are provided by the user against its existing keys, in order to figure out where they fall.

I was under the impression that the compiler would pass the same pointer in each case. But adding assert_eq!(dbg!(format!("{:p}", key)), format!("{:p}", K)); to the loop has has proven me wrong about that. Perhaps the compiler is allowed to but does not in practice? Either way, I don't think this actually makes a difference to the topic of discussion.

I don't think it makes a difference because when I said "equal to" in the last post, I meant regarding the return value of eq. I definitely could have been clearer about that.

Because the Eq implementation is known to exist, given an Ord implementation, since I had made eq on Key always return true this is technically false.

As an example, this function compiles:

fn ord_eq<O: Ord>(o1: O, o2: O) -> bool {
    o1 == o2
}

And the return value is determined by the eq method.

So given an Eq implementation that always returns true we have K2 < K1 == K3 == K2 == K1 which is not a valid linear order.

Given that, I think we would both agree that it is possible to write an implementation of BTreeMap's API that causes UB by assuming that the output of eq agrees with that of cmp. We might disagree on how likely such an implementation is to be written by accident. Unfortunately, that's something that is hard to find a definitive answer about, without lots of empirical data.

remove would still need to compare the key parameter to some of the existing keys, if there are any. And in that case I think it would be easy to accidentally use the == operator, and depending on the implementation details, I think that this could plausibly cause UB. (In particular I'm imagining using == during a linear search of the keys when it would be faster than binary search due to fewer cache misses.)

Given that incorrectly using == when doing a linear search can reasonably be said to involve redundant comparisons, even when it would be faster with the added comparisons, I suppose I will concede that I cannot think of a case where UB would plausibly occur without either redundant comparisons or incorrect use of Eq. And the Eq usages could be replaced with Ord usages in each case.

But, I think making redundant comparisons accidentally is easier than you appear to think, so handling strange Ord and/or Eq implementations is still something worth being concerned about.

All told, at this point, I don't think we disagree on very much.

I believe BTreeMap doesn't use eq at all, it uses cmp. It wants to figure out whether the element goes on the left, on the right, or we have found it, so cmp is perfect for the task.

Of course it's possible to cause UB. I have been saying it all along.

The point is that it would be a redundant comparison. So all you gotta do to avoid problems is never make redundant comparisons. That's exactly what I said in my first post, and have been beating that horse in subsequent replies:

Sure, I can imagine that some code might freak out if it uses ==, it says "unequal", and then it goes back to cmp and it says Equal. So if that's your implementation strategy, you just gotta handle the Equal => case rather than freaking out. As long as you handle that conflict somehow, it will still work. For instance if you do:

if x == y {
} else if x < y {
} else {
}

it will still work fine, even though it's using both eq and cmp -- you make a valid decision in each case.

But really you just want to match on the result of cmp rather than doing the above.

Point is, it's not that hard not to run into UB regardless of how the user code is implemented. The nomicon book quote mentions BTreeMap, but it's not like BTreeMap has to do some special magic and work hard to avoid UB -- it's just an example that explains high level requirements on unsafe code.

Even if you made your references point to the same Key object, they would still be three reference objects. It's a bit confusing that you have a Key type, but it's not the key type of your map, the key type of your map is &Key rather than Key. The example would be a bit clearer if you made your map be BTreeMap<Key, i32> rather than BTreeMap<&Key,i32> (but the experiment result would be the same regardless).

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.