Which trait bounds should go on a B+ tree?

This is an interesting problem that has to do with ownership, memory safety, and performance.
I discussed it with a few folks over IRC, but would like to hear what a broader audience thinks.

Let's call this data structure a BPlusTreeMap<K, V>, which is going to be similar to BTreeMap<K, V>.
Obviously, keys must be ordered, so there's our first trait bound: K: Ord.

I want to support just two operations:

  1. fn insert(&mut self, key: K, value: V)
  2. fn search(&self, key: &K) -> Option<&V>

Now, in a B+ tree leaf nodes contain elements (K, V) and internal nodes contain copies K of keys, whereas a B tree has just one (K, V) for each inserted element (maybe in a leaf node, maybe in an internal node).
Ok, so a B+ tree must be able to copy keys for it's internal use, therefore we have another bound: K: Clone.

Suppose we want to create a BPlusTreeMap<String, i32>. There is going to be a ton of string cloning, which will have a horrible effect on performance. However, we don't really need clones of keys. Having unsafe shallow copies of keys in internal nodes would be just fine. So in this example, we could just memcpy those string keys in order to have copies in internal nodes. Then we just have to be careful not to call destructors on those copies. This eliminates the K: Clone bound.

Ok, this is a good performance improvement. But now we have a memory safety problem!
Suppose that K is a value which contains an ArcCell<Option<Box<i32>>> and on every comparison does something with it, perhaps drops the Box or allocates a new one. If we are sometimes operating on the real K and sometimes on a shallow copy of it, then they will eventually go out of sync and we might have double frees! This is bad.

So how do we fix this? That's the hard part where I need help :slight_smile:

We figured out a possible solution over IRC. It seems that what we really want is to forbid keys which are cell-like. In other words, forbid keys which have interior mutability. We need a bound K: NotCell, where NotCell is defined like this:

#![feature(optin_builtin_traits)]

trait NotCell {}

impl NotCell for .. {}
impl<T> !NotCell for std::cell::UnsafeCell<T> {}

I'm very happy with this solution, but would need some reassurance that this is indeed correct.

What do you think? Do you have any better ideas?

1 Like

NotCell won't work as is, because you can have interior mutability via unsafety. So, the trait should be unsafe and !implemented for raw pointers. (EDIT: Hm, actually it would work because unsafety will boil down to UnsafeCell :slight_smile: )

With this modification, NotCell will be surprisingly similar to std::Sync! But, Sync won't suffice either: atomic integers are Sync and contain interior mutability, so they should be forbidden as well.

And even if we forbid everything we need, I don't think it'd be a right solution: we don't want to constrain the type, we want to constrain PartialEq::cmp method implementation. I don't know how to do that.

Is there perhaps some way of completely dodging the problem? Like using array of keys and indices? Or storing each key exactly once at the topmost node, and storing empty holes below?

And another idea: what about guarding against this edge case at runtime? memcpy the key before the comparison and assert! that it does not changes.

As for the immediate problem: It sounds like you don't want shallow copies. You probably want (unsafe) references/pointers instead. Or accessor methods built on such references/pointers. This would ensure that wherever the pointer is dereferenced (or the accessor is called), it reads the current state of the referred key.

However, I don't think that's even the right direction. If you go to the documentation of BTreeMap or HashMap, it simply states that it is a logical error of the user code if keys are mutated while in the map in a way that this changed the behavior of Eq/Ord/Hash. Then the docs go on to say that while this might (and will likely) cause the maps to return incorrect results, the maps are themselves implemented in a way that this can't lead to
memory unsafety.

You should probably do the same: shift the burden of logical correctness and upholding semantic immutability to the user, but do not rely on this for enforcing memory safety.

Edit: I just reaized this thread is 4 years old. The above advice might still be useful for future writers of data structures.