Help with simple B+ tree implementation

Hi everyone. I've posted here a few times before and I am very thankful for all of the helpful and prompt responses I have received.

That being said, I am trying (for the 5th time) to get myself to understand Rust. I really appreciate the idea of compile time safety checks and it seems like everyone in my field (kernel, filesystems, and database programming) believes that Rust will be the language of the future. However, every time I try to get into it I can't help but feeling like Sideshow Bob.

I am trying to implement a simple B+ Tree that stores Box<u64> values indexed by u64 keys. Below, I have pasted what I have so far (it is really ugly and it currently only supports find() and an insert() method that doesn't compile and is missing some logic). I have left inline comments with questions I have. I would appreciate it if anyone could answer any of the questions I have here. I'm sorry that I don't have a single specific question to ask, but I really feel like I'm just banging my head against the wall at every turn. If it makes more sense for me to split this up into separate topics I can do that too (I'm not sure which is easiest).

Advice on different approaches to learning Rust is welcome as well. I have tried following along with a few guides including Rust through too many linked lists and the official handbook, but I always feel like I've come away more confused after reading them.

Thanks for any help.

struct InteriorNode {
    keys: Vec<u64>,
    children: Vec<Box<Node>>,
}

struct KeyValuePair {
    key: u64,
    value: Box<u64>,
}

struct LeafNode {
    kvs: Vec<KeyValuePair>,
}

enum Node {
    Interior(InteriorNode),
    Leaf(LeafNode),
}

struct BPTree {
    n: usize,
    /*
     * The type here has some crazy levels of nesting. Assuming I have a 
     * single key/value in the tree and I want to get a reference to the value,
     * I need to unwrap the option, dereference the box, match the Leaf type,
     * grab a reference to the Leafnode::kvs vector, get first value from the
     * Vec, and finally dereference the box. None of this process even dealt
     * with iterating down the actual structure of the BPTree. I understand
     * the value of each of these layers, but this in turn makes all of my
     * code extremely nested and hard to follow. Is there a way I can do this
     * without all of the nesting? Can I write helper functions to hide some
     * of this? If I can write helper functions do I need to write all the
     * boilerplate myself every time I have a complex structure like this?
     */
    root: Option<Box<Node>>,
}

impl BPTree {
    fn new_leaf(&self) -> Box<Node> {
        Box::new(Node::Leaf(LeafNode {kvs: Vec::new()}))
    }

    fn new_interior(&self) -> Box<Node> {
        /* This seems really verbose. Is there a shorthand way to do this? */
        Box::new(Node::Interior(InteriorNode {children: Vec::new(), keys: Vec::new()}))
    }

    fn new(n: usize) -> BPTree {
        BPTree { n: n, root: None }
    }

    fn split_leaf(&mut self, leaf: &mut LeafNode) {
        /*
         * I got stuck in a few ways here. Basically, this function is supposed to split
         * a leaf node that is overfull into 2 half-full leaves and then add both to the
         * parent layer. If this insertion results in an interior node being overfull, that
         * node is also split and this process continues up the entire tree until either
         * an underfull node is found or we hit the root node. If we hit the root, we split
         * it and create a new root level of the BPTree. My issues are as follows:
         * 
         * 1) As I stated below in insert(), I cannot call this function at its current place
         *    in the code due to an error about borrowing self twice.
         * 2) I'm not sure how to do parent pointers idiomatically. Should I create raw
         *    pointers and drop into unsafe blocks? Should I use Rc::weak pointers? Can I use
         *    these without using a regular Rc pointer?
         * 3) Ideally, in c I would create a function to add a node as a child of another to
         *    ensure that every time I add a child to a node the parent pointer also gets
         *    updated. What does the ownership model look like here? Would the parent and
         *    child both have mutable references to each other? Ins't that not allowed?
         * 4) Due to the other issues, I have not really gotten close to implementing this
         *    method and I have no idea what other issues I might hit.
         */
        unimplemented!();
    }

    fn insert(&mut self, key: u64, val: u64) {
        match self.root {
            Some(ref mut node_box) => {
                let mut node = node_box.as_mut();

                'search: loop {
                    match node {
                        /*
                         * Is the only way to get to the value contained within an enum
                         * to use a match statement?
                         */
                        Node::Interior(ref mut int) =>  {
                            for (i, search_key) in int.keys.iter().enumerate() {
                                /*
                                 * I don't understand why sometimes dereferencing happens
                                 * automatically and sometimes i need to do it manually.
                                 * Can anyone explain this to me?
                                 */
                                if key < *search_key {
                                    node = int.children.get_mut(i).unwrap();
                                    /*
                                     * Why did I need to add a label to my loop? I was not
                                     * able to figure out what 'continue' means in the context
                                     * of a 'match' statement (at least with a quick google).
                                     * Does it mean the equivalent of a "fall-thru" switch case? 
                                     */
                                    continue 'search;
                                }
                            }

                            node = int.children.last_mut().unwrap();
                        },
                        Node::Leaf(ref mut leaf) => {
                                for (i, search_kv) in leaf.kvs.iter_mut().enumerate() {
                                    if key < search_kv.key {
                                        /*
                                        * I can't create a method to make a new KeyValuePair from a
                                        * key and value. I would like to be able to do this so that I have
                                        * a concise way of creating KeyValuePairs. I wanted it to be a
                                        * method so that I can eventually make BPTree generic, allowing the
                                        * method to automatically create the right generic KeyValuePair.
                                        */
                                        leaf.kvs.insert(i, KeyValuePair { key: key, value: Box::new(val) });
                                        break;
                                    }
                                }

                                /*
                                 * I am missing a case here. If the key to be inserted is higher than any
                                 * existing keys nothing will get inserted. In c, I would create the variable
                                 * 'i' before the loop and check if i > kvs.len() after the loop. In python
                                 * I would use a for...else statement. I'm not sure what the idiomatic thing
                                 * to do is here.
                                 */
                                if leaf.kvs.len() > self.n {
                                    /*
                                    * Compilation error: can't borrow self twice. I don't understand why
                                    * this counts as borrowing twice. From my perspective i'm trying to
                                    * pass my mutable, borrowed reference into the function.
                                    */
                                    self.split_leaf(leaf);
                                }
                        }
                    }
                }
            },
            None => {
                /*
                 * In any other language I would structure this differently. I would
                 * have an 'if' statement at the top of my code to guarantee that
                 * root is allocated. Then I would run the normal insertion code that
                 * is currently in the 'Some' case. Instead it seems I need to duplicate
                 * the insertion code or add unnecessary 'None' checks.
                 */
                let mut root = self.new_leaf();
                
                /*
                 * Why don't I need 'ref mut' on the leaf variable?
                 */
                if let Node::Leaf(leaf) = root.as_mut() {
                    leaf.kvs.push(KeyValuePair { key: key, value: Box::new(val)});
                } else {
                    /*
                     * I just made sure 'root' was a leaf, but I have no way to get a
                     * ref to the inner struct without a match or an 'if let' statement.
                     * I find myself writing a lot of 'else panic' statements like this
                     * one because I want to make sure all of the cases of the 'if'
                     * are handled.
                     */
                    panic!("Root node is not a leaf");
                }

                self.root = Some(root);
            }
        }
    }

    /*
     * Do I need to write another version of this function for fetching
     * a mutable reference? Wouldn't that function be almost entirely
     * copy-pasted? That seems like a lot of code duplication / boilerplate.
     */
    fn find(&self, key: u64) -> Option<&Box<u64>> {
        match self.root {
            Some(ref node_box) => {
                /*
                 * This kind of hurts my brain. So this is a MUTABLE variable
                 * named 'node' that refers to an immutable reference to a
                 * value inside a box? Do I have that right? If I take away the
                 * 'mut' keyword I get compilation errors when setting 'node'
                 * later in the code.
                 */
                let mut node = node_box.as_ref();

                'search: loop {
                    match node {
                        Node::Interior(int) =>  {
                            for (i, search_key) in int.keys.iter().enumerate() {
                                if key < *search_key {
                                    node = int.children.get(i).unwrap().as_ref();
                                    continue 'search;
                                }
                            }

                            node = int.children.last().unwrap().as_ref();
                        },
                        Node::Leaf(leaf) => {
                            for search_kv in leaf.kvs.iter() {
                                if key == search_kv.key {
                                    return Some(&search_kv.value);
                                }
                            }

                            return None;
                        }
                    }
                }
            },
            None => {
                return None;
            }
        }
    }
}

fn main() {
    let mut bt = BPTree::new(4);
    
    bt.insert(1, 1);
    let val = bt.find(1);

    println!("Hello, world!: {}", *val.unwrap());
}

I'm also still learning Rust. I'm confused by the use of Box in KeyValuePair. Also, is keys.len() == children.len() ?

I would collapse the four above to:

enum Node {
  Leaf(Vec<(u64, u64)>),
  Interior(Vec<(u64, Box<Node>)>)
}

It loses a bit of the nice naming, but might make the algorithms more concise.

I'm confused by the use of Box in KeyValuePair

The idea was that value would eventually be a generic type. I guess in retrospect for a b+ tree it should be required t be a Copy type so that they can squeeze into a single Vec array (which is the point of a B+ tree after all).

keys.len() == children.len() ?

No. For interior nodes keys.len() +1 == children.len(). The idea is that an interior node has a set of children and a set of keys which act as delimiters between them. https://upload.wikimedia.org/wikipedia/commons/3/37/Bplustree.png

For leaf nodes, keys.len() == values.len(). So I guess your naming suggestion would work for leaf nodes but not interior nodes.

Are you keeping the keys separate rather than interleaved for better caching when binary searching through the keys to find which child node to jump to?

I'm not really optimizing that heavily right now. I'm mostly just doing it so that I can easily iterate through the keys as a simple Vec. I could do it as a Vec<(Value, Key)> + an extra Value at the end but that seems a bit uglier.

If I did not care about caching, I would use Vec<(Value, Key)> with a dummy +inf or -inf, as it enforces that keys.len() + 1 = values.len().

I would also modify insert/split to return a Option<Node, Node>

In the comments above, you mention that one problem is: a insert/split might cause parent to be overfull and thus need to be split. I would handle this recursively by having the function return Option<Node, Node>. A None signals to the parent "everything is fine". A Some(Node, Node) indicates to parent current node was split into two. This "bubbling up" of Option<Node, Node> might solve the issue you're facing.

It is not surprising that you are having trouble implementing a B+ tree, as it is one of the harder things to make in Rust. The one in the standard library has resorted to using raw pointers for a reason. Regardless, let's take a look at your questions.

Deeply nested types: You will have to deal with some level of nesting with this type, but you have more than strictly needed.

  1. The root field in BPTree can just be a Node. You do not need the option nor the box. An empty tree can be a leaf node with an empty vector.
  2. You do not need a box in the KeyValuePair. You also will not need it for generic types.
  3. The Vec<Box<Node>> in InteriorNode can just be a Vec<Node>.

In general the vectors provide sufficient indirection for the types to have finite size. As @anon80458984 also noted, you can also inline the types into the enum.

Note that Copy does not imply small and not Copy does not imply large. E.g. a Box is not copy, but is the same size as usize (but references more data on the heap), while [u8; 1000] is Copy but is a thousand bytes large.

Is the only way to get to the value contained within an enum to use a match statement?

In some cases the if let construction is appropriate, but in this case you want a match. If you want to extract a single field that exists in each variant, a helper function with a match can be very nice.

I don't understand why sometimes dereferencing happens automatically and sometimes i need to do it manually. Can anyone explain this to me?

In general dereferencing happens automatically only in these cases:

  1. Calling a method on the type. doc
  2. Destructing a value into a pattern. doc

... unless I forgot something. You may also want to read about the Deref trait.

Why did I need to add a label to my loop? I was not able to figure out what 'continue' means in the context of a 'match' statement (at least with a quick google). Does it mean the equivalent of a "fall-thru" switch case?

Continue does not interact with a match. You needed the label to get through the for loop :slight_smile:

I can't create a method to make a new KeyValuePair from a key and value. I would like to be able to do this so that I have a concise way of creating KeyValuePairs. I wanted it to be a method so that I can eventually make BPTree generic, allowing the method to automatically create the right generic KeyValuePair.

Sure you can! The idiomatic place to put it would be here:

impl KeyValuePair {
    pub fn new(key: u32, value: u32) {
        KeyValuePair {
            key: key,
            value: Box::new(val),
        }
    }
}

You can now call it as KeyValuePair::new(key, value). In general you should do this with all of your constructors to avoid also borrowing self when calling them.

I am missing a case here. If the key to be inserted is higher than any existing keys nothing will get inserted. In C, I would create the variable i before the loop and check if i > kvs.len() after the loop. In python I would use a for...else statement. I'm not sure what the idiomatic thing to do is here.

I'm not sure there is an idiomatic thing. Comparing the length and adding an else to your if would probably be fine.

In any other language I would structure this differently. I would have an if statement at the top of my code to guarantee that root is allocated. Then I would run the normal insertion code that is currently in the Some case. Instead it seems I need to duplicate the insertion code or add unnecessary None checks.

The challenge is convincing Rust that you really are dealing with a Some after the if. You could just unwrap it, which would be fine since it actually would be guaranteed to be Some in that case, but it's not very pretty. Ultimately you don't need the option as I noted in the start, so just remove it.

Sometimes you can structure your code something like this:

let reference_to_inside_some = match &optional_var {
    Some(val) => val,
    None => {
        optional_var = Some(initialize_me());
        optional_var.as_ref().unwrap()
    },
};
// now do code using the reference above.

Unfortunately support for this is sometimes spotty. The best example of this pattern is the entry api for the hash map.

Why don't I need ref mut on the leaf variable?

The difference between this:

if let Node::Leaf(leaf) = root.as_mut() {

and this

match self.root {

is that root.as_mut() has the type &mut Node, while self.root has the type Option<Node>.

In one case the value you're destructing into a pattern is a reference, which is automatically dereferenced. When destructing a reference type, all fields in the pattern are automatically references.

In the other case, you're destructing a value. In this case the default is to take the value by value, which would result in the fields containing values. You are changing this default by using ref mut, but you could also have matched on &mut self.root in which case you'd no longer need the ref mut in the Some case.

I just made sure root was a leaf, but I have no way to get a ref to the inner struct without a match or an if let statement. I find myself writing a lot of else panic statements like this one because I want to make sure all of the cases of the if are handled.

One important thing to understand about Rust, is that it only reasons locally. In particular, when calling a function, it only looks at the signature. In this case, you're returning a Box<Node>, so the compiler would have to look inside the function to figure out that it's definitely a leaf, which it will not do.

In this case I would probably try to restructure my code to work directly on the LeafNode type and only wrap it in a Node when you reach the self.root = Some(root); line. In this case, the type system guarantees you're dealing with a leaf node, so you don't need a panic.

Do I need to write another version of this function for fetching a mutable reference? Wouldn't that function be almost entirely copy-pasted? That seems like a lot of code duplication / boilerplate.

Unfortunately yes. The BTree in the standard library has a shared function returning a raw pointer, but you can't do this with references, because:

  1. If the shared function returns a mutable reference, it would take self by mutable reference, so you can't call it if you only have an immutable reference to self.
  2. If the shared function returns an immutable reference, you wouldn't be able to use it to return a mutable reference.

With raw pointers, the needed conversion is possible.

This kind of hurts my brain. So this is a MUTABLE variable
named node that refers to an immutable reference to a
value inside a box? Do I have that right? If I take away the
mut keyword I get compilation errors when setting node
later in the code.

let mut node = node_box.as_ref();

Yep you got it right. The gist is that since node is mutable, you can change which node it points to, but as the reference is immutable, you can't change the node itself.

Compilation error: can't borrow self twice. I don't understand why this counts as borrowing twice. From my perspective i'm trying to pass my mutable, borrowed reference into the function.

Method calls are syntax sugar for, in this case, this:

BPTree::split_leaf(&mut self, leaf);

Basically you are trying to pass two overlapping mutable references to the split_leaf function. In this case the issue is that leaf overlaps with self, which is not allowed. This is detected by the compiler by the fact that self was borrowed, and then the node_box was borrowed and so on until you reach leaf.

This is the kind of thing that makes implementing a B+ tree difficult, but in this case it doesn't seem like you actually need to access self, so make a free-standing function that only takes leaf instead. Unfortunately, this is not the end of the problems, as you have noted:

I got stuck in a few ways here. Basically, this function is supposed to split a leaf node that is overfull into 2 half-full leaves and then add both to the parent layer. If this insertion results in an interior node being overfull, that node is also split and this process continues up the entire tree until either an underfull node is found or we hit the root node. If we hit the root, we split it and create a new root level of the BPTree. My issues are as follows:

  1. As I stated below in insert(), I cannot call this function at its current place in the code due to an error about borrowing self twice.
  2. I'm not sure how to do parent pointers idiomatically. Should I create raw pointers and drop into unsafe blocks? Should I use Rc::weak pointers? Can I use these without using a regular Rc pointer?
  3. Ideally, in C I would create a function to add a node as a child of another to ensure that every time I add a child to a node the parent pointer also gets updated. What does the ownership model look like here? Would the parent and child both have mutable references to each other? Is that not allowed?
  4. Due to the other issues, I have not really gotten close to implementing this method and I have no idea what other issues I might hit.

Adding parent pointers would definitely break the ownership model. It's true that you could in principle do it with weak Rcs, but that's definitely not ideal, and no, you can't use those without a corresponding regular Rc.

But yes, the issue is that you need access to the parent somehow. The obvious way to get would be to not throw it away in insert, but I expect that approach to leave you with crazy borrow problems when implementing insert, because now you have both a parent and a child and want to use both at the same time.

The real BTree solves this by using raw pointers, because then there's no problem with just keeping the parent and child pointer around. It doesn't add pointers back up to the parent, it just doesn't throw the parent pointer away when descending.

Maybe I could come up with a different approach that would work, but it's not obvious how to avoid these problems in safe code.

1 Like

@alice Thank you so much for the detailed response. I want to make sure I understand everything before I ask anything else so I will reply later today after I've had the chance to go through it all.

This may be a major spoiler, but have you read Alexis Beingessner's Rust Collections Case Study: BTreeMap?

1 Like

Right. I just meant that they could be squeezed into a single block of memory, which is the point of a B+ tree.

I'll look into that a bit. The doc seems to imply that there are more cases where this will happen. Maybe I can find a definitive list somewhere. I will comment here if I find it.

Ah. Whoopsie.

Ok. So what I was trying to do before was make a private method BPTree::new_leaf(). I see that this could also be done as you stated, but for the sake of argument lets pretend I had a good reason to make it a method of BPTree as I originally intended. When I tried to do this, I got an error when I called the function at that same place in the code. The error stated that I had already mutably borrowed self (the BPTree struct) and so I could not immutably borrow it again to pass it into the method. Do you know how I could circumvent this?

My issue with unwrap() is that it takes the value from the Option instead of borrowing it. Is there a version of unwrap() that just grabs a reference? I couldn't find one, but maybe I just missed it. Is that what the .as_ref().unwrap() does in your snippet? I thought that would just give me a &Option<T>.

Ah ok. That clears a lot up. Thank you.

That makes a lot of sense.

Ok. That is a shame.

So to clarify, the compiler is complaining because the mutable borrow on leaf is an extension of the mutable borrow on self? And therefore I am effectively mutably borrowing self twice? That kinda makes sense, but off the top of my head I don't see why the compiler would be upset abut that. I don't see how that could be considered dangerous.

Ok. Maybe I should try my implementation again with raw pointers to simplify some of these problems. Up to this point I have been trying to avoid that because it seems like so much of the power of rust comes from its safety guarantees. In my head my thought process was kind of "if i'm going to be writing unsafe code in rust, why would I not just use C instead". But maybe if I can keep the unsafe areas to a minimum it will still be worth it.

Thank you so much for the response.

I hadn't seen that yet. I will read it tomorrow. Thank you.

This is a good point, because it's not necessarily obvious why the compiler prevents it. It is fundamentally a design choice that was chosen by Rust. It may be more useful to think about references in this way:

  1. An immutable reference is actually a shared reference.
  2. A mutable reference is actually a unique reference.

This description of pointers is closer to the truth. The fact that one is usually immutable and the other is usually mutable then just follows from the rule that Rust does not have data races.

That a mutable reference must be unique is usually expressed in this way: "Mutable references may not alias with any other reference in the program". Aliasing basically means overlapping, but there is one caveat: You can borrow a unique reference and create a "sub-reference", and this is not considered aliasing even if they overlap because the borrowed reference is not usable while the borrow exists. The nomicon has a chapter on aliasing.

Okay, so, why does a reference being shared imply that it must be immutable? Basically the problem is that if you assume it could be modified through the reference, then if you send it to another thread, you could use your modification super powers to generate a data race by modifying it from two threads at once.

Your code is not multi-threaded? Great, then you get to modify it as long as you prove to the compiler that there's only one thread. You do this by wrapping it in Cell or RefCell or similar. Similarly you can use a Mutex to prove that only one of the many shared references are modifying it at once. Or you can use an atomic integer type to prove that you're using an atomic cpu instruction that doesn't cause data races. The point is that shared references are not always immutable: as long as you convince the compiler that no data races happen, you are allowed to modify through the immutable reference.

Analogously, a unique reference is unique, so if you have such a reference, modifying it is never a data race. You can even access the contents of a mutex without locking it if you have a mutable reference to the mutex using Mutex::get_mut, and this is okay because the compiler guarantees nothing else could try to lock it while the mutable reference exists.

As an example of how mutable references sometimes do not allow mutation: If you have a variable stored behind an Rc, then the variable behind the Rc may still be shared even if you have a mutable reference to the pointer. This is of course because there may be other Rcs, and the uniqueness of the reference only guarantees that there are no other references to the same Rc, but other Rc objects may point to the same item. Thus Rc only allows you a shared reference to the contents, even if you have a mutable reference to the Rc. (of course, you can still change which Rc it points to, you just can't change the object behind the smart pointer) This is also basically the situation you had at that location where you said "This kind of hurts my brain": The situation was basically &mut &T, and this is very similar to &mut Rc<T>.

I hope that makes sense?

Note that raw pointers have no aliasing guarantees, so unsafe code is free to create aliasing mutable raw pointers. On the other hand, raw pointers can alias with unique references, so try to avoid using both.

You can not. The shared reference to self aliases with a unique reference found elsewhere in the calling function, so you are not allowed. You can circumvent this by not unnecessarily borrowing self.

Yes, Option::as_ref converts from &Option<T> to Option<&T>.

Well the advantage of writing unsafe code in Rust is that afterwards you can use it from safe Rust code :slight_smile:

If you're going to try the unsafe route, you should check out types such as NonNull that guarantee that they never contain a null pointer (although it may be some non-null dangling pointer). You should definitely also check out the nomicon, as it provides a lot of information regarding how to write unsafe code. The article posted by @TomP is also a good option.

The article Rust: A unique perspective by @mbrubeck is also a good read.

1 Like

Ok. That makes more sense. So the returned value is a NEW Option which is either None or contains the reference. I think I get it now.

Thank you so much for your help. I'll give it another shot and see how far I can get.

1 Like

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.