Designing Node Types for B+Tree Implementation in Rust

Hi,

I'm relatively new to Rust (about six months of learning) and I'm trying to implement a B+Tree. I'm having difficulty finding a clean and idiomatic way to design my node types. Since B+Tree implementation is complex and most details are out of scope for this question, I'll first explain what I need to do, then show what I've tried so far and how I would approach it in an object-oriented language like C++, which is my background.

What I Need

I need two types: InternalNode and LeafNode. These types have very similar internal behavior:

  • Both have an ordered list of key/value pairs
  • Both can be searched by key
  • Both support insert and delete operations
  • (I'm omitting other common features to keep this simple)

However, there are also differences between them:

  • Both have fields (with getters/setters) specific to their type
  • Major difference: The types of keys and values differ
    • For LeafNode: both key and value types need to be generic
    • For InternalNode: the key must be generic, but values must be references (maybe Box<...>) to other nodes (either InternalNode or LeafNode)

I think this last requirement is what's causing me difficulties.

What I've Tried

I've thought of/tried many approaches, but nothing seems clean or idiomatic.

My most promising design was to have a NodeBase to implement the common functionality - the generic sorted key/value list:

// This is simplified code, implementations are omitted
struct NodeBase<K, V> {
    cells: Vec<Cell<K, V>>,
}

Then I have a Node enum with two variants for internal and leaf nodes:

type BoxedNode<K, V> = Box<Node<K, V>>;

enum Node<K, V> {
    InternalNode {
        base: NodeBase<K, Box<Node<K, V>>>,
    },
    LeafNode {
        base: NodeBase<K, V>,
    },
}

This seemed promising, but then I can't implement methods specific to InternalNode or LeafNode. In this scenario, I don't really have distinct InternalNode or LeafNode types.

How I Would Do This in a Classic OO Language

In a classic OO language, I would have a Node class that is inherited by LeafNode and InternalNode classes. The InternalNode class would hold some sort of pointer to a Node.

My Question

How can I design a set of types that would simulate or replace an inheritance relationship in a way that is both efficient and idiomatic in Rust?

Thanks for any guidance!

2 Likes

I think it's likely you'll end up with additional fields that are specific to leaves and internal nodes. You could define structs for InternalNode and LeafNode (contained in each enum variant) and then implement specific methods for each struct. For common data/methods, you can embed NodeBase in each of these structs.


When accustomed to OOP it may be very tempting to try to automatically "inherit" NodeBase methods from the InternalNode and LeafNode structs. While it is possible to partially simulate this, I strongly recommend not trying to do this, especially while learning Rust, since simulating OOP is not idiomatic or well supported in Rust.

A very simple solution (although it may go against your OOP instincts) is to add base(&self) -> &NodeBase and base_mut(&mut self) -> &mut NodeBase methods to each of these structs instead, and possibly implement a common trait containing these methods. I suggest at least trying this before trying to force Rust to fit OOP patterns.

4 Likes

Hi, thanks for your answer.

I want to make sure I understand your suggestion. What you are proposing is that I have InternalNode and LeafNode structs that conains a node base and they would both have public methods exposing their NodeBase. Is that right? Is it a problem, that we are kind of leaking the implementatation?

My reflex would be to have passtrough methods in InternalNode and LeafNode for the BaseNode method, but I think this is exactly what you said I should avoid, right?

Thanks again.

  1. Collections are the first thing C++ programmers naturally try (it's the first thing I tried) because that's how C++ is taught: from the ground up. It's easier to learn Rust by first getting comfortable writing application-level code, then drilling down.

    Implementing data structures efficiently often involves dealing with allocation and uninitialized memory. Sometimes unsafe code is necessary or improves performance. But these are advanced topics in Rust.

  2. The standard library's BTreeMap is, if I'm reading this right, a B+ tree, and the source code and comments are interesting.

  3. I might try something like this (playground):

    use smallvec::SmallVec;
    
    struct Table<K, V> {
        entries: SmallVec<[(K, V); 8]>,
    }
    
    struct InteriorNode<K, V> {
        leftmost: Node<K, V>, // for keys less than children.entries[0].0
        children: Table<K, Node<K, V>>,
    }
    
    struct LeafNode<K, V> {
        data: Table<K, V>,
    }
    
    enum Node<K, V> {
        Interior(Box<InteriorNode<K, V>>),
        Leaf(Box<LeafNode<K, V>>),
    }
    
    pub struct BTreeMap<K, V> {
        root: Node<K, V>,
    }
    

    This does not leak the implementation.

    I sort of agree with @jumpnbrownweasel that you might not find a lot of common functionality between interior nodes and leaf nodes. I didn't, except for binary_search_by_key which the standard library already provides.

4 Likes

Yes.

Leaking it to whom? Is this a user facing API? If you expect it to change, how might it be changed?

The base methods would need to be pub, as would NodeBase. But the only methods exposed on NodeBase would be those that are also pub -- it can have private methods as well.

Using a struct to provide a "shared interface" of a sort is really Ok, if you leave the OOP mindset behind. If you're used to programming everything with some sort of dynamic interface or base class then I understand this will seem unnatural. That can be done when necessary with Rust by defining traits for everything, or using a veneer API with forwarding. But I believe it is not normally done as a matter of course or "just in case", without a known need for it.

1 Like

I'm afraid you can't achieve both at the same time. For this kind of task, you should prioritize either efficiency or idiomatic design.

BTreeMap is one of the most complex and challenging parts of the Rust standard library. If you look at its source code, it's far from clean and relies heavily on unsafe tricks, essentially bypassing standard language patterns. However, the final implementation demonstrates fairly good performance for this kind of data structure.

How to organize graph data

The primary difficulty here is that B+/-Trees are referential and mutable structures, which are hard to model in a straightforward way using Rust's type system.

In this regard, I recommend flattening the ownership of the graph:

  1. Put all graph nodes into a single linear collection (e.g., a slab-like structure).
  2. Within each node, refer to other nodes (children, parents, siblings) using numeric indices into the collection.

This is a form of manual memory management, but it's likely no less efficient than using interior mutability containers (Cell, Rc, etc.). This approach is formally safe and sound, you just need to ensure you don't mess up the indices. In fact, debugging might be easier, since you can inspect the whole collection to see what's going on.

Later on, you can consider replacing this implementation with leaked box-es and raw pointers, similar to how it's done in the standard library's BTreeMap. But I wouldn't recommend starting with that. It's deep black magic that requires a strong understanding of the internals of Rust's borrowing system.

How to organize the type system

Your initial intuition about using an enum to represent different node types is on the right track. But you might benefit from separating the enum variants into standalone struct types:

enum Node {
    InternalNode(InternalNode),
    LeafNode(LeafNode),
}

struct InternalNode { /* ... */ }
struct LeafNode { /* ... */ }

With this separation, you can mimic inheritance using traits:

trait NodeAPI {
    fn foo(&self);

    fn bar(&self) {
        // Default ("super-class") implementation
    }
}

impl NodeAPI for Node {
    fn foo(&self) {
        // Delegate to the specific implementation
        match self {
            Self::InternalNode(node) => node.foo(),
            Self::LeafNode(node) => node.foo(),
        }
    }
}

impl NodeAPI for InternalNode {
    fn foo(&self) {
        // Overridden implementation
        self.bar(); // Call default ("super-class") implementation if needed
    }
}

impl NodeAPI for LeafNode {
    fn foo(&self) {
        // Another overridden implementation
    }
}

To further optimize this, note that in B+Trees you usually know the kind of a node based on its depth, which opens up two possibilities:

  1. Replace enum Node with a union Node. This is unsafe because you'll have to manually track whether a node is internal or a leaf. However, it avoids the runtime cost of matching enum variants.

  2. Split the node collection into two separate collections: one for internal nodes, and one for leaf nodes. This doesn't require unsafe code, but you'll need to choose which collection to index based on your knowledge of the node type.

Further Reading

  1. An article from the author of the standard library's BTreeMap.
  2. Another implementation based on slab allocation.
  3. My own implementation of B+Trees for text processing.
5 Likes

Hi Eliah, thanks for your answer!

I really like this idea. It will simplify things considerably, and it aligns perfectly with my long-term goals. My current goal is to build a proof of concept, but eventually I want to store the B+Tree data in a paginated file. This means I'll need to serialize the data structure in a sequential format anyway, so using numeric indices instead of ownership will be much more natural for that use case.

Thanks for pointing me in the right direction!

1 Like