Complex hierarchy with parent/sibling pointers

Hi all, I have a scenario where I have a set of structs that form a tree, e.g.:

struct A {
  // data fields ...
  bs: Vec<B>,
}
struct B {
  // data fields ...
  a: // pointer to A
  siblings: Vec</*pointer to B*/>
  cs: Vec<C>,
}
struct C {
  // data fields ...
  b: // pointer to B
  ds: Vec<D>,
}
struct D {
  // data fields ...
  c: // pointer to C
  es: Vec<E>
}

A few conditions:

  1. There is only ever one root A.
  2. "Parent" pointers are never modified after the child is created.
  3. New children can be added to a struct at any time.
  4. Some structs need sibling pointers.
  5. The lifetimes of all objects under an A never outlive A.

I've tried a few solutions, but nothing sticks. Given that these relations are standard in C++, I'm hoping the community can guide me here. Some ideas:

  1. Arena I have a single arena containing all structs and pass this to A, B, C etc as Rc<RefCell<Arena>>. This allows any struct in the hierarchy to effectively modify the tree (e.g., a B can create and add new Cs to itself). However, who owns the root (A)? A needs the arena, too, but the arena only has a single A, so do we wrap it in an Option and set it after creating the A, like:
struct Arena {
  a: Option<A>,
  b: Vec<B>,
  // rest ...
}
  1. Store a weak RefCell self-link in each element and pass this to children for parent pointers. This works, but it does mean I need to always use weak_link.upgrade().unwrap().borrow() and rely on runtime checks for multiple borrows. I'd like to avoid this.
  2. Pass along parent references to children functions when need be. This get's very unwieldy because each struct is fairly complex.
  3. Use an arena, but store data and functionality separately, like:
struct Arena { a: Vec<AData> }
struct AData { ... }
struct ARef<'a> {
  arena: &'a Arena,
  a_idx: usize
}

But then I need an ARef and an ARefMut depending on what I want to do and duplicate methods across these types. This seems clunky.

Thoughts from the community on the proper solution to what seems like a simple design in other languages?

Well, I think the consensus would be: thanks for showing us problems with other languages. We know about these.

The trouble with your data structure is in it's definition: it looks as if you want to have your cake and eat it too.

Indeed they are common in C++ and thus are endless source of bugs. The problem lies with the fact that structure, as described, is intrinsically dangerous.

How do you plan to guarantee that? Before you start designing data structure in Rust you have to answer that question: how can I make it safe? Soundness pledge and all that.

C++ does not care: if you misuse that data structure it's your own fault.

Managed languages say: we would ensure that memory would be removed when there are no more references but other than that do whatever you want.

Rust does care thus offers many ugly ways to achieve what you want. There are no easy way because what you want is inherently dangerous.

You may use weak references and runtime checks. Or no parents and pass them to functions when you descend. Or even put everything in one vector and use indexes.

But you have to design API which you would use to deal with that thing first, then you can understand whay would be the least ugly approach in this particular case.

There are no one, single, “proper” solution because such data structures are used to solve many disparate problems in C++ and other languages where they are easy.

You have to step back, look on higher-level problem that you are trying to solve and then go from there.

P.S. It's similar to how people with dynamic language backgrounds often ask why C++ structs don't have iterators and make it impossible to add new fields at runtime. What do you tell them?

1 Like

What I have done in my code is to store identifiers as "pointers":

struct Tree {
    a: A,
    bs: Vec<B>,
    cs: Vec<C>,
    ds: Vec<D>,
    es: Vec<E>,
}

struct A {
    bs: Vec<BId>,
}

struct B {
    siblings: Vec<BId>,
    cs: Vec<CId>,
}

struct C {
    b: BId,
    ds: Vec<DId>,
}

struct D {
    c: CId,
    es: Vec<EId>,
}

struct E {}

struct BId(usize);
struct CId(usize);
struct DId(usize);
struct EId(usize);

This way you have Tree owning all the data, and you can get the appropriate nodes using the identifiers, for instance:

impl Tree {
    fn get_b(&self, id: BId) -> &B {
        &self.bs[id.0]
    }
}

I've tried this, but how would does a B add a new C child to itself? B would need to have a tree: Rc<RefCell<Tree>> member. Is this idiomatic Rust?

This approach doesn't require using Rc or RefCell. Make new_c a Tree method:

impl Tree {
    fn new_c(&mut self, parent: BId) -> CId {
        self.cs.push(C {
            b: parent,
            ds: Vec::new(),
        });
        
        let c_id = CId(self.cs.len() - 1);
        self.bs[parent.0].cs.push(c_id);
        c_id
    }
}

But, then all tree modifications need to go through Tree. It is possible that in some function B::func(&self, ...) I may need to add a new C.

Why is this a problem?

This seems natural if you actually need these parent pointers. Parent pointers imply that each node effectively has knowledge of the whole tree, so it makes sense to have exclusive access to Tree when doing any modifications to the structure.

To do that would mean that B has mutable access to the Tree, i.e., that Tree is a member of B, no? If the new C is being added deep in some function B::func(&self, ...)

I would say that func should probably also be a Tree::func(&mut self, b: BId) rather than a B::func(&self).

It would be easier to discuss this if we had a realistic scenario rather than abstract A, B, C, func, etc.

1 Like

In my case the entire system is event-driven. A receives external events that it or one of its subcomponents (B, C, etc) react to. One reaction is to create more subcomponents, e.g., B might create a new C and C might create a new D. Even pulling this into a tree doesn't seem to work:

impl Tree {
  fn react_b(&mut self, event: Event, b: &mut B) {
    if cond(event) {
      let c_id = self.cs.len();
      self.cs.push(C::new(c_id, ...))
      b.cs.push(c_id);
      let new_c = &mut self.cs[c_id];
      self.react_c(event, new_c); // <-- error here due to multiple mutable borrows of *self
  }
}

This line implies that b is not part of self because &mut is a unique (exclusive) reference. It's probably not what you want. Do this instead:

fn react_b(&mut self, event: Event, b_id: BId) {

Same here, that's why you get an error. You're trying to uniquely borrow the whole Tree and a C that is part of the tree at the same time. Instead, this will work:

    self.react_c(event, c_id);
2 Likes

So that means I can never store a local let c = &mut self.cs[c_id] in any calls later and must always use the index?

Thanks to "reborrowing" this line of code is fine.

What you can't have is a function that takes both a &mut Tree and a &mut C at the same time.

There is a good reason for this. Imagine that your react_c method creates a new C in the tree. Vec<C> might reallocate all its memory, and the C reference would still be pointing to the outdated location, creating a dangling reference. Rust protects you from this risk.

Identifiers would remain valid in this scenario, which is why storing and passing identifiers rather than references is the right thing to do.

As is usually the case with these kinds of questions, the answer is, "it depends".

While it is not possible to have aliasing exclusive borrows, it is possible to split exclusive borrows. For instance, slice::split_at_mut():

pub fn split_at_mut(&mut self, mid: usize) -> (&mut [T], &mut [T])

The input &mut self and the two &mut [T] return values have the same lifetime. This means that you cannot use &self until the two return values are dropped, but you can use either of the returned exclusive borrows anywhere you like. They don't overlap, no aliasing, no problem!

This is also a form of reborrowing. And assuming you construct your API in such a way that you can "split" your &mut self without aliasing when you need it, then it's just fine to use references on the stack instead of indexing and re-indexing everywhere.

On the other hand, if you are trying to use let c1 = &mut self.cs[0] and let c2 = &mut self.cs[1] and then use both at the same time, that won't work. In a case like this, use slice::split_at_mut() or similar to split the borrow of self.cs.

1 Like

Given that this is a psuedo-event-driven hierarchy a couple questions come up:

  1. Who is responsible for pushing through the changes through the tree? Is it Tree?
  2. Since this all becomes one giant buildable unit, the interface isn't super clear. Any recommendations for how to structure the code, if it all goes into Tree?
  3. How about a Handler trait that applies changes to the tree based on the type of event? The interface might be:
trait EventHandler {
  fn handle(&mut self, event: Event, tree: &mut Tree) -> Result<bool>;
}

Then, all members of tree must be public. Is this idiomatic Rust? Something like this also reminds of an ECS-ish system where systems mutate an entire World?
4. Often times, it makes sense for methods to be a part of the individual component. For instance, to check if something is true in A, we might need to go all the way to C. In this case, is it wise to pass along the tree itself?

impl A {
  fn check(&self, tree: &Tree) -> bool {
    self.bs.iter().all(|b_id| tree.bs[b_id].check(tree)
  }
}
impl B {
  fn check_b(&self, tree: &Tree) -> bool {
    self.cs.iter.any(|c_id| tree.cs[c_id].check(tree);
  }
}

This implies that tree components can never take a &mut Tree and only has read-only methods.

The following quote should probably be reiterated right off the bat:

It's still unclear what your actual goal is. You are building some kind of data structure that seems to resemble a trie, but there is some requirement for event processing that isn't explained at all. And it isn't immediately obvious why you would want a trie for event processing.

It does appear that you are approaching a mutate-the-world design with this thing, yes. In which case, an actual ECS will be beneficial because it can hide a lot complexity and remove foot-guns from the API surface area.

Remember that &T does not mean the data behind the reference is immutable! It means the reference is shared. That's a critical distinction to understand. You may have some interior mutability behind that reference, but (because it is shared) any mutation needs to be checked at runtime for any aliasing rule violations. (Generally that means it's a tradeoff between compile-time checking and runtime checking.)

In other words, there is no inherent implication that you need &mut T.

Still, I think it would be helpful to have a full understanding of what you are trying to accomplish before anyone would feel comfortable giving you a confident answer like "yeah, ECS is what you want."

It isn't a trie, it's a tree with parent and sibling "pointers". In my specific instance, it's modeling a database catalog, namespace -> schema -> table -> column -> attributes. This naturally forms a tree, but there are cases where the table needs to know which schema it is in.

The system itself isn't all too complicated: there is a Tokio task receiving events from a channel related to mutations to apply to the catalog. Each event requires some processing related to this tree of objects. For instance, some events update the namespace along, but some require propagating down the tree, not always to the leaves. Some events pertain to a specific component in the middle of the tree and must bubble up.

IOW: not only that's not a tree, but, in fact, we have classic data structure where you don't want to have pointers to parent.

No. They form a hierarchy and certain operation involve two levels.

And we have two possibilities:

  • Either hierarchy is just wrong and schema doesn't need to know about table (unlikely).
  • Or higherarchy is correct and then the idea that table needs to be able to change something about schema behind shema's back is wrong.

IOW: you have invented a very clever data structure which is nicely designed to eventually cause a deadlock. And Rust tried to stop you from making a mistake.

As for actual solution… there are no silver bullet. If you read the Wikipedia article it outlines few different solutions, but they all involve various manipulations with locks ordering and indexes.

Rust offers few different solutions, but since we are dealing with concurrency they would all involve locks of some form.

There are many of them in Rust:

  • Complie-time locks, &mut
  • Single-threaded locks, RefCell
  • Normal, synchronous Mutex
  • Tokio's asynchronous Mutex

And there are more.

Rust gives you plenty of rifles if you really want to blow your kneecaps badly enough. Pick your poison.

But that needs to be a conscious choice and the default option is to ensure that kneecaps wouldn't be blown out.

IOW: you are writing muti-threaded program (async is cooperative multithreading) and are trying to do that without proper locking strategy. It wouldn't work, sorry.

It wouldn't even work in C++, really, only Rust turns run-time errors (which you probably planned “to think about later”) into compile-time errors (thus you need to think about them now).

I think using Arc's and Tokio's locks may be the best compromise at the exploration stage: yes, there would be deadlocks and code wouldn't be efficient but it's almost as easy to write as equivalent C++ code and source of all issues with inefficiency and bugs are immediately and painfully visible.

You may look of what you wrote and refactor it easily. Deadlocks are implicit in your solution as currently envisioned thus you would need to think about them later, anyway.

P.S. Now can you see why people spent week and half trying to get some information from you to prevent XY problem? That's from experience: 9 times out of 10 when someone arrives on URLO that's what happens. You are investing in some wrong solution using your habits from other languages, become stuck and then it's impossible to help you because you are in the deliberate dead-end and the proper solution is to do couple of steps back.

To clarify (1) the tree is owned and updated by a single thread using messages passed through a channel (Tokio is an implementation detail, IMHO, but if you really feel strongly, replace Tokio with a background thread+channel) and (2) in the theoretical sense, this is more a graph rather than a tree. There is no concurrent modification to the tree and a deadlock is not possible (at least not a "runtime" deadlock which you differentiated).

What we're really dealing with is mutable access to the same data structure (i.e., the tree) at multiple levels of the tree. An example of a root->leaf manipulation is when creating a new schema, several default tables have to be created each with default columns. An example of a leaf->root traversal might be when deleting a column, we need to check if it belongs to a "special" table, or if the table belongs to a "special" schema. Each of those operations need access to one or two levels higher in the hierarchy. How is this typically handled in Rust? The one option is the tree/arena idea, but the tree structure becomes a ball-of-mud containing logic for all structs.

TL;DR: Rust believes in the invariants. You have to either accept that or pick the other language, ultimately.

How do you plan to prove that your structure wouldn't end up in the inconsistent state?

This depends on your answer to the previous question, of course.

Which it already is judging from your definition.

The most basic idea behind Rust is actually pretty simple: someone have to guarantee safety and ensure that data structures would be consistent.

That someone may be compiler (normal “safe” Rust) or a programmer (unsafe Rust). Depending on your choice there are different approaches.

But default approach which is used in most other languages “here's my data structure, I have no idea how it'll work, but let me write the code already” is not an option.

The usual approach for hairy cases in “safe” Rust are Rc/RefCell, Arc/Mutex (depending on whether you'll need thread-safety).

In the unsafe Rust you have piointers (but then it's your responsibility to think about safety). There are сrates which deal with graphs, which encapsulate some graph-related algorithms.

That's true only if people “outside” couldn't keep references to members of your structure. And then the decision becomes obvious, isn't it?

And if they can keep references to bits and pieces of your structure then you can not say “it's one, single, wholesome structure”. Outside users become temporary members of your data structure.

It's similar to how HashMap or Vector are behaving in Rust.