Recommendation for mapping flattened related objects into hierarchies

I've migrated the following javascript algorithm for mapping lists of flattened objects into their hierarchical form: javascript - Nest flat objects based on a parentId and id property - Stack Overflow

I've used samply and the Vec allocations are 98% of the CPU time:

<alloc::vec::Vec as alloc::vec::spec_from_iter::SpecFromIter<T,I>>::from_iter [nester]

18GB of ram isn't enough to process around 30,000 nodes.

Are there any recommended third-party solutions for these kinds of transformations? Left to my own devices, I'd imagine maybe a non-recursive implementation would gain me the biggest improvement, but that's a lot more to code so I thought I'd check here first.


pub struct TreeItem<'a> {
    pub parent: Option<&'a str>,
    pub identifier: &'a str,
    pub children: Vec<TreeItem<'a>>
}

pub fn build_tree<'a>(
    id: &'a str,
    pid: &'a str,
    store: &FxHashMap<&str, TreeItem<'a>>,
    rels: &FxHashMap<&str, Vec<&'a str>>,
) -> TreeItem<'a> {

    let mut data = store.get(id).unwrap().clone();

    if let Some(children) = rels.get(id) {
        data.children = children
            .into_iter()
            .map(|c| build_tree(c, id, store, rels))
            .collect();
    }

    if data.parent.is_some() {
        data.parent = Some(pid);
    }

    data
}

I'm guessing that this cloning is not happening in the JS version, and instead references are being shared. If so, use an Rc or Arc to do the sharing in the Rust version.

1 Like

This function is not just “Vec allocations” — besides allocating the vector, it is also responsible for running the iterator being collected, which in your case includes calling build_tree() via the Iterator::map() adapter. So, time spent in this function is time spent running your entire recursive algorithm.

1 Like

So I had to delete my earlier posts. It turns out the runtime improvement was due to unused variables causing the performance bottleneck never to even run.

Furthermore, there is a bug in using borrow_mut recursively:

@jumpnbrownweasel do you have any suggestions for how to mutably borrow for this particular case, given its recursive?

@jumpnbrownweasel here's an update on the current code attempt (thanks for offering to look):


pub struct TreeItem<'a> {
    pub parent: Option<&'a str>,
    pub identifier: &'a str,
    pub children: Vec<Rc<RefCell<TreeItem<'a>>>>,
}

pub fn build_tree<'a>(
    id: &'a str,
    pid: &'a str,
    store: Rc<RefCell<FxHashMap<&str, TreeItem<'a>>>>,
    rels: Rc<RefCell<FxHashMap<&str, Vec<&'a str>>>>,
) -> Rc<RefCell<TreeItem<'a>>> {
    let mut borrow_store = store.borrow_mut();
    let data_inner = borrow_store.get_mut(id).unwrap();
    let data = &mut Rc::new(RefCell::new(data_inner));

    if let Some(children) = rels.borrow().get(id) {

        let children_to_add: Vec<Rc<RefCell<TreeItem<'a>>>> = children
            .into_iter()
            .map(|c| {
                build_tree(c, id, Rc::clone(&store), Rc::clone(&rels))
            })
            .collect::<Vec<_>>();

        data.borrow_mut().children = children_to_add;
    }

    if data.borrow().parent.is_some() {
        data.borrow_mut().parent = Some(pid);
    }

    Rc::clone(&data)
}

Please post your current code so we can work with that, rather than with an outdated version (this often results in an answer that no longer applies).

But in general you may be able to pass the Rc down, rather than passing down the borrowed value, when you recurse.

1 Like

Thank you and sorry for the churn. This is the most complex Rust I've waded through. Updated my previous comment with the state of the code.

1 Like

Is your build_tree intended to be a combination of the two steps (doing them at the same time) that are done one at a time in the JS version?:

  • Populate store, rels and roots from the input data array.
  • Use that info to populate the children field.

Theoretically there's maybe a rain-drops chance they could be done at the same time, but that's a whole 'nother can of worms I assume wouldn't really be feasible.

The reason is because I parse these flattened trees from a Lalrpop program where there's no guaranteed order. As such, I can't possibly know all of the children for a given node until I've traversed the entire directory of files accumulating them by parent/child keys.

Example file (trivialized):

parent1
  has_child_1
  has_child_2


child3
  related_to_parent_1   // these relationships count hierarchically the same as child1/child2
  has_child_4
    has_child_5

yields:

parent1
  child1
  child2
  child3
    child4
    child5

That's what I would have thought, but I don't see your code for doing the first step, that's why I asked. I'm just trying to understand that method.

The error you're getting is on this line:

            Rc::clone(&data)
         Rc::clone(&data)
         --------- ^^^^^ expected `&Rc<RefCell<TreeItem<'_>>>`, found `&Rc<RefCell<&mut TreeItem<'_>>>`

So data contains a &mut TreeItem reference. According to the return value type, it should contain a TreeItem value. That's what causes the error.

Since we want to share the TreeItems, we need to store an Rc<RefCell<TreeItem>> everywhere we store a TreeItem, including in the store map. That way we can just increment its ref count (by cloning it) when we get it out of store, modify it to populate its children, and return it.

However, we don't need an Rc<RefCell<...>> around the two maps. These would be prepopulated by the first step, I assume, and here we only need read-only access to them. With read-only access to the store map, we can still get a mutable ref to the TreeItem via its container RefCell. Therefore, we can pass the shared map refs down when recursing.

Here is modified code that compiles for me, although I haven't tried to verify that it works as intended.

pub struct TreeItem<'a> {
    pub parent: Option<&'a str>,
    pub identifier: &'a str,
    pub children: Vec<Rc<RefCell<TreeItem<'a>>>>,
}

pub fn build_tree<'a>(
    id: &'a str,
    pid: &'a str,
    store: &FxHashMap<&str, Rc<RefCell<TreeItem<'a>>>>,
    rels: &FxHashMap<&str, Vec<&'a str>>,
) -> Rc<RefCell<TreeItem<'a>>> {
    let data = Rc::clone(store.get(id).unwrap());

    if let Some(children) = rels.get(id) {
        let children_to_add: Vec<Rc<RefCell<TreeItem<'a>>>> =
            children.iter().map(|c| build_tree(c, id, store, rels)).collect::<Vec<_>>();

        data.borrow_mut().children = children_to_add;
    }

    if data.borrow().parent.is_some() {
        data.borrow_mut().parent = Some(pid);
    }

    data
}

I question whether the &'a str lifetimes are correct, but that's a separate concern and maybe we can talk about it later. It depends on where the owned Strings will live, and how you'll manage that owner.

PS. Also note that I used iter instead of into_iter. Clippy gave me a warning when using into_iter because there is no need to consume a container of refs. See clippy into_iter_on_ref.

BTW, what's the purpose of the code above? It looks like it does nothing, since the parent field already contains the parent id.

it's my interpretation of this comment's suggestion, which seems to be right:

There is a bug where Browser:Chrome gets output thrice but in all cases its parentId is Version:XP (i.e. Catalina, Mojave are missing). This can be mitigated by overwriting the parentId, but the root of the issue is store overwrites, keeping the last.

oof, I was close. Thank you!

However, this is a bit more of a breaking change for me because I later recurse (again) like so:

pub fn draw<'a>(items: &'a [Rc<RefCell<TreeItem<'a>>>]) {
  for item in items {
    // do stuff...
    let x = item.borrow().children;
    draw(&x)
  
  }

}

which leads to:

error[E0597]: `x` does not live long enough
   --> src/draw.rs:112:29
    |
35  | pub fn draw<'a>(
    |             --- lifetime `'a` defined here
...
110 |         let x = item.borrow().children;
    |             - binding `x` declared here
111 |
112 |             draw(&x);
    |             -----^^----------
    |             |               |
    |             |               borrowed value does not live long enough
    |             argument requires that `x` is borrowed for `'a`
113 |         }
    |         - `x` dropped here while still borrowed

Before, children I could simply reference with the same lifetime like &item.children. So how can I still eat that cake?

That's an unrelated problem. Type &'a X<'a> is an antipattern. Except in unusual situations (e.g., using an allocation arena), never use the same lifetime for a reference and for the item referenced. Use separate lifetimes. Below the elided lifetime '_ will be assigned a fresh/different lifetime. (No need to assign lifetime names here, they just need to be different.)

pub fn draw(items: &[Rc<RefCell<TreeItem<'_>>>]) {
    for item in items {
        // do stuff...
        let x = &item.borrow().children;
        draw(x)
    }
}

In addition the following was moving the Vec out of the children field. Instead (as above) obtain a reference to the Vec.

    let x = item.borrow().children; // <<< added `&` here

PS. The antipattern I mentioned above is severe when a &mut reference is used. When a & shared reference is used it is less severe but was still causing the problem you ran into. The shared version of the problem is linked from the doc I mentioned.

1 Like

arrgh! I knew that... Honestly I gotta write some clippy rules to highlight every single one of those idioms - sooo many hours lost due to simply forgetting.

So congratulations - you've officially saved another soul from spelunking way beyond their understanding of the Rust language. Mutating a borrowed value recursively is a terrifying software pattern I now know to avoid like the plague.

runtime (for real) has gone from 7.9 seconds down to 1.5 seconds.

89% of the runtime is now:

<alloc::vec::Vec as core::iter::traits::collect::FromIterator>::from_iter

Does that mean rearchitecting this as non-recursive would be the next biggest bang (avoiding collect)? If so, are there recursive-to-nonrecursive conversion frameworks recommended?

Or... my other thought is I could try to pass around Iterator<Item = Rc<RefCell<TreeItem...>>>?

:joy:

I think the way those stats work is that they count the time for all the nested calls. The from_iter call is the driver of the chain of iterator adapters, including the map that calls build_tree. So this high number includes all the nested calls to build_tree and it makes sense this is the majority of the time.

I'd look for a leaf call (one that does not recurse) that takes more time than the other leaf calls.

This won't reduce the time a huge amount, but one thing you could try is using smallvec instead of Vec. If you expect that the number of children will usually be 10 or less, for example, you can avoid an allocation when this is true by creating a SmallVec::<[_; 10]>. It will allocate only when the number of children is greater than 10. There is wasted space when the number is less than 10 because the base array is fixed size, but this is probably a smaller cost than the allocation. And when the number is greater than 10, the allocation cost is probably warranted. But to know whether it helps you just have to try it. Luckily SmallVec has mostly the same API as Vec, so it's easy to try.

1 Like

One new feature I want to add is the ability to mutate item (not just borrow it). As in:

item.borrow_mut().identifier = some_new_text;

this compiles, however when run I get:

Already borrowed: BorrowMutError

I don't understand why this wouldn't get caught at compile time?

Also, is there some magical way to mutate without cloning? Cloning works of course:

let item_clone = Rc::new(RefCell::new(TreeItem::new(
                item.borrow().parent.clone(),
                some_new_text,
                // take(&mut item.borrow_mut().children),
                item.borrow().children.clone(),
            )));

Borrows using RefCell use runtime checks, not compile time checks. RefCell is used when compile time checks are too restrictive.

https://doc.rust-lang.org/std/cell/index.html#refcellt

RefCell<T> uses Rust’s lifetimes to implement “dynamic borrowing”, a process whereby one can claim temporary, exclusive, mutable access to the inner value. Borrows for RefCell<T>s are tracked at runtime, unlike Rust’s native reference types which are entirely tracked statically, at compile time.

Thanks! So is cloning the only option here? There's no way to mutate a single field without hitting this Already borrowed error?

I'm guessing I would need instead something like?:

pub struct TreeItem<'a> {
    pub parent: Option<&'a str>,
    pub identifier: &'a str,
    pub children: Rc<Vec<Rc<RefCell<TreeItem<'a>>>>>,
}

So that I avoid at least cloning children

To mutate via RefCell, there must be no other active borrows. One thing to try (short of changing the algorithm) is to look for borrows that are held longer than necessary. You can drop the guard to end/close the borrow.


Some examples of cases where you may be able to avoid a borrow that conflicts with another borrow are below. In general the idea is to borrow for the minimal time that is needed.

  • If you have a stored data structure containing a reference type that is borrowed, then change the type of that field from a reference type to a type containing the RefCell. Then borrow when you need to borrow, but no longer.

  • If you have a section of code where you borrow, but then you don't need the borrow for the rest of that section, then drop(...) the variable that is borrowing. This will clear the "borrowed" status of the RefCell, allowing other code to perform a borrow that would have conflicted. You can borrow it again later when you need it, including later in the same section of code.

  • Pay particular attention to mutable borrows, since they prevent all other borrows. OTOH, there can be multiple shared borrows that coexist.

Please feel free to ask for clarifications or further explanation on any of this. That's the purpose of this forum. What would help, however, is when you have an error, please provide the code that contains the error with enough of the surrounding code that we can compile it and see the same error that you're seeing.

  • One way is to link to a repo that contains the code.
  • Another way is to add the code to a playground and share it by clicking the share button. After you do that you can copy the URL from the browser's URL field, or click the icon next to "Permalink to the playground" to copy the URL, or right click on "Permalink to the playground" and pick "Copy Link" from the menu. Post the URL here.
  • If nothing else, it is Ok to post the full compilable code here if it is not huge. If the code uses other crates, also post the dependencies from your Cargo.toml.