D-way heap implementation ownership issue

Hi all,

I'm trying to some of the data structures from GitHub - mlarocca/AlgorithmsAndDataStructuresInAction: Advanced Data Structures Implementation in Rust.

At the moment while looking at d-way heap I've hit a snag around the bubbleUp and pushDown methods. Using bubbleUp as an example (see here), the method

  • Moves the desired element to the correct position (at the end after swapping the parents as necessary)
  • Also stores the desired element in a hashmap for performance gains when using a contains method

Looking at the std library's implementation of binary heap I've figured out how to move the element / parents position as needed (using the Hole struct) but I'm not entirely sure what the best way to represent what I understand to be a multiple ownership here.

Currently what I have is the following, using clone, full example can be found here:

    fn bubble_up(&mut self, index: usize) {
        let current = self.elements[index].clone();
        let mut bubble_up_index = index;
        let mut parent_index;
        while bubble_up_index > 0 {
            parent_index = self.get_parent_index(bubble_up_index);
            let parent = self.elements[parent_index].clone();
            if current > parent {
                self.element_positions.insert(parent.clone(), parent_index);
                self.elements[bubble_up_index] = parent;
                bubble_up_index = parent_index
            } else {
                break;
            }
        }
        self.element_positions
            .insert(current.clone(), bubble_up_index);
        self.elements[bubble_up_index] = current;
    }

Any advice on the best way to represent the ownership of the element between the vector and hashmap?

NB I'm aware I'm looking for Ord trait instead of Eq + Hash + Clone + PartialOrd as another improvement.

Typically, you don't want to perform direct indexing in Rust if you need ownership. Accordingly, I started rewriting your code, e.g. heapify could be done without cloning, using Vec::drain():

    fn heapify(&mut self) {
        let element_length = self.elements.len();
        let parent_index = self.get_parent_index(element_length - 1);
        let starting_index = parent_index + 1;
        let range = starting_index..element_length;
        for (index, elem) in range.clone().zip(self.elements.drain(range)) {
            self.element_positions
                .insert(elem, index);
        }

        for index in (0..parent_index).rev() {
            self.push_down(index);
        }
    }

However, I noticed later down the line that bubble_up() is implemented in a way that you always maintain a copy of some (all? one?) elements, since you unconditionally insert into self.element_positions and self.elements at the end of the method, so I gave up at this point.

What's the purpose of element_positions anyway? It's very strange to see an auxiliary hash map holding copied of elements in a heap implementation.

1 Like

Yes, normally heap implementations don't have the auxiliary hash map, the author of the book has done a python implementation that doesn't include as such, see here.

The reason for the hashmap is for performance gains as outlined on lines 25-30 of the java implementation, specifically around supporting a contains method that is performant - O(1) compared to O(n).

 * Performance:
 ...
 * - contains requires O(1) average time, because this implementation uses a Hash Table to support fast contains (and fast update).
 * - update, remove requires O(log n) average time, because they can leverage the "fast" contains implementation.

How to best represent this in Rust on the other hand is what I'm wondering, it seems like a multiple ownership scenario but not convinced I want something like Rc / Arc, especially when there's a Vec of elements already.

When you grow the Vec you'll invalidate any pointers to elements of the Vec, so you need some sort indirection (e.g. storing Boxes in the Vec) at least. Or an arena-based Vec that never moves on reallocation. Without cloning or some sort of encapsulated shared ownership (Arc), you'd need some sort of unsafe to boot, I think; you'll be emulating shared ownership or implementing a self-referential struct.

Alternatively I guess you could hash the element and store the hash in a HashMap<TheHash, Vec<usize>> on the premise that collisions are rare. You'll be double hashing [1] and it will make your O(1) an O(n) with a high-collision Hash implementation though, and e.g. removing from the HashMap becomes more cumbersome.


I think the general gist is that this is much more approachable in Java because you've already paid the price of indirection and/or shared ownership or whatever by mere choice of langauge. You only care about putting all your T in Arc because it's explicit and not everything is in an Arc<Mutex<_>> already. In your place, if I had the time, I'd probably do a version without the HashMap and a version with it using Arc and see if there's an appreciable difference in performance for a representative workload.


  1. but I think there's a way not to with hashbrowns raw entries? I didn't confirm this ↩ī¸Ž

3 Likes

That mostly makes sense (needing shared ownership due to the Vec resizing invalidating pointers / double hashing option).

I'm slightly confused about what the below would look like:

something like:

struct Heap {
   element: Vec<T>,
   branching_factor: usize,
   element_positions: Arc<HashMap<T, usize>>
}

?

I've not used Arc much before and trying to get my head around how this'd look in practice. Thanks for the response.

You want both Ts in Arcs as that's what will have shared ownership:

struct Heap<T> {
   element: Vec<Arc<T>>,
   branching_factor: usize,
   element_positions: HashMap<Arc<T>, usize>
}

I guess some background before continuing:

  • Arc (and Rc and so on) only give out shared references (&, not &mut)
  • But you can put a Mutex or RwLock in an Arc or RefCell in an Rc to get temporary handles that let you get a &mut
  • They can be cheaply cloned to create a new strong reference ("shared owner")
  • So if you give away an &Arc, &Rc (or Arc or Rc directly), you'll lose control of how many "shared owners" there are; but if you don't, you won't

But note that your heap is going to have similar constraints as a hash map (even if you don't utilize a hash map): the entries changing order while in the data structure is a logic error. You can't prevent this by the way -- the user could have their T item be an Arc<Mutex<U>> themselves and change things behind your back.

However, it also means there's no reason for your API to expose &mut T to the consumers, so you don't have to worry about "should I also use a Mutex and what does my API look like then" or any of that.

Instead, you could have an API where you don't expose that you're using Arc at all. By not exposing the Arcs, the consumer won't be able to clone them; you will control how many owners (strong references) there are. Because of this, you'll be able to unwrap the item from the Arc when you get it back out (you can get a T out of an Arc<T> if there's only one strong reference left). It will make comparing different implementations easier to boot.

Here's a sketch of that. I didn't test it at all but it compiles :wink:

An alternative is to be explicit that you're using an Arc, and consume and return Arc<T> instead. This avoids moving things in and out of Arcs and will be more efficient if the consumer doesn't move things in and out of Arcs themselves. You could probably just replace T with Arc<T> in all your function signatures.

1 Like

Thank you so much, that makes a lot more sense now!

I got the none HashMap version implemented today (with some guidance of the std implementation) and I'll give this a shot at some point and compare the two :slight_smile:

I may need the mutex for the update method as seen here but one step at a time I think :slight_smile: