"Journaling HashMap" Does such a thing exist?

Hello,

I have a need for a collection type, and it seems like it might already exist, but I can't figure out what it might be called so I can search for it. So far, my guesses turn up unrelated things.

The key attributes are:

  • Stores key-value pairs like a HashMap
  • Can be "cloned" by reference, i.e. you can make a mutable "clone" of the original, and the original is borrowed by the operation. But the "clone" operation is cheap, even if the collection has 10,000 entries.
  • The new mutable clone can have entries added and removed, without affecting the original, which can still be read in an immutable form. (It's ok if the lookups take a little longer)
  • Coolness if you would be allowed to re-merge a clone back into the original / replace the original with the clone. (Would only be sound if there were no outstanding clones other than the one being merged back in)
  • Double-plus coolness if more complex merging operations were allowed, although conflict resolution could be a can of worms.

Basically, think of Git. Except dealing with abstract types instead of files, but Not attempting to merge within files (just handling adds and removes of entries), and living in ram as a collection type rather than manifesting in the filesystem as stand-alone software.

Does anything like this already exist? It seems like it might. What would one call something like this?

Grateful for any insight. Thank you all.

Clone always works from a reference, but cloning returns an owned instance of the same type as the borrowed variable. It cannot return a reference unless you are cloning a reference, in which case you are getting ownership of the referenced reference, so to speak. Cloning &&T would return &T, cloning &T would return T.

As far as the expense of cloning, it depends on how it is implemented. The std docs say that a type implementing Copy should have a trivial implementation of Clone that uses Copy. That way requiring that a type is Clone includes types that are both Copy and Clone.

And generally the collection cannot control what it is collecting. What if the elements are arrays of 10,000 arrays of 10,000 128-bit integers? The collection can't control the fact that the user is doing something with high time and space complexity, it can only make its own algorithmic trade-offs and hope it fits someone's use case.

It sounds like you want a HashSet or BTreeSet. HashSet and BTreeSet have the methods "difference" and "symmetric difference" to find what one has and another lacks so that they can be selectively merged. Depending on whether the values can be Copy and how large those instances are, you may or may not want Rc to wrap each value. You could make changes to what items are in the cloned collection without affecting the original, but the items themselves would be immutable. An Rc would prevent a "deep copy" of the items, so if a collection has a 10,000-byte string, the same item in the cloned collection would only use 8 bytes (on a 64-bit system) because both Rc instances would point to the same instance of String.

Generally I think you should focus on which types should have which behavior and whether they should own their data, borrow their data or own smart pointers to their data. Borrowing data is usually something that a function or closure would do, a type that borrows data can be messy if it does more than a simple one-time operation on that data. Behavior and ownership are always the most important questions for a Rust user.

1 Like

Thanks for the reply. I put "clone" in quotes because the behavior I'm thinking of is not like the rust clone trait. Maybe a better name would be "borrow_to_fork_mut" or something.

Cloning a regular HashMap is still going to make a copy of all of the HashMap internals that make it independent of the original. This is going to be an O(n) operation if the HashMap has n entries. I'm looking for a collection where the "borrow_to_fork_mut" operation is constant time regardless of the number of entries.

I think the data structures in the im crate are close to what you're looking for.

3 Likes

Thank you!! It does everything I need, and is even more flexible than what I had in mind! I love it when that happens!

1 Like

Cloning a regular HashMap is still going to make a copy of all of the HashMap internals that make it independent of the original.

That's not necessarily true. Not all types are Clone, and a copy/clone is not a hard and fast operation. It depends on the implementation. It will call "clone" on its contents but the contents could just be pointers, it's not necessarily a deep copy. If the values are Rc<Cell> then mutating the value in one place will be observable in all of its clones because there is only one Cell. In that case the cell cannot be copied deeply because of the behavior of the type that owns it.

You can't translate "copy" or "clone" to the concept of a deep copy. The operation is performed by the outermost type. Whether it uses the operation of its internal type or has its own implementation is an important distinction.

What you are describing is called a persistent data structure, more specifically a persistent map. Here's a crate with some such data structures.

1 Like

Excellent! Thanks for the crate! It's good to have multiple options. Thanks also for the name of the pattern. Now that I know what to search for, it turns out there is a wikipedia page on this idea. :slight_smile: https://en.wikipedia.org/wiki/Persistent_data_structure