How to efficiently insert something into HashMap?

Let's say I have HashMap<String, i32>. I do not understand how I am supposed to:

  • insert 42 if the key "key" does not exist
  • increment the existing value if the key "key" exists.

1st approach: use entry:

map.entry(key.clone()).or_insert(value);

but it will clone my key regardless of whether it actually needs to: entry takes the key by value!
What if such an entry already exists? Why do I have to clone anything?
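The full pattern I actually want would presumably look something like this with entry (using and_modify; it still clones the key unconditionally):

map.entry(key.clone()).and_modify(|v| *v += 1).or_insert(42);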

2nd approach would be to check if the key exists via contains_key:

    if !map.contains_key(&key) {
        map.insert(key, value);
    }

but this is much, much worse:

  • now it has to compute the hash twice and find the slot (apply the modulo) twice (once for contains_key and once for insert), as opposed to just cloning
    • and modulo is a super expensive operation for the CPU
  • and this inefficiency is hidden behind innocent-looking code.

What am I missing?

Can I just insert something if it does not exist, or get the existing value, in Rust (without doing a ton of unnecessary work)?

P.S.

ChatGPT suggests using raw_entry_mut; I am wondering if this is really the only option we have for one of the most common operations on a HashMap?...

// Nightly only: raw_entry_mut is behind the unstable hash_raw_entry feature.
use std::collections::HashMap;
use std::hash::{BuildHasher, Hash, Hasher};

fn insert_if_absent<K, V, S>(map: &mut HashMap<K, V, S>, key: &K, value: V)
where
    K: Eq + Hash + Clone,
    V: Clone,
    S: BuildHasher,
{
    // Create a hasher from the HashMap's hasher
    let mut hasher = map.hasher().build_hasher();
    // Hash the key
    key.hash(&mut hasher);
    let hash = hasher.finish();

    // Access the raw entry
    let entry = map.raw_entry_mut().from_hash(hash, |k| *k == *key);
    match entry {
        std::collections::hash_map::RawEntryMut::Occupied(_) => {
            // Entry already exists, do nothing
        },
        std::collections::hash_map::RawEntryMut::Vacant(v) => {
            // Entry does not exist, insert the cloned key and value
            v.insert_hashed_nocheck(hash, key.clone(), value);
        }
    }
}
6 Likes

Nobody actually uses modulo, though. All modern hash maps create arrays of a size that is a power of two, so the modulo operation is just a bitwise AND.
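For illustration (the actual table layout in std is an internal detail, this is just the arithmetic):

// for a power-of-two capacity, masking gives the same bucket index as a modulo
let capacity: u64 = 16;
for hash in [3u64, 17, 1_000_003, u64::MAX] {
    assert_eq!(hash % capacity, hash & (capacity - 1));
}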

Also note that you only need to perform the duplicate work on insertion. Subsequent accesses will only need to perform a single lookup:

if let Some(value) = map.get_mut(&key) {
    *value += 1;
} else {
    map.insert(key.to_owned(), 1); // never hit more than once per key
}

Anyway, exactly what you want will be possible with the raw entry API.

5 Likes

Yeah, I hoped it would be an & operation instead of %, but it probably does not change much.

need to perform the duplicate work on insertion

Yes, but now imagine that you have this pattern in a loop (given a key K: if it exists, increment the value, otherwise insert 1). You will pay this cost on every iteration.

will be possible with the raw entry API.

Fabulous! I am just a bit surprised that one of the most common use-cases of HashMap requires so much code and is not (yet) available on stable.

2 Likes

The short way using the raw entry API would be this:

map.raw_entry_mut().from_key(key).or_insert(key.to_owned(), value)

For this case it becomes something like this:

*map.raw_entry_mut().from_key(key).or_insert(key.to_owned(), 41).1 += 1;

(The 41 feels a bit weird, but if the initialization is something like 0 or Vec::new() then it doesn't look that bad).
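For example, with a Vec value the same shape would be roughly (item is just a placeholder here):

map.raw_entry_mut()
    .from_key(key)
    .or_insert(key.to_owned(), Vec::new())
    .1
    .push(item);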


Yes, it's unfortunate that this is locked behind an unstable feature. The problem is that bikeshedding this API is really hard and no consensus has been reached. For example the current API has a serious problem that it takes the key twice (once in the from_key and once in the or_insert), so what happens when someone passes two different keys? It is also trying to do too much at once (both handling exactly your case AND more low-level use-cases where people want to precompute hashes and use custom equality functions).

Another option would be adopting hashbrown's approach with entry_ref, though it seems to have its own problems too (for example, using From instead of something like ToOwned).
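Roughly, that would look something like this (a sketch assuming a recent hashbrown with entry_ref; the &str is only converted into an owned String, via From<&str>, when the entry is vacant):

use hashbrown::HashMap;

fn bump(map: &mut HashMap<String, i32>, key: &str) {
    // no allocation when the key is already present
    map.entry_ref(key).and_modify(|v| *v += 1).or_insert(42);
}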

8 Likes

bikeshedding this API is really hard and no consensus has been reached. For example the current API has a serious problem that it takes the key twice (once in the from_key and once in the or_insert)

Yeah, makes sense. What about an API like this?
This is what I would be looking for as a user:

use std::hash::Hasher;

fn upsert<K, V, Q, S, Insert, Modify>(map: &mut std::collections::HashMap<K, V, S>,
                                      key: &Q,
                                      insert: Insert,
                                      modify: Modify)
   where K: Eq + std::hash::Hash + Clone,
         S: std::hash::BuildHasher,
         K: std::borrow::Borrow<Q>,
         K: for<'a> From<&'a Q>,
         Q: std::hash::Hash + Eq + ?Sized,
         for<'a> &'a Q: Into<K>,
         V: Clone,
         Insert: FnOnce() -> V,
         Modify: FnOnce(&mut V)
{
   let hash = {
      let mut hasher = map.hasher().build_hasher();
      key.hash(&mut hasher);
      hasher.finish()
   };


   let entry = map.raw_entry_mut().from_hash(hash, |k| *k.borrow() == *key);

   match entry {
      std::collections::hash_map::RawEntryMut::Occupied(mut entry) => {
         modify(entry.get_mut());
      }
      std::collections::hash_map::RawEntryMut::Vacant(v) => {
         let key = key.into();
         v.insert_hashed_nocheck(hash, key, insert());
      }
   }
}

#[cfg(test)]
mod tests {
   use super::*;
   use pretty_assertions::assert_eq;

   #[test]
   fn test_upsert_inserts_if_does_not_exist() {
      let mut hash: std::collections::HashMap<String, i32> = Default::default();
      upsert(&mut hash, "key", || 42, |v| *v += 1);
      assert_eq!(*hash.get("key").unwrap(), 42);
   }
   #[test]
   fn test_upsert_modifies_if_key_exists() {
      let mut hash: std::collections::HashMap<String, i32> = Default::default();
      upsert(&mut hash, "key", || 42, |v| *v += 1);
      upsert(&mut hash, "key", || 42, |v| *v += 1);
      assert_eq!(*hash.get("key").unwrap(), 43);
   }
}

The problem I see with this API is that if your insert/modify operation needs to move some data, then you're always forced to do so, even when the other branch is the one that actually runs. This becomes even more of a problem when both operations need to move the same data, in which case you'll get an error because you can't move it into both closures at the same time (consider for example the case where the value type is something like Vec<SomeNonCopyType> and you want to push something to that Vec, or initialize it with the element to push if it doesn't exist in the map). This is the reason the entry API uses an enum for the two cases: it keeps the normal control flow, and the compiler can reason about whether something is conditionally moved or not.
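A sketch of the problem (push_item and item are made-up names):

fn push_item(map: &mut std::collections::HashMap<String, Vec<String>>, key: &str, item: String) {
    // This would not compile: `item` would have to move into both closures.
    // upsert(map, key, || vec![item], |v| v.push(item));

    // With the entry enum the compiler sees that only one branch runs,
    // so a single move is fine (at the cost of the eager key clone):
    match map.entry(key.to_owned()) {
        std::collections::hash_map::Entry::Occupied(mut e) => e.get_mut().push(item),
        std::collections::hash_map::Entry::Vacant(e) => {
            e.insert(vec![item]);
        }
    }
}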

It also requires From<&'a Q>, which was one of the downsides of hashbrown's entry_ref, as well as K: Clone and V: Clone, which should not be necessary.

3 Likes

Wow, these are pretty cool insights! Thanks!

The point is that once you have visited a key, all subsequent accesses to that key will not hit the slow path, and its cost will be amortized over all accesses to the key.

If you are updating every key, then the relative cost is only significantly higher if you only perform one pass over all keys – in which case it shouldn't matter anyway (and this is also not a typical/realistic scenario). If you are making many passes over all keys, then the more you iterate, the less expensive the operation becomes compared to the ideal.

2 Likes

Ah, got it now, thanks.
Indeed, that makes the situation a bit better.

This will clone the key even when it already exists. Use or_insert_with to avoid that.
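I.e. something like this, where the (key, value) pair is only built when the entry is vacant:

*map.raw_entry_mut()
    .from_key(key)
    .or_insert_with(|| (key.to_owned(), 41))
    .1 += 1;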

11 Likes

I think it could be "cleaner" to have the upsert API insert an "empty" value for new keys and then always run the update function, like adding numbers starting from 0, or pushing to a vec starting from an empty vec. It would also solve the "cannot move into both closures" problem. But there are probably cases where you would want to put in a "special" value on creation, meaning that both variants would need to exist to satisfy all users, which would just be confusing for everyone, so there are no "easy" solutions.
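Just to show that shape (ignoring the key-cloning question; the stable entry API plus Entry::or_default already gets close, names here are made up):

fn update_or_default<K, V, F>(map: &mut std::collections::HashMap<K, V>, key: K, update: F)
where
    K: std::hash::Hash + Eq,
    V: Default,
    F: FnOnce(&mut V),
{
    // missing keys get V::default(), then the single update closure always runs
    update(map.entry(key).or_default());
}

// e.g. counting: update_or_default(&mut counts, name.to_owned(), |n| *n += 1);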

Looking at this from another angle, it looks like an XY problem. If this code is in a hot path and you care about performance, you probably don't want to allocate a new hashmap every time. Especially if the keys are going to be mostly the same, you could be better off setting every value to 0 instead of allocating a new hashmap. You would probably need to make a new one from time to time to get rid of old keys, but it's worth testing whether it would be faster in your case.
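For example, something like this (counts is a placeholder name):

// reuse the allocation (and the already-inserted keys) instead of building a new map
for value in counts.values_mut() {
    *value = 0;
}
// or, to also drop the old keys while keeping the capacity:
// counts.clear();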

3 Likes

I think the solution to this question should be #2: .get_mut(). This is the most efficient stable way of using HashMaps. I use it like this:

match counts.get_mut(name) {
    None => {
        counts.insert(name.to_owned(), 1);
    }
    Some(n) => *n += 1,
}
1 Like

So, what is the point of calculating the hash of name twice?
After you have calculated the hash, why do you traverse the items in the buckets twice? Depending on the implementation, just reading the bucket locations might already mean extremely costly random memory jumps.
Now that you have the bucket contents in cache, why do you need to compare name with whatever is in the buckets (potentially several items, in case of hash collisions), twice? And by the way, if your keys are strings, that is one more indirection (== one more random memory jump), because a string is essentially a char *.

Like I said above, this is likely much worse than the 1st option.

In the 1st option you do pretty much the same as above, but only once, plus one unconditional clone of the key.

1 Like

This is exactly the same as my solution, to which the explanation has already been posted.

1 Like

Like @paramagnetic has said, it won't be "much worse" because you only perform duplicate work on insertions, i.e., the first time you've seen an item. All subsequent accesses will only perform a single lookup.

P.S.: I reckon this should be better than always cloning the key needlessly.

2 Likes

Agree.

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.