Let's say I have a HashMap<String, i32>. I do not understand how I am supposed to:
insert 42 if the key "key" does not exist,
increment the existing value if the key "key" exists.
1st approach: use entry:
map.entry(key.clone()).or_insert(value);
but it will clone my key regardless of whether it needs to or not, because entry takes the key by value!
What if such an entry already exists? Why do I have to clone anything?
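Spelled out for the full increment-or-insert case, the entry-based version would be something like this (a sketch; either form still clones the key unconditionally):
// Increment if present, otherwise insert 42; the key is cloned either way.
map.entry(key.clone()).and_modify(|v| *v += 1).or_insert(42);
// Equivalent shorthand: start at 41 so the unconditional increment lands on 42.
*map.entry(key.clone()).or_insert(41) += 1;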
2nd approach would be to check if key exists via contains:
if !map.contains_key(&key) {
    map.insert(key, value);
}
but this is much, much worse:
now it has to compute the hash twice and find the slot (via modulo) twice (once for contains_key and once for insert), as opposed to just cloning,
and modulo is a super expensive operation for the CPU,
and this inefficiency is hidden behind innocent-looking code.
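For comparison, the double-lookup version written out for the full case would look something like this (a sketch; no clone, but two lookups whenever it has to insert):
if let Some(v) = map.get_mut(&key) {
    // First lookup found the key: just increment in place.
    *v += 1;
} else {
    // Second lookup: hash and probe again to insert the new entry.
    map.insert(key, 42);
}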
What am I missing?
Can I just insert something if it does not exist, or get the existing value, in Rust (without doing a ton of unnecessary work)?
P.S.
ChatGPT suggests using raw_entry_mut; I am wondering if this is really the only option we have for the most common operation on a HashMap?...
#![feature(hash_raw_entry)] // nightly-only: the raw entry API is unstable
use std::collections::hash_map::{HashMap, RawEntryMut};
use std::hash::{BuildHasher, Hash, Hasher};

fn insert_if_absent<K, V, S>(map: &mut HashMap<K, V, S>, key: &K, value: V)
where
    K: Eq + Hash + Clone,
    V: Clone,
    S: BuildHasher,
{
    // Create a hasher from the HashMap's hasher and hash the key once.
    let mut hasher = map.hasher().build_hasher();
    key.hash(&mut hasher);
    let hash = hasher.finish();
    // Look up the raw entry using the precomputed hash.
    match map.raw_entry_mut().from_hash(hash, |k| *k == *key) {
        RawEntryMut::Occupied(_) => {
            // Entry already exists, do nothing.
        }
        RawEntryMut::Vacant(v) => {
            // Entry does not exist: clone the key and insert the value,
            // reusing the hash computed above.
            v.insert_hashed_nocheck(hash, key.clone(), value);
        }
    }
}
Yeah, I hoped it would be an & operation instead of %, but it probably does not change much.
need to perform the duplicate work on insertion
Yes, but now imagine that you have this pattern in a loop (given key K, if it exists, increment value or insert 1). You will pay this cost every iteration.
(The 41 feels a bit weird, but if the initialization is something like 0 or Vec::new() then it doesn't look that bad).
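For concreteness, a sketch of such a loop (counting occurrences with the entry API; count_occurrences is just an illustrative name):
use std::collections::HashMap;

fn count_occurrences(words: &[String]) -> HashMap<String, i32> {
    let mut counts = HashMap::new();
    for word in words {
        // The clone runs on every iteration, even on the common path where
        // the entry already exists and only needs to be incremented.
        *counts.entry(word.clone()).or_insert(0) += 1;
    }
    counts
}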
Yes, it's unfortunate that this is locked behind an unstable feature. The problem is that bikeshedding this API is really hard and no consensus has been reached. For example, the current API has a serious problem that it takes the key twice (once in the from_key and once in the or_insert), so what happens when someone passes different keys? It is also trying to do too much at once (both handling exactly your case AND more low-level use cases where people want to precompute hashes and use custom equality functions).
Another option would be adopting hashbrown's approach with entry_ref, though it seems it has its own problems too (for example, using From instead of something like ToOwned).
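For reference, a sketch of what that looks like with the hashbrown crate's entry_ref (bump is a hypothetical helper; this is hashbrown's HashMap, not std's):
use hashbrown::HashMap;

fn bump(map: &mut HashMap<String, i32>, key: &str) {
    // entry_ref borrows the key; an owned String is only built (via From)
    // when the entry is vacant and actually needs to be inserted.
    map.entry_ref(key)
        .and_modify(|v| *v += 1) // key present: increment in place
        .or_insert(42); // key absent: String::from(key), then insert 42
}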
bikeshedding this API is really hard and no consensus has been reached. For example the current API has a serious problem that it takes the key twice (once in the from_key and once in the or_insert)
Yeah, makes sense. What about an API like this?
This is what I would be looking for as a user:
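Something along these lines, as a rough sketch (the name, signature, and bounds here are assumptions about the kind of API meant, not an actual proposal; the Clone bounds are included only because the replies below discuss them):
use std::borrow::Borrow;
use std::collections::HashMap;
use std::hash::{BuildHasher, Hash};

fn upsert<'a, K, Q, V, S>(
    map: &mut HashMap<K, V, S>,
    key: &'a Q,
    insert: impl FnOnce() -> V,
    modify: impl FnOnce(&mut V),
) where
    K: Eq + Hash + Clone + Borrow<Q> + From<&'a Q>,
    Q: Eq + Hash + ?Sized,
    V: Clone,
    S: BuildHasher,
{
    // Semantics only; the point of such an API would be a single lookup,
    // with the owned key built from &Q only when inserting.
    if let Some(v) = map.get_mut(key) {
        modify(v);
    } else {
        map.insert(K::from(key), insert());
    }
}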
The problem I see with this API is that if your insert/modify operation needs to move some data, then you're always forced to do so, even if it's actually the opposite case. This becomes even more of a problem when both operations need to move the same data, in which case you'll get an error because you can't move it into both closures at the same time (consider, for example, the case where the value type is something like Vec<SomeNonCopyType> and you want to push something to that Vec, or initialize it with the element to push if it doesn't exist in the map). This is the reason the entry API uses an enum for the two cases: it keeps the normal control flow, and the compiler can reason about whether something is conditionally moved or not.
It also requires From<&'a Q>, which was one of the downsides of hashbrown's entry_ref, and K: Clone, V: Clone, which should not be necessary.
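For that Vec example, the enum shape looks something like this (a sketch with Vec<String> values; push_or_init is a hypothetical helper):
use std::collections::hash_map::Entry;
use std::collections::HashMap;

fn push_or_init(map: &mut HashMap<String, Vec<String>>, key: &str, item: String) {
    // `item` is moved in both arms, but only one arm ever runs, so the
    // compiler accepts the conditional move. (The key clone is beside the
    // point here.)
    match map.entry(key.to_owned()) {
        Entry::Occupied(mut o) => o.get_mut().push(item),
        Entry::Vacant(v) => {
            v.insert(vec![item]);
        }
    }
}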
The point is that once you visited a key, all subsequent accesses to that key will not hit the slow path, and its cost will be amortized over all accesses to the key.
If you are updating every key, then the relative cost is only significantly higher if you only perform one pass over all keys – in which case it shouldn't matter anyway (and this is also not a typical/realistic scenario). If you are making many passes over all keys, then the more you iterate, the less expensive the operation becomes compared to the ideal.
I think it could be more "clean" to have the upsert API insert an "empty" value for new keys and then always run the update function, like adding numbers starting from 0, or pushing to a vec starting from an empty vec. It would also solve the "cannot move into both closures" problem. But there probably are cases where you would want to put in a "special" value on creation, meaning that both variants would need to exist to satisfy all users, which would just be confusing for everyone, so there are no "easy" solutions.
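With the current entry API that shape is roughly the following (a sketch; tally and append are hypothetical helpers, the "empty" value is the type's Default, and the key is still cloned):
use std::collections::HashMap;

fn tally(counts: &mut HashMap<String, i32>, key: &str) {
    // Insert the "empty" value 0 for a new key, then always apply the update.
    *counts.entry(key.to_owned()).or_insert(0) += 1;
}

fn append(lists: &mut HashMap<String, Vec<i32>>, key: &str, elem: i32) {
    // Insert an empty Vec for a new key, then always push.
    lists.entry(key.to_owned()).or_default().push(elem);
}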
Looking at this from another angle, it looks like an XY problem. If this code is in a hot path and you care about performance, you probably don't want to allocate a new hashmap every time. Especially if the keys are going to be mostly the same, you could be better off setting every value to 0 instead of allocating a new hashmap. You would probably need to make a new one from time to time to get rid of old keys, but it's worth testing whether it would be faster in your case.
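A sketch of that reuse (reset_counts is a hypothetical helper): keep one map alive across passes and reset the values in place, so repeated keys mostly take the cheap "already present" path.
use std::collections::HashMap;

fn reset_counts(counts: &mut HashMap<String, i32>) {
    // Zero the existing entries instead of allocating a fresh map; the keys
    // (and their String allocations) are kept for the next pass.
    for v in counts.values_mut() {
        *v = 0;
    }
}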
So, what is the point of calculating the hash of the name twice?
After you have calculated the hash, why do you traverse the items in the buckets twice? Depending on the implementation, just reading the bucket locations can mean extremely costly random memory jumps.
Now, once you have the contents of the buckets in cache, why do you need to compare the name with whatever is in the buckets (potentially many times, in a hash-collision situation) twice? And by the way, if you have strings, this is one more indirection (== one more random memory jump), because a string is essentially a char *.
Like I said above, this is likely much worse than the 1st option.
In the 1st option, you do pretty much the same as above ^^^^^^, but only once, plus one unconditional clone of the key.
Like @paramagnetic has said, it won't be "much worse" because you only perform duplicate work on insertions, i.e., the first time you've seen an item. All subsequent accesses will only perform a single lookup.
P.S.: I reckon this should be better than always cloning a key needlessly.