HashMap.insert without overwriting?

In my experience with C++, I often need this behaviour: when I try to insert and the element already exists, the old value remains, and I can check which case happened with a flag:

if (table.insert({key, value}).second) {
   // ... some C++ code
}

In Rust I wrote this:

if table.insert(key, value).is_none() {
   // ... Rust code
}

but this code does something different: it overwrites the value if the key already exists.

Alternative way 1:

if !table.contains_key(&key) {
    table.insert(key, value);
    // ... Rust code
}

But this is bad because it performs two table lookups.

Alternative way 2:

if match table.entry(key) {
    std::collections::hash_map::Entry::Vacant(v) => {
        v.insert(value);
        true
    }
    std::collections::hash_map::Entry::Occupied(_) => false,
} {
    // ... Rust code
}

But that is madness: too many lines to emulate a single line of C++.

What is the best way to insert without overwrite?

If you don't need the true/false result, you can use

table.entry(key).or_insert(value);
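For illustration, a minimal sketch (with made-up keys and values) showing that or_insert keeps an existing value instead of overwriting it:

```rust
use std::collections::HashMap;

fn main() {
    let mut table: HashMap<&str, i32> = HashMap::new();
    table.insert("a", 1);

    // Key already present: the old value 1 is kept, 99 is discarded.
    table.entry("a").or_insert(99);
    assert_eq!(table["a"], 1);

    // Key absent: the value is inserted.
    table.entry("b").or_insert(2);
    assert_eq!(table["b"], 2);
}
```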

You're not specifying what "table" is, but if it's a HashMap, would try_insert() suit you?


Good point. If nightly Rust can be used, that's the best option.

They're using a HashMap (in the title).

If you don't want to do anything in the occupied case, you can just write it like this:

if let Entry::Vacant(v) = table.entry(key) {
    v.insert(value);
    // ... followup code
}
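Filled out into a self-contained program (keys and values are made up for the example), the pattern looks like this:

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

fn main() {
    let mut table: HashMap<&str, i32> = HashMap::new();
    table.insert("a", 1);

    // Occupied: the body is skipped and the old value stays.
    if let Entry::Vacant(v) = table.entry("a") {
        v.insert(99);
    }
    assert_eq!(table["a"], 1);

    // Vacant: the value is inserted and the followup code runs.
    if let Entry::Vacant(v) = table.entry("b") {
        v.insert(2);
    }
    assert_eq!(table["b"], 2);
}
```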

I'd simply write a small helper function, like this:

use std::collections::hash_map::Entry::{Occupied, Vacant};
use std::collections::HashMap;
use std::hash::Hash;

pub fn insert_new<K, V>(map: &mut HashMap<K, V>, key: K, value: V) -> bool
where
    K: Eq + Hash,
{
    match map.entry(key) {
        Occupied(_) => false,
        Vacant(entry) => {
            entry.insert(value);
            true
        }
    }
}
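Usage then looks like this (the helper is repeated so the snippet compiles on its own; keys and values are made up):

```rust
use std::collections::hash_map::Entry::{Occupied, Vacant};
use std::collections::HashMap;
use std::hash::Hash;

pub fn insert_new<K, V>(map: &mut HashMap<K, V>, key: K, value: V) -> bool
where
    K: Eq + Hash,
{
    match map.entry(key) {
        Occupied(_) => false,
        Vacant(entry) => {
            entry.insert(value);
            true
        }
    }
}

fn main() {
    let mut map = HashMap::new();
    assert!(insert_new(&mut map, "k", 1)); // fresh key: inserted, returns true
    assert!(!insert_new(&mut map, "k", 2)); // existing key: kept, returns false
    assert_eq!(map["k"], 1);
}
```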

You can make a HashMap of Vecs if you want to see all older versions, or make a custom type with just two fields:

struct MyEntry<T> {
    pub value: T,
    pub second: Option<T>
}

impl<T> MyEntry<T> {
    pub fn update(&mut self, other: T) {
        // mem::replace moves the old value into `second` without
        // leaving `self.value` temporarily uninitialized
        self.second = Some(std::mem::replace(&mut self.value, other));
    }
}

fn main() {
    let mut x: HashMap<i64, MyEntry<i64>> = HashMap::new();
    // ...
    let data_entry = x
        .entry(key)
        .and_modify(|v| v.update(new_value))
        .or_insert_with(|| MyEntry { value: new_value, second: None });
    println!("{:?}", data_entry.second);
}

Or use a Vec if you want to keep all old versions. The current value will be .last() here:


fn main() {
    let mut x: HashMap<i64, Vec<MyDataValue>> = HashMap::new();
    // ...
    let data_entry = x
        .entry(key)
        .and_modify(|v| v.push(new_value))
        .or_insert_with(|| vec![new_value]);
    println!("{:?}", data_entry.last().unwrap());
}

And it might also still do multiple lookups. (I'd have to double-check whether the entry API contains any optimization to reduce the number of lookups; if I recall correctly, its main purpose was to help with some borrow-checking issues that otherwise occur, and it's sometimes simply a convenient API.)

(2 mins later) Okay, I checked: occupied entries have some direct access to the contained element, while vacant entries are just

pub struct VacantEntry<'a, K, V, S = DefaultHashBuilder, A: Allocator = Global> {
    hash: u64,
    key: K,
    table: &'a mut HashMap<K, V, S, A>,
}

so all it saves you is re-calculating the hash. Which, well…, is like “half” of the lookup, I suppose.

The entry API was in fact created in part to remove the requirement of repeated lookup for or_insert, and IIRC the original hashmap implementation did use this to skip a repeated lookup. However, because the insertion algorithm for the current hashmap involves potential reordering of other entries, the full walk must be performed from the starting lookup position. (This process exchanges a small amount of additional work on insertion to keep entries closer to their ideal position and reduce the work done by get overall.)

Somewhat importantly, this means that if the entry API did a full "for insert" lookup immediately, dropping the vacant entry handle without doing an insert could require fixing up invariants (logically, removing the entry again). Rather than make a no-insert entry lookup pay for insert+delete (~2 units), it was chosen to make or-insert pay for get+insert (~1½ units). Oh, and relying on Drop like that is a bad idea (it requires "PPYP" leak amplification in the face of leaks/forget for soundness), and adding such behavior now would be API-breaking (it would change drop timing and impede NLL).

And for historical note, the lifetime limitations that helped birth the entry API are mostly not an issue anymore due to NLL. While temporaries are still live across the problematic region (thus the similar footgun with holding scrutinee locks), NLL allows their loan to expire before it actually drops.


It should be noted that HashMap using open addressing makes it harder to reduce the number of lookups, since searching for the entry corresponding to a given key is a bit different than searching for an empty bucket for its key (while searching for the entry you want to skip tombstones, but while searching for an empty bucket you want to consider them, and you also need to account for potentially having to resize the map).

Edit: and to be clear, the Entry API does have a small optimization in that when inserting on a vacant entry it doesn't check whether the entry is already in the map, so it only needs to do a single scan for a tombstone/empty bucket. In comparison, a normal insert, even though it does a single scan, needs to look for both an insert slot and an occupied entry for the given key.


Thanks for the context!

Oh, that's very sad.
Why can't the Vacant entry hold a pointer to a place for the new element?
I know about map rebuilding, but the entry could hold a special flag for this case.

Because as mentioned, the swiss table implementation uses open addressing, meaning that lookup for get and lookup for insert are different operations. There isn't any extra work being done; the only (and entirely negligible) difference is that the get-search and insert-search are done in two separate passes, instead of potentially being fused into one pass. In fact, I believe hashbrown still does a "double lookup" for normal insert because the get lookup is SIMD accelerated in ways the insert lookup can't. If you even sometimes try to insert with a preexisting key, the "double lookup" has better amortized performance characteristics. (Which, if you're using a hashmap, is your primary metric of concern.)


For get we scan cells in some specific order (with linear probing or quadratic probing or some other scheme), skipping tombstones, and stop when we meet the first "empty" or "valid and equal" cell.
For insert we scan cells in the same order, and stop when we meet the first "tombstone" or "empty" cell.
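As a toy sketch of the difference (a simplified linear-probing model, not hashbrown's actual SIMD-accelerated layout): the get scan skips tombstones and stops at empty, while the insert scan can reuse the first tombstone or empty slot it finds.

```rust
#[derive(Debug)]
enum Slot {
    Empty,
    Tombstone,
    Occupied(i64),
}

// get-style scan: skip tombstones, stop at the first empty or matching cell
fn probe_get(slots: &[Slot], start: usize, key: i64) -> Option<usize> {
    let n = slots.len();
    for i in 0..n {
        let idx = (start + i) % n;
        match &slots[idx] {
            Slot::Empty => return None,  // key is definitely absent
            Slot::Tombstone => continue, // deleted cell: keep scanning
            Slot::Occupied(k) if *k == key => return Some(idx),
            Slot::Occupied(_) => continue,
        }
    }
    None
}

// insert-style scan: the first tombstone or empty cell is usable
fn probe_insert_slot(slots: &[Slot], start: usize) -> Option<usize> {
    let n = slots.len();
    for i in 0..n {
        let idx = (start + i) % n;
        match slots[idx] {
            Slot::Empty | Slot::Tombstone => return Some(idx),
            Slot::Occupied(_) => continue,
        }
    }
    None // table is full
}

fn main() {
    // slot 0: tombstone, slot 1: key 7, slot 2: empty
    let slots = vec![Slot::Tombstone, Slot::Occupied(7), Slot::Empty];
    assert_eq!(probe_get(&slots, 0, 7), Some(1));       // tombstone was skipped
    assert_eq!(probe_get(&slots, 0, 8), None);          // stopped at empty
    assert_eq!(probe_insert_slot(&slots, 0), Some(0));  // tombstone reused
}
```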

Yes, these operations are different. But why can't we save the first tombstone pointer while doing the get scan? Is restarting the scan really faster than saving one extra field?

I went and looked at the implementation. The reason appears ultimately to be such that finding an occupied entry does not cause table growth when at the load factor.

entry is roughly

if let Some(bucket) = map.table.find(hash, eq) {
    Entry::Occupied(..)
} else {
    map.reserve(1);
    Entry::Vacant(..)
}

The reserve call (and thus potential grow) must happen as a part of entry, as VacantEntry::insert does not have access to the impl BuildHasher. The implementation choice might make more sense with hashbrown's entry API which does fully defer the grow until VacantEntry::insert; note that attempting to find an insert slot when there isn't capacity will cause an infinite loop!

Compare find, find_insert_slot,
find_or_find_insert_slot. Also relevant: fix_insert_slot, which is potentially yet another table scan and a portion of the additional lookup cost to insert I've alluded to.

It could make sense for the rustc_entry path to do find_or_find_insert_slot eagerly, but there's also merit to keeping the two entry paths roughly parallel. The full context would probably only be known by Amanieu. (Really, though, the cost difference of splitting insert's double lookup into two passes isn't significant, and if it is, the problematic part is your hash, not the lookup.)

I recall the potential for table modification to reposition buckets to keep the average distance from ideal lower overall, but I didn't see any evidence of this while scanning the implementation just now; hashbrown may have stopped doing this, or I might've crossed some wires.