Is there some small constant, say 10, where iterating a Hashmap<u64, T> of size N takes at most 10x the time it takes to iterate a vec of size N ?

(Trying to understand if hashmap storage is large "chunks of vecs", or "lots of linked lists").

Is there some small constant, say 10, where iterating a Hashmap<u64, T> of size N takes at most 10x the time it takes to iterate a vec of size N ?

(Trying to understand if hashmap storage is large "chunks of vecs", or "lots of linked lists").

The `HashMap`

storage is a single continuous block of memory with no chunking, akin to a single `Vec`

; so e.g. growing the map when capacity is reached will move all its data. Ignoring the hashing part (and thus necessary extra information), this means `HashMap<K, V>`

â€™s layout in memory would look a bit like a `Box<[Option<(K, V)>]>`

.

*Edit:* Iâ€™m in the process of watching the CppCon talk about it at the moment, by the way very enjoyable, and it looks like it *doesnâ€™t* work like an `Option`

but instead stores the metadata regarding occupancy separately. So turn the `[Option<(K, V)>; N]`

on the heap into something like a `([Metadata; N], [MaybeUninit<(K, V)>; N])`

instead. But thatâ€™s irrelevant for the more general question of â€śis there some small constant â€¦â€ť.)

The implementation of the standard library hashmap is (currently) via the hashbrown crate which states it

is a Rust port of Google's high-performance SwissTable hash map

further noting

The original C++ version of SwissTable can be found here, and this CppCon talk gives an overview of how the algorithm works.

Some web search also gave me e.g. this blog post outlining some of the very basics how this works (the premise of the post that insertion does two lookups might no longer be true).

The standard library docs further mention â€“ actually as the very first thing â€“ that itâ€™s a hash map â€śimplemented with quadratic probing and SIMD lookupâ€ť, and in particular the term quadratic probing makes it easy to find more information on what kind of data structure and implementation weâ€™re generally speaking of.

Now your question also concerns the question of how efficient an *iterator* over Rustâ€™s `HashMap`

is. I couldnâ€™t quickly find out more about how exactly the iteration of a `HashMap`

works in detail, but presumably itâ€™s a fairly straightforward approach (with some down-sides), as the documentation of various iteration methods contains a remark of the form

In the current implementation, iterating over map takes O(capacity) time instead of O(len) because it internally visits empty buckets too.

In particular, a hashmap where you havenâ€™t removed any elements will grow (AFAICT, as of this writing) using a doubling strategy and keeping the map always up to a maximum of 87.5% full, so youâ€™d be between 43.75% and 87.5% full and the overhead during iteration from hitting empty buckets would presumably only be a bit worse than a factor of 2 in the worst case; other than that, iteration should presumably be able to proceed just as easily as for a `Vec`

by simply incrementing a pointer. (Of course, only actual benchmarks can show the *real* overhead, and Iâ€™m assuming thereâ€™s some further overhead from additional bookkeeping and more memory being traversed from skipping over keys and meta information, or so.)

That being said, your question *can* probably be answered with â€ťyesâ€ť, there is a *small* constant overhead compared to `Vec`

, but *only* if you donâ€™t have a `HashMap`

with a capacity thatâ€™s significantly higher than its size. Unlike for a `HashMap`

, a `Vec`

â€™s additional unused capacity will *not* negatively impact iteration, as all used space is compactly at the beginning of the data structureâ€™s memory, not spread evenly throughout as for a `HashMap`

.

You might run into this kind of thing if you use e.g. methods like HashMap::clear, which *can* be useful for improving performance by avoiding re-allocations if you create new maps repeatedly e.g. in a loop, but if those maps are of significantly different sizes and if you need to iterate through the whole map with an iterator at least once, then the overhead of keeping a disproportionately large capacity might possibly even make performance *worse*.

By the way, one way to solve this problem if it becomes relevant could be to use indexmap instead, which does feature O(n) iteration, as it *literally* contains a vector (essentially a `Vec<(HashValue, K, V)>`

) that can be iterated directly.

7 Likes

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.