Iterating hashmap vs iterating vec

Is there some small constant, say 10, where iterating a Hashmap<u64, T> of size N takes at most 10x the time it takes to iterate a vec of size N ?

(Trying to understand if hashmap storage is large "chunks of vecs", or "lots of linked lists").

The HashMap storage is a single continuous block of memory with no chunking, akin to a single Vec; so e.g. growing the map when capacity is reached will move all its data. Ignoring the hashing part (and thus necessary extra information), this means HashMap<K, V>’s layout in memory would look a bit like a Box<[Option<(K, V)>]>.

Edit: I’m in the process of watching the CppCon talk about it at the moment, by the way very enjoyable, and it looks like it doesn’t work like an Option but instead stores the metadata regarding occupancy separately. So turn the [Option<(K, V)>; N] on the heap into something like a ([Metadata; N], [MaybeUninit<(K, V)>; N]) instead. But that’s irrelevant for the more general question of “is there some small constant …”.)

The implementation of the standard library hashmap is (currently) via the hashbrown crate which states it

is a Rust port of Google's high-performance SwissTable hash map

further noting

The original C++ version of SwissTable can be found here, and this CppCon talk gives an overview of how the algorithm works.

Some web search also gave me e.g. this blog post outlining some of the very basics how this works (the premise of the post that insertion does two lookups might no longer be true).

The standard library docs further mention – actually as the very first thing – that it’s a hash map “implemented with quadratic probing and SIMD lookup”, and in particular the term quadratic probing makes it easy to find more information on what kind of data structure and implementation we’re generally speaking of.


Now your question also concerns the question of how efficient an iterator over Rust’s HashMap is. I couldn’t quickly find out more about how exactly the iteration of a HashMap works in detail, but presumably it’s a fairly straightforward approach (with some down-sides), as the documentation of various iteration methods contains a remark of the form

In the current implementation, iterating over map takes O(capacity) time instead of O(len) because it internally visits empty buckets too.

In particular, a hashmap where you haven’t removed any elements will grow (AFAICT, as of this writing) using a doubling strategy and keeping the map always up to a maximum of 87.5% full, so you’d be between 43.75% and 87.5% full and the overhead during iteration from hitting empty buckets would presumably only be a bit worse than a factor of 2 in the worst case; other than that, iteration should presumably be able to proceed just as easily as for a Vec by simply incrementing a pointer. (Of course, only actual benchmarks can show the real overhead, and I’m assuming there’s some further overhead from additional bookkeeping and more memory being traversed from skipping over keys and meta information, or so.)

That being said, your question can probably be answered with ”yes”, there is a small constant overhead compared to Vec, but only if you don’t have a HashMap with a capacity that’s significantly higher than its size. Unlike for a HashMap, a Vec’s additional unused capacity will not negatively impact iteration, as all used space is compactly at the beginning of the data structure’s memory, not spread evenly throughout as for a HashMap.

You might run into this kind of thing if you use e.g. methods like HashMap::clear, which can be useful for improving performance by avoiding re-allocations if you create new maps repeatedly e.g. in a loop, but if those maps are of significantly different sizes and if you need to iterate through the whole map with an iterator at least once, then the overhead of keeping a disproportionately large capacity might possibly even make performance worse.

By the way, one way to solve this problem if it becomes relevant could be to use indexmap instead, which does feature O(n) iteration, as it literally contains a vector (essentially a Vec<(HashValue, K, V)>) that can be iterated directly.

7 Likes

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.