Writing and reading large (~300GiB) HashMaps

Hi all!
I'm constructing a HashMap<u64, u32> for use as an index ; this index gets rather large, taking up around 300GiB of memory. At the moment, I'm (de-)serializing it with serde bincode, which works fine, but especially deserialization takes a lot of time. I think (please correct me if I'm wrong!) that this is due to the fact that deserialization operates by inserting one entry at a time into a new HashMap (with a new Hasher), i.e.: hashes need to be re-calculated and the HashMap probably needs to be resized a lot.

I've been thinking that, since I'm only doing lookups after deserialization, I don't really need to store the keys, only their hashes along with the hashfunction(s), as long as there are no hash collisions.

So my question is: What's the best way to store such an Index (which usually only needs to be constructed and written once) such that in can be read "quickly" (which usually happens more frequently).


(An alternative approach might be to use abomonation which seems like it might be useful in this case (as long as the index stays on the same machine). However, it does not support HashMaps, yet.)

The serde impl for dynamically-sized structures limits the initial capacity as a first line of defense (e.g. against corrupted serialization formats that would try to reserve astronomical amounts of memory).

If your key-value association is just u64s to u32s, then you can construct a flat file with 8+4 byte entries, read them one by one, and use HashMap::with_capacity(lots). I'm not sure how much it will speed up insertion, though – integer hashing and equality comparison is very fast.

Oh, and while we are at hashing – if and only if your data is trusted, try replacing the hasher of your HashMap to something that doesn't try to prevent collision-based DoS attacks (i.e. it has worse statistical properties), but works a lot faster.

As a last resort, you can just store your key-value pairs in a flat vector, sort them by key (and preferably pre-sort them before writing the index to a file), and use binary search for index lookup. This makes using the index O(log n) instead of O(1), but it is about as fast as you can get with deserialization.

Actually, thinking of it, you may be able to use a memory-mapped file, for on-demand reading and for saving precious RAM, hopefully also improving cache locality.

6 Likes

I will give this a try! I will also benchmark using different hashers for this step (but not during construction, see below).

I've actually tried the following hashers for the initial HashMap construction in the past: xx (twox), metro and FNV. If I recall correctly, none of them were faster than the default (which is SIP1-3 I think?), which has probably to do with the fact that I construct the HashMap via a map-reduce approach based on rayon, where smaller HashMaps get combined to increasingly larger ones.

This might be reasonably fast (currently, the time spent deserializing is about 5 times as much as the time spent for lookups, the trade-off might be worth).

I haven't used memory-mapped files in rust, yet, so I guess I'll learn something new :slight_smile:

I'll report back with my findings, thanks for your input!

Seconding memory mapped files or similar -- you're at or past the point where you need to start considering specialized representations for your data. Or at least slightly more specialized than HashMap.

"A database table with an index" is one of the options in that vein; databases support bulk loading and typically have good crash protection sorted out already, without having to parse the entire file!

1 Like

You can speed up the lookup with a shorter list:

Memory is chunked into pages (4kB, 2MB, depends on the OS). Loading a page can be expensive, so it helps to not load memory from a page unless it is actually needed. For that you can store the last key of each chunk and the index of that chunk in a smaller list: