Fastest way to save and load a HashMap

I have a large HashMap. I tried serde, but it was too slow, so I looked around and found Abomonation. I implemented a much faster way to save and load the map:

My Abomonation impl for HashMap
use std::collections::HashMap;
use std::hash::{BuildHasher, Hash};
use std::io::{Result as IOResult, Write};
use std::mem;

use abomonation::Abomonation;

impl<K, V, H> Abomonation for HashMap<K, V, H>
where
    K: Abomonation + Clone + Eq + Hash,
    V: Abomonation + Clone,
    H: BuildHasher + Default,
{
    /// Write any additional information about `&self` beyond its binary representation.
    ///
    /// Most commonly this is owned data on the other end of pointers in `&self`. The return value
    /// reports any failures in writing to `write`.
    #[inline]
    unsafe fn entomb<W: Write>(&self, write: &mut W) -> IOResult<()> {
        let (keys, vals): (Vec<&K>, Vec<&V>) = self.iter().unzip();

        // Keys: first the inline bytes of every key, then the data each key owns.
        for key in keys.iter() {
            write.write_all(std::slice::from_raw_parts(
                (*key as *const K) as *const u8,
                mem::size_of::<K>(),
            ))?;
        }
        for element in keys.into_iter() {
            element.entomb(write)?;
        }

        // Values: same layout, inline bytes followed by owned data.
        for val in vals.iter() {
            write.write_all(std::slice::from_raw_parts(
                (*val as *const V) as *const u8,
                mem::size_of::<V>(),
            ))?;
        }
        for element in vals.into_iter() {
            element.entomb(write)?;
        }

        Ok(())
    }

    /// Recover any information for `&mut self` not evident from its binary representation.
    ///
    /// Most commonly this populates pointers with valid references into `bytes`.
    #[inline]
    unsafe fn exhume<'a, 'b>(&'a mut self, bytes: &'b mut [u8]) -> Option<&'b mut [u8]> {
        // `self` is a bitwise copy of the original map here, so `self.len()` is
        // readable even though the map's internal pointers are not valid.
        let rest = bytes;

        // Keys: split off the inline bytes, then exhume each key's owned data.
        let (mine, mut rest) = rest.split_at_mut(mem::size_of::<K>() * self.len());
        let keys_slice = std::slice::from_raw_parts_mut(mine.as_mut_ptr() as *mut K, self.len());

        for element in keys_slice.iter_mut() {
            let temp = rest; // move `rest` into a temporary so it can be reassigned
            rest = element.exhume(temp)?;
        }

        // Values: same scheme.
        let (mine, mut rest) = rest.split_at_mut(mem::size_of::<V>() * self.len());
        let vals_slice = std::slice::from_raw_parts_mut(mine.as_mut_ptr() as *mut V, self.len());

        for element in vals_slice.iter_mut() {
            let temp = rest;
            rest = element.exhume(temp)?;
        }

        let len = keys_slice.len();

        // Rebuild a live HashMap from the exhumed keys and values, then overwrite
        // the bitwise copy without dropping it.
        let mut hash_map: HashMap<K, V, H> = HashMap::with_capacity_and_hasher(len, H::default());

        for (key, value) in keys_slice.into_iter().zip(vals_slice.into_iter()) {
            hash_map.insert(key.clone(), value.clone());
        }

        std::ptr::write(self as *mut Self, hash_map);
        Some(rest)
    }

    /// Reports the number of further bytes required to entomb `self`.
    #[inline]
    fn extent(&self) -> usize {
        let mut sum = 0;
        sum += mem::size_of::<K>() * self.len();
        sum += mem::size_of::<V>() * self.len();

        for (k, v) in self.iter() {
            sum += k.extent();
            sum += v.extent();
        }
        sum
    }
}
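
For reference, here is roughly how I drive the impl. A minimal sketch: encode/decode are the abomonation crate's top-level functions, and RegretMap is a simplified stand-in for my real key/value types.

use std::collections::HashMap;
use std::fs;

use abomonation::{decode, encode};

// Simplified stand-in for the real key/value types.
type RegretMap = HashMap<u64, Vec<f64>>;

fn save(map: &RegretMap, path: &str) -> std::io::Result<()> {
    let mut bytes = Vec::new();
    // Unsafe because we must trust every Abomonation impl involved.
    unsafe { encode(map, &mut bytes) }?;
    fs::write(path, &bytes)
}

fn load(path: &str) -> Option<RegretMap> {
    let mut bytes = fs::read(path).ok()?;
    // `decode` exhumes in place and returns a reference into `bytes`,
    // so we clone to get an owned map (one of the clones mentioned below).
    unsafe { decode::<RegretMap>(&mut bytes) }.map(|(map, _rest)| map.clone())
}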

But the problem with this is that I have to clone the HashMap when I save it and clone it again when I load it, which means twice the RAM is used. My implementation could also probably be improved to avoid allocating the intermediate vectors.

Is there a way to avoid all of this: allocate the HashMap and its contents in a preallocated region, dump that memory to a file, and later load it straight back into RAM?

Unfortunately there are references in the keys and values (Vecs, specifically). It would probably be much easier if there were only owned values with no indirection.

You could avoid the Vec if you assume that the iteration order of the HashMap stays consistent as long as the HashMap is unchanged. I'm pretty sure this is true of the current implementation, but you should probably get this documented before relying on it:

unsafe fn entomb<W: Write>(&self, write: &mut W) -> IOResult<()> {
    for key in self.keys() {
        write.write_all(std::slice::from_raw_parts(
            (key as *const K) as *const u8,
            mem::size_of::<K>(),
        ))?;
    }
    for element in self.keys() {
        element.entomb(write)?;
    }

    for val in self.values() {
        write.write_all(std::slice::from_raw_parts(
            (val as *const V) as *const u8,
            mem::size_of::<V>(),
        ))?;
    }
    for element in self.values() {
        element.entomb(write)?;
    }
    Ok(())
}

I think you can avoid the clones in exhume by using ptr::read:

-        for (key, value) in keys_slice.into_iter().zip(vals_slice.into_iter()) {
-            hash_map.insert(key.clone(), value.clone());
+        for (key, value) in keys_slice.iter().zip(vals_slice) {
+            hash_map.insert(std::ptr::read(key), std::ptr::read(value));
         }

Note: Abomonation is scary and I am not 100% confident in the code above. :slight_smile:

1 Like

I'll test it out, but it looks like it should work. Although I am more interested in finding a way to do this without Abomonation; I think it is scary too :slight_smile:

Edit: All seems to be well. I will keep using that until I get something better to work. Thanks!

Have you tried serde with bincode? Serialization speed mostly depends on the back-end you use for Serde. Also make sure you use buffered I/O.

If you want it really fast, I think the main concern should be to avoid rehashing items. Maybe some low-level methods on hashbrown can do that?
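
For example, a minimal sketch of that combination, assuming bincode 1.x and a made-up stand-in for the map type:

use std::collections::HashMap;
use std::error::Error;
use std::fs::File;
use std::io::{BufReader, BufWriter};

// Made-up stand-in for the real map type.
type RegretMap = HashMap<u64, Vec<f64>>;

fn save(map: &RegretMap, path: &str) -> Result<(), Box<dyn Error>> {
    // BufWriter batches bincode's many small writes into few syscalls.
    let writer = BufWriter::new(File::create(path)?);
    bincode::serialize_into(writer, map)?;
    Ok(())
}

fn load(path: &str) -> Result<RegretMap, Box<dyn Error>> {
    let reader = BufReader::new(File::open(path)?);
    Ok(bincode::deserialize_from(reader)?)
}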

2 Likes

Yeah, I used bincode, but I am not sure I used buffered I/O. I am working on a serde version so I can compare more easily. If I remember correctly, I encoded it into a byte slice and then saved that. Isn't the buffered I/O to do with the saving part?

Note: I just looked at bincode again and I will focus on it first before working more on Abomonation.

If you're converting from/to a Vec, that's buffered enough. Otherwise, if you're writing to a File via io::Read/Write, you need BufReader/BufWriter to avoid a syscall on every item.

1 Like

I use:

let mut file = File::create(file_name).unwrap();
file.write_all(&bytes).unwrap();

But the saving isn't the problem. Encoding and decoding is.

How big is your hashmap, and what are you using it for? Are you sure you shouldn't be using a proper database instead?

I am using it for counterfactual regret minimisation, specifically to store the regrets. Each key corresponds to a state in a game, and each value to the regrets of the possible actions at that state.

There are many, many states; the total size is tens of gigabytes.

I have decided not to use a proper database because while the counterfactual regret minimisation algorithm runs there are loads of reads and writes, and how fast I can do them matters a lot. To be specific, I actually use DashMap, but what I can make work with HashMap I can also do with that.

Tens of gigabytes definitely does sound like a job for a database, not a hashmap. You could try SQLite for a fast, minimalist, single-file database.

I don't think that is anywhere near as fast as I need it to be; I'll try to find benchmarks. I essentially just need a key-value store.

Another thing you could try is a flat memory-mapped file. This works if you can encode your data in a way that lets you find your keys reasonably easily (e.g. binary search or hashing) in a flat structure. I.e., you could essentially re-implement a hash table that thinks it is working with memory, while in reality the OS pages the necessary subslices in and out and writes them back to the file on your behalf as efficiently as it can.
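
A minimal sketch of that idea, assuming the memmap2 crate, a made-up file name, and fixed-size records (a u64 key plus a fixed number of f64 regrets) kept sorted by key so lookup is a binary search:

use std::fs::File;

use memmap2::Mmap;

const ACTIONS: usize = 4; // assumed fixed number of actions per state
const RECORD: usize = 8 + 8 * ACTIONS; // u64 key + ACTIONS f64 regrets

/// Binary-search a file of sorted fixed-size records for `key`.
fn lookup(mmap: &Mmap, key: u64) -> Option<Vec<f64>> {
    let (mut lo, mut hi) = (0, mmap.len() / RECORD);
    while lo < hi {
        let mid = (lo + hi) / 2;
        let rec = &mmap[mid * RECORD..(mid + 1) * RECORD];
        let k = u64::from_le_bytes(rec[..8].try_into().unwrap());
        if k == key {
            // Only the pages actually touched get faulted in by the OS.
            let regrets = rec[8..]
                .chunks_exact(8)
                .map(|c| f64::from_le_bytes(c.try_into().unwrap()))
                .collect();
            return Some(regrets);
        } else if k < key {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    None
}

fn main() -> std::io::Result<()> {
    let file = File::open("regrets.bin")?;
    let mmap = unsafe { Mmap::map(&file)? };
    println!("{:?}", lookup(&mmap, 42));
    Ok(())
}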

Note though that SQLite has a pretty smart caching mechanism, so I'd at least give it a try. At some point you have to pay the serialization/deserialization price, and if you are using Serde you pay that whole price upfront, even if you only end up dynamically using 1% of your keys. Using a database can mitigate this problem because it can avoid reading most of the unnecessary data.
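
If you do try SQLite, a minimal key-value sketch with rusqlite could look like this (the table, file, and key names are made up; the value is a blob of little-endian f64 bytes):

use rusqlite::{params, Connection};

fn main() -> rusqlite::Result<()> {
    let conn = Connection::open("regrets.sqlite")?;
    conn.execute(
        "CREATE TABLE IF NOT EXISTS regrets (state BLOB PRIMARY KEY, vals BLOB)",
        params![],
    )?;

    // Store one entry: the value is the raw bytes of the f64 regrets.
    let vals: Vec<u8> = [1.5f64, -0.25].iter().flat_map(|f| f.to_le_bytes()).collect();
    conn.execute(
        "INSERT OR REPLACE INTO regrets (state, vals) VALUES (?1, ?2)",
        params![&b"state-0"[..], vals],
    )?;

    // Only the rows you query are ever read back in.
    let blob: Vec<u8> = conn.query_row(
        "SELECT vals FROM regrets WHERE state = ?1",
        params![&b"state-0"[..]],
        |row| row.get(0),
    )?;
    println!("{} bytes", blob.len());
    Ok(())
}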

1 Like

A memory-mapped file was my first idea. I will implement that tomorrow. I am going to compare the ways to solve this that were mentioned here. I will try rusqlite too just in case :slight_smile:

Busy day tomorrow :slight_smile:

The thing is, there isn't really any data in there that couldn't simply be copied out of memory to a file and then loaded back in, so spending much on serialization/deserialization doesn't make sense. The only problem I see is the Vecs in the values. Maybe I'll do something similar to how Abomonation does it.

1 Like

What kind of values do you have in your hashmap?

I've done some very limited benchmarking while trying to choose a key-value store where the values were JSON objects. For my use case, lmdb was fastest for persistence, and Rust MessagePack value (rmpv) was fastest for encoding/decoding.

sled was equivalent to lmdb in speed. SQLite was about 3x slower, but provides many more powerful features.

Bincode is faster than MessagePack, but only if the data structure is known at compile time: a struct or enum (not a dynamic type like JSON).

If the values have dynamic members or sizes, then another option to make deserialization faster is lazy/shallow decoding, like Rust MessagePack (rmp, not rmpv), which parses members upon access, not upon deserialization.

If your struct has static members with sizes known at compile time (no String, Vec, references, etc.), it should be possible to avoid serialization and directly mmap the data, as sketched below.
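
To illustrate that, a sketch of a zero-copy typed view over a memory-mapped file, assuming the memmap2 crate, a made-up file name, and a #[repr(C)] record type with a made-up fixed action count:

use std::fs::File;

use memmap2::Mmap;

#[repr(C)]
#[derive(Clone, Copy, Debug)]
struct Record {
    key: u64,
    regrets: [f64; 4], // made-up fixed action count
}

fn main() -> std::io::Result<()> {
    let file = File::open("regrets.bin")?;
    let mmap = unsafe { Mmap::map(&file)? };
    // Safety: Record is plain old data (no padding, no pointers), and a
    // page-aligned mapping satisfies its 8-byte alignment, so head and
    // tail should be empty.
    let (head, records, tail) = unsafe { mmap.align_to::<Record>() };
    assert!(head.is_empty() && tail.is_empty());
    for r in records.iter().take(3) {
        println!("{}: {:?}", r.key, r.regrets);
    }
    Ok(())
}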

However, if your data is that statically shaped, it's also worth considering Parquet, because it solves a bunch of related issues: space efficiency, interoperability with other tools, exchanging data, etc.

Anyway, very interesting question. Depends a lot on the structure of the values being stored.

The values are Vec<f64>, so unfortunately it isn't that simple to solve :slight_smile: Thanks for the suggestions. I will keep the thread updated on my progress.

lmdb-zero values are [u8] byte slices, but there is no guarantee of 64-bit alignment.

sled does handle alignment of values, at least to 8-byte boundaries. It may be possible to convert an aligned slice to a Vec with no clone or allocation.
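
As a sketch of that idea: an 8-byte-aligned value can at least be viewed as &[f64] without copying, via slice::align_to (an owned Vec<f64> would still need an allocation). The tree name and key below are made up:

/// View an 8-byte-aligned byte buffer as f64s without copying.
/// Returns None if the buffer is misaligned or has a trailing fragment.
fn as_f64s(bytes: &[u8]) -> Option<&[f64]> {
    // Safety: every bit pattern is a valid f64, and `align_to` only puts
    // correctly aligned elements in the middle slice.
    let (head, floats, tail) = unsafe { bytes.align_to::<f64>() };
    (head.is_empty() && tail.is_empty()).then(|| floats)
}

fn main() -> sled::Result<()> {
    let db = sled::open("regrets.sled")?;
    db.insert(b"state-0", &1.5f64.to_le_bytes()[..])?;
    if let Some(val) = db.get(b"state-0")? {
        println!("{:?}", as_f64s(&val));
    }
    Ok(())
}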

Quick update.. I will start implementing tomorrow because today I have some family stuff. I really appreciate all the suggestions and I will show the speed of encoding/decoding running because people might find it useful.

1 Like

@rich-murphey, @kornel
Got serde with bincode running. It takes around 4 seconds to save and 6 to load a 700 MB example run. Unfortunately this is not good enough for me; with 32 GB of data I imagine it would take more than a minute. It wasn't too hard to implement, even though I use loads of generics, associated types and such, but I had to skip some foreign structs so I could do a quick test run.

The same run with my naive Abomonation implementation, plus the extra clones it needs, took about 11 seconds to save and 5 to load.

I am going to look into lmdb, lmdb-zero, rmp, parquet, sled and memory-mapped files tomorrow. What I fear is that those won't have the fast concurrency that is a must for me.

If none of these seem viable, I will go back to Abomonation and redesign it so it consumes the bytes instead of just borrowing them; that way I will no longer need some of the big clones. A redesign would also let me save in chunks instead of allocating one big Vec to hold all the bytes (essentially doubling the memory).

Top Speeds:

  1. Bincode
  2. My Abomonation impl

I found more info on lmdb. The link below suggests that, if the values require 8-byte alignment, then the sizes of the keys and values must each be a multiple of 8 bytes. They say that padding fixed their alignment issues.

https://openldap.org/lists/openldap-technical/201404/msg00143.html

1 Like

That's very odd. I'm using serde + MessagePack dumps for 500 MB to 1 GB files, and they're instant (disk-I/O limited).

bincode should work at nearly the speed of memcpy; it was written for IPC in a browser. 120 MB/s is very low for deserialization. Even serde_json can parse faster than 500 MB/s.

Are your structs exceptionally complex? Are you measuring in debug mode?

3 Likes