Recommendations for reducing memory footprint of numerous mostly-small HashMaps?

I have a large number of HashMaps (and also HashSets), most of which have very few elements. Occasionally they may have many, which is why I care about scaling, but my memory use is looking poor, because their minimum size is 32 times the memory I need. I'm wondering if it might be efficient to dyamically switch between a Vec of tuples for small numbers (the very common case). It seems a lot of work, but my memory use also seems a bit excessive. It's hard to imagine that this hasn't been tried. Is there something out there I could be using to improve my memory footprint?

For clarity: in my current (small) test case, I've got a couple of thousand HashSetss with one element each, and a thousand HashMaps with two elements. These take up 3 MB between them.

A related question: many of my HashSets are sets of integers (wrapped in a Copy/Clone struct), and it'd be lovely to not have the overhead (which it looks like HashSet has?) of storing pointers to heap-allocated integers. In general, it seems that the optimal collections for small Copy types would be different than for large non-Copy types. Any suggestions?

3 Likes

You could create an enum with variants that provide storage. One variant would be the normal HashMap/Set, and another a "small" version (possibly a fixed size array given what you're describing). Once the small storage is exhausted, you'd dynamically transfer over to the other variant. Don't know if something like that exists in some crate already.

@vitalyd that's what I was thinking of, but didn't want to go to the effort of creating and benchmarking such an approach if it was already out there (I'm searching crates.io now). Or if it had been attempted and found ineffective.

Yeah, maybe someone will know - sorry for not being helpful. Basically, you're looking for an analog to the smallvec crate.

I'm looking at smallvec, and am wondering whether there is a strong reason for it to deal with raw pointers itself rather than just holding a Vec in one of the enum variants? I can see that it enables the len to be determined without resolving the variant, but that seems like a pretty small gain for the cost of duplicating unsafe code...

I'm currently exploring developing an optimized set type:

https://github.com/droundy/david-set

It's far from comprehensive, but so far beats or statistically ties std::collections::HashSet on all the benchmarks I've written, so pretty good so far. In fact, in my use case (fac), it reduces memory use from about 47 MB for my moderately sized test case to about 33 MB, and cuts the runtime from 27 s to under 23 s. And that is without replacing all the HashSets, and leaving all the HashMaps. So I call that pretty good.

2 Likes

You can define a data structure that's an enum that uses a VecMap ( https://crates.io/crates/vec_map ) for small associative arrays, and switches to a HashMap for larger ones.

My guess is it's a performance and size optimization. Perhaps also to allow no_std.

1 Like

@leonardo looks like vec_map assumes small integers and requires storage proportional to the largest key (rather than number of keys), so that isn't a great option for me, as the largest key is proportional to the number of hashmaps stored, which can get pretty large.
But something similar to your suggestion should work fine.

Right, I meant: https://crates.io/crates/flat_map

1 Like

Yes, an enum holding a flat_map and a HashMap would make sense. Right now I'm still working on the sets, but will definitely consider that when I get back to maps.

An alternative solution is to make std library associative arrays both memory-compact and efficient. This is how modern Python does it, explained by the good Hettinger: Raymond Hettinger Modern Python Dictionaries A confluence of a dozen great ideas PyCon 2017 - YouTube

1 Like

And in rust, by bluss, as https://github.com/bluss/ordermap

2 Likes

I've even used Ordermap in past, but I didn't remember it was based on the new compact Python dicts...