Direct (non-hashing) Hasher for HashMaps?

Since you can use a custom Hasher with a HashMap, is there a non-hashing version for basic, fixed-size key types (e.g. usize, or a tuple of two usize values, or even enum variants?). I see that the hasher trait write methods take various size primitives or &[u8] and so need to work with any length input (bah strings!). Thus a HashMap<usize, T> bounds the keys' range, but only outside of the knowledge/control of the hasher, so that's probably a fun disconnect...

One workaround, since I'm using monotonically increasing integers as keys, would be a HashMap based on a Vec that never renumbers its indices if elements are removed. So just a Vec<Option<T>> with the remove method setting the appropriate item to None. But then that limits it to only work with usize keys, and will lead to bloat if common use is to stay at a small number of items by regular paired add/delete ops.

It seems that at least somebody has considered this, as there is the vec_map crate. But, fortunately, a plain HashMap<usize, T> will "just work", so this is just an excuse to explore the intricacies of Rust. Nonetheless, any help or discussion is welcomed. Thanks!

Sounds like you're looking for the nohash hasher

1 Like

Thank you! How did I not find that??

You might also be interested in tinyset which has optimized sets for those types... Although I now notice that you're looking at a map, not a set.

2 Likes

Thanks! Tinyset may prove useful, as I also expect to be comparing sets of keys. Though its limit of 22 usize values to stay stack allocated feels restrictive. But I can only assume that even larger sets would still offer an improvement over the basic HashSet as long as it avoids hashing altogether as well.

It makes sense that nohash_hasher caps out at u64 (as that's exactly what a hash produces), but that leaves me still wondering about keys like (usize, usize) or u128/bigints. But this is great!

Don't arbitrary-length strings have the same problem? Hash functions by definition can loose information, they can (and do) collide sometimes. Hash tables and good quality hash functions are probabilistic structures that only make sure that such collisions are rare and cheap on average. They don't, can't, eliminate collisions, so if you have a lot of long keys, they will eventually occur. In the degenerate case, return 0 would be a "correct" implementation of a hash function, and so is return self.0 for a (u64, u64), etc.

Yeah, thus why keys must implement equality checks. And the probabilistic intent of hashing definitely makes sense for arbitrarily-sized, non-sequential objects; 16 quintillion possible unique strings is a lot.

However, my logical 'minimal-hash' hasher for keys of (usize, usize) which come from a counter starting at 0 would be to drop each to (u32, u32). That would at least remain unique for the first 16 quintillion combinations of (4 billion x 4 billion). If they were truly arbitrary I guess I'd do a bitwise XOR on each pair of bits to compress two usize down to one and hopefully minimize collisions... Interesting use case. Actually, XOR on (a, b) of bit a1 vs b64, a2 vs b63, a3 vs b62, etc. might do quite nicely for both strongly skewed and arbitrary memory layouts...

I feel like 22 usize values in 64 bits of store age is considerably more than you could reasonably hope for, although admittedly if you knew they'd all be very small integers you could expect more.

1 Like

I'm sorry, after a bit of thinking and reading the tinyset code, I realized that the documentation you were reading was in error. I've published a new version with a documentation fix. You can store no more than seven numbers in a set using just the storage size of a pointer on the stack (64 or 32 bits). They just have to be moderately small elements (numerically), and not too widely spaced apart. Larger elements or elements that are more widely spaced place a lower limit on the number of elements that can be stored without a heap allocation.

Once tinyset resorts to heap allocation, it remains far more compact than a HashSet, particularly for small or densely spaced elements. At best it takes one bit per element, and at worst it takes 64 bits per element (unless you're using a SetUsize on a 32-bit computer, or a SetU32 in which case the worst is 32 bits per element.

2 Likes

Thanks for the update! My statement about the size being 'restrictive' was misleading - I meant that my use case would grow in number of indexes/keys quickly, so it wasn't an optimization I could aim to use.

But, I'm definitely interested to learn more about stack limits. I thought L1 cache sizes being on the order of dozens to hundreds of kB meant that whole structs could happily be loaded on the stack up to sizes of about 4kB (source: Rust in Action). Is it safe to assume that the limitations applicable below that are size of a single stack frame, data and address bus widths, etc. and basically bound what can be accomplished in a single core cycle? The number you gave does match up with what I understand as the line between where groups of primitive values (say[f64; 7]) are faster to copy than to reference, so perhaps the picture I have of low-level limitations is becoming clearer.

Yes, reasonably large structs can fit in a cache line. This is why it's often worth storing data directly in a type rather than on the heap. A tinyset is optimized to take as little memory as possible for small sets, and then use the heap efficiently for larger sets. If you're using a large set with small integers in it, then tinyset will probably be far more efficient than a HashSet would be, because it uses bit tweaking to store values as efficiently as I was able to figure out.

The numbers I give actually just relate to how tinyset packs information into a single 64 bit number, and isn't related to RAM at all. Since each usize occupies 64 bits, you might only expect to fit one value into a single 64 bit integer. However, if the values are small, they can be represented with fewer bits, and this is what tinyset does. It uses pointer tagging to store either a pointer to a data structure on the heap or a set of bits holding a set represented directly. There is a fair amount of bit twiddling required, but avoiding separate allocation on the heap is likely worthwhile, because it means your entire set can be read in a single cache line that you presumably have already read.