Hash Table based on SHA-256


#1

Is there a way to replace the hash function used in std::collections::HashMap with another hash function?

In particular, I’m interested in a crate that implements a hash table using SHA-256 (from the SHA-2 family). HashDoS attacks are of no concern in this scenario.


#2

See https://doc.rust-lang.org/std/collections/struct.HashMap.html#method.with_hasher (or https://doc.rust-lang.org/std/collections/struct.HashMap.html#method.with_capacity_and_hasher); in general, the hasher is the S type parameter of HashMap.

There are existing crates, like fnv and hashers, that you can look at for examples.
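
For illustration, a minimal sketch of plugging a custom hasher in via with_hasher, here using std’s DefaultHasher through BuildHasherDefault (a hasher type from fnv or hashers slots in the same way):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::BuildHasherDefault;

fn main() {
    // S = BuildHasherDefault<DefaultHasher>; any type implementing
    // Hasher + Default (e.g. from the fnv or hashers crates) can be
    // substituted here.
    let mut map: HashMap<String, u32, BuildHasherDefault<DefaultHasher>> =
        HashMap::with_hasher(BuildHasherDefault::default());
    map.insert("key".to_string(), 1);
    assert_eq!(map.get("key"), Some(&1));
}
```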


#3

The question then remains whether HashMap is really necessary, or whether a simple dictionary would be more efficient. Collisions shouldn’t be expected with SHA-256, so a lot of the internal logic of HashMap isn’t really necessary. Or is it?


#4

A collision-free SHA-256-based hash table would need a slot for each of the 2^256 ≈ 1.2 × 10^77 possible digests; at just one byte per slot, that is about 100 novemdecillion (10^62) petabytes of RAM. Everything smaller than that has to handle collisions.


#5

Sure. That is a bit large for a real computer. :wink: Hmm… Maybe I should write a different data structure, like some kind of custom dictionary. Everything should be as fast as possible…

Maybe use a bitwise trie?


#6

May I ask why you’re trying to do lookups using cryptographic hashes? Hash tables usually assume hashing is fast, and you usually don’t need the guarantees cryptographic hashes provide.


#7

It’s a distributed graph, and nodes are referenced by SHA-256 hashes of the actual node data, i.e. edges are 32-byte arrays, and hence the keys in the internal logic really are 32-byte arrays.

Of course you could just “rehash” these keys using the HashMap’s standard hash function, but that costs additional, unnecessary time. We expect 50,000 nodes per second, so rehashing could become a bottleneck. The main point is that we don’t need to compute the keys from the data, as this was already done; using the SHA-256 digests directly would save that “key generation time”.


#8

I’d probably put the SHA-256 hash as the key of the normal hash table, e.g. HashTable<Sha256::Digest, WhateverValue>.


#9

That would mean that every key needs to be rehashed, which is inefficient. I think I’ll go with binary tries instead. Thanks anyway.


#10

You can write a wrapper around sha2::Sha256 that implements std::hash::Hasher by truncating the hash result to its first 64 bits, but honestly I would recommend going with the simple “rehashing” solution first. You can always replace it with a more efficient solution later if needed, but I highly doubt it will be the bottleneck of your application (of course, don’t forget to swap the default hasher for a faster, non-DoS-resistant one).
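
A minimal sketch of such a wrapper, assuming the sha2 crate’s Digest API (the clone in finish is needed because Hasher::finish only takes &self):

```rust
use sha2::{Digest, Sha256};
use std::hash::{BuildHasherDefault, Hasher};

#[derive(Default, Clone)]
struct Sha256Hasher(Sha256);

impl Hasher for Sha256Hasher {
    fn write(&mut self, bytes: &[u8]) {
        // Feed all key bytes into the SHA-256 state.
        self.0.update(bytes);
    }

    fn finish(&self) -> u64 {
        // finalize() consumes the state, so finish a clone, then
        // truncate the 32-byte digest to its first 64 bits.
        let digest = self.0.clone().finalize();
        u64::from_le_bytes(digest[..8].try_into().unwrap())
    }
}

// Usable as the S parameter: HashMap<K, V, Sha256BuildHasher>.
type Sha256BuildHasher = BuildHasherDefault<Sha256Hasher>;
```

With this as the map’s S parameter, lookups compute SHA-256 over whatever the key’s Hash impl writes and keep the first 8 bytes of the digest.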


#11

Thanks for the advice, newpavlov. That sounds reasonable.


#12

The aforementioned fnv hasher is fairly quick, and 32 bytes is still (somewhat) in the “small” key range. So you can experiment with a HashMap that uses fnv’s hasher, feeding it the digest bytes. I’d at least try that before rolling a custom data structure. http://cglab.ca/~abeinges/blah/hash-rs/ has some analysis, and it appears it would more than suffice for your expected workload of 50k nodes/sec (at least in terms of hash speed).
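
A sketch of that experiment, using fnv’s FnvHashMap alias and the raw 32-byte digests as keys:

```rust
use fnv::FnvHashMap;

fn main() {
    // Keys are the precomputed SHA-256 digests; FNV rehashes the
    // 32 bytes cheaply instead of SipHash's DoS-resistant rounds.
    let mut nodes: FnvHashMap<[u8; 32], String> = FnvHashMap::default();
    let digest = [0u8; 32]; // stand-in for a real SHA-256 digest
    nodes.insert(digest, "node data".to_string());
    assert!(nodes.contains_key(&digest));
}
```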


#13

@vorner With HashTable<Sha256::Digest, WhateverValue>, we are talking about std::collections::HashMap, right? Just asking to be sure.


#14

BTW, the correct type for the hash table would be HashMap<GenericArray<u8, Sha256::OutputSize>, YourType> (OutputSize comes from either the FixedOutput or the Digest trait), and one day, with const generics, it will be HashMap<[u8; Sha256::OutputSize], YourType>.
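
That spelling is pseudocode; in real code the associated type needs fully qualified syntax. A sketch of the concrete type, assuming the digest crate’s Output alias (which expands to exactly that GenericArray):

```rust
use sha2::digest::Output;
use sha2::Sha256;
use std::collections::HashMap;

// Hypothetical value type, for illustration only.
struct NodeData;

// Output<Sha256> is the GenericArray<u8, _> digest type: a fixed
// 32-byte key that already implements Hash + Eq.
type NodeMap = HashMap<Output<Sha256>, NodeData>;

fn main() {
    use sha2::Digest;

    let mut nodes = NodeMap::new();
    let key = Sha256::digest(b"node data"); // returns Output<Sha256>
    nodes.insert(key, NodeData);
    assert!(nodes.contains_key(&key));
}
```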


#15

For cryptographic hashes, the entropy is evenly distributed across all bits, i.e. it’s perfectly fine to truncate these hashes: the first 32/64 bits are as random as all the other bits.

Hence, you can “hash” your SHA-based key by not hashing it at all, and just reading its prefix.
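
A minimal sketch of that idea, assuming keys are uniformly distributed 32-byte digests (the names here are made up for illustration):

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

#[derive(Default)]
struct PrefixHasher(u64);

impl Hasher for PrefixHasher {
    fn write(&mut self, bytes: &[u8]) {
        // Keep the first 8 bytes of the last chunk written. HashMap may
        // write a length prefix before the key bytes; the digest bytes
        // arrive last and overwrite it.
        let mut buf = [0u8; 8];
        let n = bytes.len().min(8);
        buf[..n].copy_from_slice(&bytes[..n]);
        self.0 = u64::from_le_bytes(buf);
    }

    fn finish(&self) -> u64 {
        // No hashing at all: the digest prefix is already uniform.
        self.0
    }
}

type DigestMap<V> = HashMap<[u8; 32], V, BuildHasherDefault<PrefixHasher>>;
```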


#16

Another thing you can use for perfectly distributed/random keys is a BTreeMap. It compares only as much of the key as it needs. I used it successfully to deduplicate the contents of all files on my entire disk.
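
For example, since [u8; 32] implements Ord, the digests work directly as BTreeMap keys:

```rust
use std::collections::BTreeMap;

fn main() {
    // No hashing at all: keys are compared lexicographically, and
    // uniformly random digests diverge after a byte or two on average.
    let mut nodes: BTreeMap<[u8; 32], String> = BTreeMap::new();
    nodes.insert([0u8; 32], "node data".to_string());
    assert!(nodes.contains_key(&[0u8; 32]));
}
```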