Hash Table based on Sha256

Is there a way to replace the hash function used in std::collections::HashMap by another hash function?

In particular, I'm interested in a crate that implements a hash table using SHA-256 (SHA-2). HashDoS attacks are of no concern in this scenario.

See HashMap in std::collections - Rust; in general, the hasher is the S type parameter of HashMap.

There are existing crates, like fnv and hashers, that you can look at for examples.
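For illustration, a minimal sketch of swapping in fnv's hasher via that S parameter (assuming the fnv crate, which exposes FnvBuildHasher):

```rust
use std::collections::HashMap;

use fnv::FnvBuildHasher; // fnv = "1.0"

fn main() {
    // The third type parameter S selects the BuildHasher;
    // HashMap::default() then constructs the FNV hasher state.
    let mut map: HashMap<[u8; 32], u64, FnvBuildHasher> = HashMap::default();
    map.insert([0u8; 32], 42);
    assert_eq!(map.get(&[0u8; 32]), Some(&42));
}
```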

The question then remains whether HashMap is really necessary, or whether a simpler dictionary would be more efficient. Collisions shouldn't be expected when using SHA-256, and therefore a lot of HashMap's internal logic isn't really necessary. Or is it?

A collision-free SHA-256-based hash table would require about 100 novemdecillion petabytes of RAM (2^256 one-byte buckets ≈ 10^62 petabytes). Anything less than that has to handle collisions.


Sure. That is a bit large for a real computer. :wink: Hmm... Maybe I should write a different data structure, like some kind of custom dictionary. Everything should be as fast as possible...

Maybe use a bitwise trie?

May I ask why you're trying to do lookups using cryptographic hashes? Hash tables usually assume hashing is fast, and you usually don't need the protections cryptographic hashes provide.

It's a distributed graph, and nodes are referenced by SHA-256 hashes of the actual node data, i.e. edges are 32-byte arrays, and hence the keys in the internal logic really are 32-byte arrays.

Of course you could just "rehash" these keys using the hashmap's standard hash function, but that costs additional, unnecessary time. We expect 50,000 nodes per second, and hence rehashing could be a bottleneck. The main point is that we don't need to compute the keys from the data, as this has already been done; using SHA-256 directly would save that "key generation time".

I'd probably put the sha256 hash as the key into the normal HashTable, e.g. HashTable<Sha256::Digest, WhateverValue>.
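In std terms that would be a HashMap keyed by the digest bytes; a rough sketch, assuming the sha2 crate:

```rust
use sha2::{Digest, Sha256};
use std::collections::HashMap;

fn main() {
    let mut nodes: HashMap<[u8; 32], Vec<u8>> = HashMap::new();

    let data = b"node payload".to_vec();
    // Sha256::digest returns a GenericArray<u8, U32>,
    // which converts into a plain [u8; 32].
    let key: [u8; 32] = Sha256::digest(&data).into();
    nodes.insert(key, data);
}
```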

That would mean every key needs to be rehashed, which is inefficient. I think I'll go with binary tries instead. Thanks anyway.

You can write a wrapper around sha2::Sha256 which implements std::hash::Hasher by truncating the hash result to its first 64 bits, but honestly I would recommend going with the simple "rehashing" solution first. You can always replace it with a more efficient solution if needed, but I highly doubt it will be a bottleneck of your application (of course, don't forget to replace the default hasher with a faster one that isn't HashDoS-resistant).
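A minimal sketch of such a wrapper, assuming sha2 0.10's Digest API (Sha256Hasher and Sha256BuildHasher are made-up names):

```rust
use sha2::{Digest, Sha256};
use std::hash::{BuildHasherDefault, Hasher};

#[derive(Default, Clone)]
struct Sha256Hasher(Sha256);

impl Hasher for Sha256Hasher {
    fn write(&mut self, bytes: &[u8]) {
        // Feed everything the map gives us into SHA-256.
        self.0.update(bytes);
    }

    fn finish(&self) -> u64 {
        // finalize() consumes the inner state, so clone first,
        // then truncate the 32-byte digest to its first 64 bits.
        let digest = self.0.clone().finalize();
        u64::from_le_bytes(digest[..8].try_into().unwrap())
    }
}

type Sha256BuildHasher = BuildHasherDefault<Sha256Hasher>;
```

A HashMap::<[u8; 32], YourValue, Sha256BuildHasher>::default() would then run every key through SHA-256 and keep only the first 64 bits as the internal hash.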


Thanks for the advice, newpavlov. That sounds reasonable.

The aforementioned fnv hasher is fairly quick, and 32 bytes is still (somewhat) in the "small" key range. So you can experiment with a HashMap that uses fnv's hasher, feeding it the digest bytes. I'd at least try that before rolling a custom data structure. http://cglab.ca/~abeinges/blah/hash-rs/ has some analysis, and it appears fnv would more than suffice for your expected workload of 50k nodes/sec (at least in terms of hash speed).


@vorner HashTable<Sha256::Digest, WhateverValue>; we are talking about std::collections::HashMap, right? Just asking to be sure.

BTW, the correct type for the hash table would be HashMap<GenericArray<u8, Sha256::OutputSize>, YourType> (OutputSize comes from either the FixedOutput or the Digest trait), and one day, with const generics, it will be HashMap<[u8; Sha256::OutputSize], YourType>.
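A small sketch of that, using the Output type alias that recent versions of the digest crate provide for exactly this GenericArray (NodeKey and NodeTable are made-up names):

```rust
use sha2::{digest::Output, Sha256};
use std::collections::HashMap;

// Output<Sha256> expands to GenericArray<u8, OutputSize>,
// i.e. a 32-byte array type that implements Hash and Eq.
type NodeKey = Output<Sha256>;
type NodeTable<V> = HashMap<NodeKey, V>;
```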

For cryptographic hashes the entropy is always evenly distributed across all bits, i.e. it's perfectly fine to truncate these hashes and the first 32/64 bits are as random as all the other bits.

Hence, you can "hash" your SHA-based key by not hashing it at all, and just reading its prefix.
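A hedged sketch of such a prefix-reading hasher (PrefixHasher is a made-up name; the nohash-hasher crate implements the same idea for integer keys):

```rust
use std::hash::{BuildHasherDefault, Hasher};

#[derive(Default)]
struct PrefixHasher(u64);

impl Hasher for PrefixHasher {
    fn write(&mut self, bytes: &[u8]) {
        // Just read the first 8 bytes; for a uniformly distributed
        // SHA-256 key, this prefix is already a good hash.
        let mut buf = [0u8; 8];
        let n = bytes.len().min(8);
        buf[..n].copy_from_slice(&bytes[..n]);
        self.0 = u64::from_le_bytes(buf);
    }

    // Hash for [u8; 32] also feeds a length prefix through here;
    // ignore it so only the digest bytes count.
    fn write_usize(&mut self, _: usize) {}

    fn finish(&self) -> u64 {
        self.0
    }
}

type PrefixBuildHasher = BuildHasherDefault<PrefixHasher>;
```

A HashMap::<[u8; 32], YourValue, PrefixBuildHasher>::default() then does no real hashing at all; bucket selection and collision handling still work as usual.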


Another thing you can use for perfectly distributed/random keys is a btree. It only compares as much as it needs to. I used it successfully to deduplicate the contents of all files on my entire disk.
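For instance, a tiny sketch with std's BTreeMap:

```rust
use std::collections::BTreeMap;

fn main() {
    // 32-byte digests as ordered keys; lookups compare bytes
    // lexicographically and usually decide within the first few.
    let mut by_digest: BTreeMap<[u8; 32], String> = BTreeMap::new();
    by_digest.insert([0u8; 32], "node data".to_string());
    assert!(by_digest.contains_key(&[0u8; 32]));
}
```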

Just to be sure that I understand the nohash-hasher correctly:

Let's assume we use 32-byte arrays as keys, filled with "random-like" bits from SHA-256. Of course these keys are too large to be used directly, so we "hash" them down to, say, 4 bytes (u32), which is the internal key in the hashmap. The fastest "hash" here is just truncation to the first four bytes.

However, nohash-hasher does not perform that 32-byte => 4-byte truncation itself, right?

Note, however, that one often cannot just use the truncated 4 bytes as the keys in the first place, as collisions might be way too likely then (by the birthday bound, even ~50,000 random 32-bit keys already give roughly a 25% chance of at least one collision). So one cannot use:

let mut table = HashMap::<u32, FooType, BuildNoHashHasher<u32>>::default()

because of collisions. What one really wants is to use the full 32 bytes as keys, with internal truncation to 4 bytes and additional collision handling, i.e. something like:

HashMap::<[u8; 32], FooType, BuildNoHashHasher<u32>>::default()

but if my intuition about nohash-hasher is correct, that does not work. It compiles, but does it work as expected? I mean, does it use [u8; 32] as the key, with internal truncation to 4-byte keys and proper collision handling?

If I remember correctly, Rust's Hash trait has a fixed output size (Hasher::finish always returns a u64), and the hasher doesn't change that, so you can't do anything about it.

Also, don't worry. Using the full [u8; 32] as the hash would be silly, because for hash tables a hash is not supposed to be a unique identifier, only a way to distribute values approximately equally among the existing buckets. Your hash table will have a number of buckets close to the number of items in it, which is likely smaller than 2^32 items, and definitely way smaller than 2^256 items.

For example, if your hash table contains 100 items, then it only needs a 7-bit hash (2^7 = 128 buckets) to work optimally.

Then the latter definition of the table should work just fine, with collision handling and all.