Hey all,
I have two questions that are closely related, so I've decided to put them in one topic.
- Is it considered harmful to use non-cryptographic hash functions (NCH) like
ahashfor equality comparisons? I think in general, yes, but please see the context further down below. - Do two instances of the same NCH with a random seed produce both a hash collision for the same values
aandb? Or in other words: when the first instance of the NCH with seedseed1produces a collision for valuesaandb, does the second instance withseed2also produces a collision foraandb(if I've understood the following article correctly(!), this should be true for murmurhash at least, but is it also true forahash)?
So I'm aware of the following two facts:
- given a hash function
h(x), and the two valuesaandb, ifh(a) != h(b), thena != b - given a (perfect) hash function
h(x)that produces a 64-bit value and the two valuesaandb, ifa != b, then the chance ofh(a) == h(b)(collision) is approx. 50%, when you hash 2^32 different values (the birthday problem/paradox)
ahash only has 64-bits, so the chance of a hash collision is pretty high (~50%), when we have more than 2^32 values (which are not many).
The following refers to my second question.
What if I now have two instances of ahash (ahash1, ahash2) each with a random seed and I hash my two values that I want to compare with both ahash1 and ahash2, so that I get a (u64, u64) tuple as a result hash? Does this result in a more "secure" hash? Or do I have to use two different hash functions all together?
Looking at the birthday problem/paradox: does the following now hold true?
chance of collision is now ~50% when we have 2^64 different values (which are quite a few values).
Why do I need to know all this?
I've written a csv diffing library in Rust, which aims to be the fastest in the world (comparison of 1 million rows with 9 columns of csv takes < 500ms). It uses ahash for comparison. But when I wrote it, I haven't thought about all the pitfalls I stated above and I hope we can fix it together (it's still in early alpha for a reason). ![]()
If all is lost, we can still use a cryptographically secure hash, like blake3, and all will be good, but performance will suffer of course (it will be down to ~730ms), which is still fast, I guess!?
I'm not quite sure, when someone really needs cryptographically secure hashes. Is it only, so that an attacker(!) can't provoke a collision or that (encrypted) files can be proven to not have been manipulated/changed?
Because I imagine our csv diffing library to not be used in "hostile" environments.
We could also "feature-gate" the hashing algorithm if that makes sense.
Sorry for this essay, but I want to make things right. ![]()
What is your opinion on this? I really appreciate your answers. ![]()