Hey all,
I have two questions that are closely related, so I've decided to put them in one topic.
- Is it considered harmful to use non-cryptographic hash functions (NCH) like `ahash` for equality comparisons? I think in general, yes, but please see the context further down below.
- Do two instances of the same NCH, each with a random seed, both produce a hash collision for the same values `a` and `b`? Or in other words: when the first instance of the NCH with seed `seed1` produces a collision for values `a` and `b`, does the second instance with `seed2` also produce a collision for `a` and `b`? (If I've understood the following article correctly(!), this should be true for murmurhash at least, but is it also true for `ahash`?)
So I'm aware of the following two facts:
- Given a hash function `h(x)` and two values `a` and `b`: if `h(a) != h(b)`, then `a != b`.
- Given a (perfect) hash function `h(x)` that produces a 64-bit value: for distinct values, the chance of at least one collision `h(a) == h(b)` reaches approx. 50% once you hash about 2^32 different values (the birthday problem/paradox).
`ahash` only produces a 64-bit value, so the chance of a hash collision is pretty high (~50%) once we hash more than 2^32 values (which is not that many).
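To put rough numbers on this, here is a small self-contained sketch of the standard birthday-bound approximation p ≈ 1 − e^(−n²/2^(b+1)) for n values hashed into b bits. This is just textbook math, not code from my library:

```rust
/// Approximate probability of at least one collision when hashing `n`
/// distinct values into a `bits`-bit output (birthday bound):
/// p ≈ 1 - exp(-n^2 / 2^(bits + 1))
fn collision_probability(n: f64, bits: i32) -> f64 {
    let x = (n * n) / 2f64.powi(bits + 1);
    // 1 - e^(-x), computed via exp_m1 so it stays accurate for tiny x
    -(-x).exp_m1()
}

fn main() {
    // 64-bit hash: ~39% after 2^32 values; it crosses 50% around 1.18 * 2^32.
    println!("{:.3}", collision_probability(2f64.powi(32), 64));
    // 128-bit hash: the same 2^32 values give roughly 2.7e-20.
    println!("{:.3e}", collision_probability(2f64.powi(32), 128));
}
```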
The following refers to my second question.
What if I now have two instances of `ahash` (`ahash1`, `ahash2`), each with its own random seed, and I hash both of the values I want to compare with ahash1 and ahash2, so that I get a `(u64, u64)` tuple as the resulting hash (a small sketch of what I mean follows below)? Does this result in a more "secure" hash? Or do I have to use two entirely different hash functions?
Looking at the birthday problem/paradox: does the following now hold true? The chance of a collision is now ~50% once we have 2^64 different values (which is quite a lot of values).
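For illustration, here is a minimal sketch of the `(u64, u64)` idea. It assumes ahash 0.8's `RandomState` and the standard `BuildHasher::hash_one` (Rust 1.71+); the helper name is just for illustration and is not from my library:

```rust
use std::hash::{BuildHasher, Hash};

use ahash::RandomState;

/// Hash one value with two independently seeded ahash instances,
/// producing a 128-bit (u64, u64) fingerprint.
fn double_hash<T: Hash>(s1: &RandomState, s2: &RandomState, value: &T) -> (u64, u64) {
    (s1.hash_one(value), s2.hash_one(value))
}

fn main() {
    // Each RandomState draws its own random seed.
    let s1 = RandomState::new();
    let s2 = RandomState::new();

    let a = "1,alice,42";
    let b = "2,bob,7";

    // For the diff we would compare these tuples instead of single u64 hashes.
    println!("{:?}", double_hash(&s1, &s2, &a));
    println!("{:?}", double_hash(&s1, &s2, &b));
}
```

Whether the two outputs actually behave like one independent 128-bit hash is exactly what I'm unsure about.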
Why do I need to know all this?
I've written a CSV diffing library in Rust, which aims to be the fastest in the world (comparing 1 million rows of CSV with 9 columns takes < 500 ms). It uses `ahash` for the comparison. But when I wrote it, I hadn't thought about all the pitfalls I stated above, and I hope we can fix this together (it's still in early alpha for a reason).
If all is lost, we can still use a cryptographically secure hash like blake3 and all will be good, but performance will suffer of course (it goes down to ~730 ms), which is still fast, I guess!?
I'm not quite sure when someone really needs cryptographically secure hashes. Is it only so that an attacker(!) can't deliberately provoke a collision, or so that (encrypted) files can be proven not to have been manipulated/changed?
Because I imagine our CSV diffing library won't be used in "hostile" environments.
We could also "feature-gate" the hashing algorithm if that makes sense.
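Something like this, for example. The `crypto-hash` feature name is made up just for illustration, and this is only a sketch, not how the library is structured today:

```rust
// Sketch: select the hashing backend at compile time via a Cargo feature.
// Assumes optional dependencies `ahash` and `blake3` in Cargo.toml and a
// hypothetical feature named `crypto-hash`.

#[cfg(not(feature = "crypto-hash"))]
pub fn hash_row(row: &[u8]) -> Vec<u8> {
    use std::hash::BuildHasher;
    // Fixed seeds so rows from both CSV files hash identically within one run.
    let h = ahash::RandomState::with_seeds(1, 2, 3, 4).hash_one(row);
    h.to_le_bytes().to_vec()
}

#[cfg(feature = "crypto-hash")]
pub fn hash_row(row: &[u8]) -> Vec<u8> {
    blake3::hash(row).as_bytes().to_vec()
}
```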
Sorry for this essay, but I want to make things right.
What is your opinion on this? I really appreciate your answers.