I have two questions that are closely related, so I've decided to put them in one topic.
- Is it considered harmful to use non-cryptographic hash functions (NCH) like
ahashfor equality comparisons? I think in general, yes, but please see the context further down below.
- Do two instances of the same NCH with a random seed produce both a hash collision for the same values
b? Or in other words: when the first instance of the NCH with seed
seed1produces a collision for values
b, does the second instance with
seed2also produces a collision for
b(if I've understood the following article correctly(!), this should be true for murmurhash at least, but is it also true for
So I'm aware of the following two facts:
- given a hash function
h(x), and the two values
h(a) != h(b), then
a != b
- given a (perfect) hash function
h(x)that produces a 64-bit value and the two values
a != b, then the chance of
h(a) == h(b)(collision) is approx. 50%, when you hash 2^32 different values (the birthday problem/paradox)
ahash only has 64-bits, so the chance of a hash collision is pretty high (~50%), when we have more than 2^32 values (which are not many).
The following refers to my second question.
What if I now have two instances of
ahash (ahash1, ahash2) each with a random seed and I hash my two values that I want to compare with both ahash1 and ahash2, so that I get a
(u64, u64) tuple as a result hash? Does this result in a more "secure" hash? Or do I have to use two different hash functions all together?
Looking at the birthday problem/paradox: does the following now hold true?
chance of collision is now ~50% when we have 2^64 different values (which are quite a few values).
I've written a csv diffing library in Rust, which aims to be the fastest in the world (comparison of 1 million rows with 9 columns of csv takes < 500ms). It uses
ahash for comparison. But when I wrote it, I haven't thought about all the pitfalls I stated above and I hope we can fix it together (it's still in early alpha for a reason).
If all is lost, we can still use a cryptographically secure hash, like blake3, and all will be good, but performance will suffer of course (it will be down to ~730ms), which is still fast, I guess!?
I'm not quite sure, when someone really needs cryptographically secure hashes. Is it only, so that an attacker(!) can't provoke a collision or that (encrypted) files can be proven to not have been manipulated/changed?
Because I imagine our csv diffing library to not be used in "hostile" environments.
We could also "feature-gate" the hashing algorithm if that makes sense.
Sorry for this essay, but I want to make things right.
What is your opinion on this? I really appreciate your answers.