Using ahash (non-cryptographic hash) for equality comparison considered harmful?

Hey all,

I have two questions that are closely related, so I've decided to put them in one topic.

  1. Is it considered harmful to use non-cryptographic hash functions (NCH) like ahash for equality comparisons? I think in general, yes, but please see the context further down below.
  2. Do two instances of the same NCH, each with a random seed, both produce a hash collision for the same values a and b? In other words: when the first instance of the NCH with seed seed1 produces a collision for values a and b, does the second instance with seed seed2 also produce a collision for a and b? (If I've understood the following article correctly(!), this should be true for murmurhash at least, but is it also true for ahash?)

So I'm aware of the following two facts:

  • given a hash function h(x), and the two values a and b, if h(a) != h(b), then a != b
  • given a (perfect) hash function h(x) that produces a 64-bit value and two values a and b, if a != b, then the chance of h(a) == h(b) (a collision) is approx. 50% once you hash 2^32 different values (the birthday problem/paradox; see the sketch below)
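As a sanity check on those numbers, here is a minimal sketch (my own, not from any library mentioned here) of the standard birthday-bound approximation for an ideal hash:

```rust
// Birthday-bound approximation for an ideal hash with 2^bits outputs:
// P(collision) ≈ 1 - e^(-n^2 / (2 * 2^bits)) after hashing n values.
fn collision_probability(n: f64, bits: i32) -> f64 {
    1.0 - (-(n * n) / (2.0 * 2f64.powi(bits))).exp()
}

fn main() {
    // For a 64-bit hash and n = 2^32 values this prints ≈ 0.393;
    // the ~50% point is reached a bit later, around n ≈ 1.18 * 2^32.
    println!("{:.3}", collision_probability(2f64.powi(32), 64));
}
```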

ahash only produces 64-bit hashes, so the chance of a hash collision is pretty high (~50%) once we have more than 2^32 values (which is not that many).

The following refers to my second question.

What if I now have two instances of ahash (ahash1, ahash2), each with a random seed, and I hash the two values I want to compare with both ahash1 and ahash2, so that I get a (u64, u64) tuple as the resulting hash? Does this result in a more "secure" hash? Or do I have to use two different hash functions altogether?
Looking at the birthday problem/paradox: does the following now hold true?
The chance of a collision is now ~50% when we have 2^64 different values (which is quite a lot of values).
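To make the question concrete, here is a minimal sketch of what I mean, assuming ahash's RandomState::with_seeds (the fixed seeds are placeholders; in practice they'd be random):

```rust
use std::hash::BuildHasher;
use ahash::RandomState;

// Two independently seeded ahash instances; the pair of 64-bit outputs
// acts as one 128-bit fingerprint. Whether this really behaves like a
// 128-bit hash is exactly my question above.
fn fingerprint(value: &str, h1: &RandomState, h2: &RandomState) -> (u64, u64) {
    (h1.hash_one(value), h2.hash_one(value))
}

fn main() {
    // Placeholder seeds for the example; use random seeds in practice.
    let ahash1 = RandomState::with_seeds(1, 2, 3, 4);
    let ahash2 = RandomState::with_seeds(5, 6, 7, 8);
    println!("{:?}", fingerprint("a,b,c", &ahash1, &ahash2));
}
```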

Why do I need to know all this?

I've written a csv diffing library in Rust, which aims to be the fastest in the world (comparing 1 million rows of csv with 9 columns takes < 500ms). It uses ahash for comparison. But when I wrote it, I hadn't thought about all the pitfalls I stated above, and I hope we can fix it together (it's still in early alpha for a reason). :blush:

If all is lost, we can still use a cryptographically secure hash, like blake3, and all will be good, but performance will suffer of course (it will be down to ~730ms), which is still fast, I guess!?

I'm not quite sure when someone really needs cryptographically secure hashes. Is it only so that an attacker(!) can't provoke a collision, or so that (encrypted) files can be proven not to have been manipulated/changed?
Because I imagine our csv diffing library won't be used in "hostile" environments.
We could also "feature-gate" the hashing algorithm if that makes sense.


Sorry for this essay, but I want to make things right. :pleading_face:

What is your opinion on this? I really appreciate your answers. :heart:

No, I think cryptographic hashes are also used when you want to be pretty confident that different contents hash to different values. For example, Git uses SHA-1, and a migration to SHA-256 is being considered precisely because SHA-1 was found to be insufficiently strong. Non-cryptographic hash functions are pretty prone to collisions when applied to real data, so I think you should use a cryptographic hash function here as well.

I think the second hash function with the same algorithm but a different seed qualifies as a "different" hash function, and it is less likely to have a collision where the first one has one. However, this doesn't change the fact that a non-cryptographic hash function is much weaker, so collision probabilities computed purely from the birthday paradox are likely way too optimistic (i.e., collisions will be much more frequent in reality, since the output of a non-cryptographic hash is not of high enough quality to be considered truly uniformly distributed).

3 Likes

To answer the question about using the hash twice with a different seed, well, it depends on the hash function. It holds if you have a perfect hash function, sure, but your hash function is not perfect.

For example, in an extreme case, if your hash function is h(input, seed) = sha512(input) xor sha512(seed), then the chance of a collision when using h once is equal to the chance of collision when using the same hash function twice with two seeds.

More generally, in the academic literature on hash functions, you will find various more fine-grained variations of your second property that do hold for hash functions people use in the real world (i.e. hash functions that are not perfect). For example, most hash functions are not fully independent, which makes their actual collision probabilities worse than what the birthday-paradox analysis predicts (sometimes much worse); the birthday paradox assumes full independence, which only perfect hash functions satisfy.

5 Likes

Please elaborate on what "use for equality comparisons" means. In general, you cannot avoid collisions when hashing arbitrary data, so you can use a hash to answer "no" to the question "are these things equal?" but you cannot use a hash to answer "yes"; you must still test whether the things are actually equal. This is why e.g. HashMap::get requires Q: Hash + Eq and not merely Q: Hash - if the hash is the same, the actual values must still be compared.
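As an illustration, here's a sketch with std's default hasher (morally what HashMap does, not its actual internals):

```rust
use std::collections::hash_map::RandomState;
use std::hash::BuildHasher;

// A hash mismatch proves inequality; a hash match proves nothing on its own.
fn definitely_not_equal(a: &str, b: &str, s: &RandomState) -> bool {
    s.hash_one(a) != s.hash_one(b)
}

fn equal(a: &str, b: &str, s: &RandomState) -> bool {
    // Cheap hash-based reject first, then the authoritative comparison --
    // the same division of labor as requiring Hash + Eq.
    !definitely_not_equal(a, b, s) && a == b
}

fn main() {
    let s = RandomState::new();
    assert!(equal("foo", "foo", &s));
    assert!(!equal("foo", "bar", &s));
}
```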

Let me repeat - you cannot use a hash to prove two things are the same. (You can use a hash to prove they are not the same, though.) If you have two different items, there is at least a 1 in 2^64 chance they will hash to the same value anyway. If you have 5 billion different items, there's roughly a 1 in 2 chance that at least one pair of them will hash to the same value. That's true even if the hash function is cryptographic.

The question of whether a cryptographically secure hash is required or not is application-specific, and orthogonal to the question of equality. If your data can't be manipulated adversarially, you don't need a cryptographic hash; just a regularly decent hash function will do. And HashMap also supports this use case; you just need to tell it which hasher to use. I imagine a similar approach would work for your library.

9 Likes

Crypto details are over my head but I think this is a good idea. Give your users choice, speed vs accuracy.

Thank you @H2CO3 @alice @trentj for your elaborate answers! :heart:

I really learned a lot. If I could, I'd mark the answers of all three of you as solution, but because @H2CO3 was first, I'll give @H2CO3's answer the green checkmark. :wink: :white_check_mark:

I'll go with the cryptographically secure hash now. After all, correctness is much more important than speed. Otherwise the risk of collisions is just too high, as you have taught me.

And btw, blake3 is really fast for a cryptographically secure hash! :rocket:

I wonder whether I should create a PR for the ahash repo, adding a note/warning to the README not to use it for checking equality of values (it already states that it is meant to be used exclusively in HashMaps, but maybe an additional warning doesn't hurt ¯\_(ツ)_/¯).

@stonerfish Generally, I'd agree to give users choices, but taking into account the answers that have been given, using a non-cryptographic hash here is wrong, and I don't want to expose a wrong algorithm to the users of the library.

This sentence worries me. Your program should be correct regardless of whether or not it is cryptographically secure. Otherwise it sounds like you are not properly handling collisions.

4 Likes

@alice Hm...I'm not quite sure what you mean, but now, I'm worried, too. :dotted_line_face:

To give more context: currently, the library uses ahash to hash single csv lines and compares the hashes. If hashes are equal, it (wrongly!) assumes that the lines are equal (wrong, because collisions can happen quite often).

Now, with what I've learned from your answers: a cryptographically secure hash is basically collision-free (not quite true, but it would take an infeasible amount of time to find a collision), so the implementation should be correct (comparing secure hashes instead of the actual content, and not worrying about collisions).

Are my assumptions flawed as a whole? Sorry, I'm confused. :blush:

Hmm, maybe it's okay after all.

1 Like

I think it depends greatly on the probability of those collisions. For example, I'm completely fine "not handling" uuid::new_v4() collisions or SHA3-256 collisions -- I'll worry more about wolves.

But I absolutely agree that it's essential to handle them for non-cryptographic hashes, or even for cryptographic hashes that are too short -- 16 bits of SHAKE256 output isn't long enough to be secure against even random preimage attacks, even though it's a very secure XOF.

6 Likes

There should be no difference in collision rate between a cryptographically secure hash and a good non-cryptographic hash, so I think maybe you are still a bit confused. Cryptographic doesn't mean "collisions are so rare we can ignore them"; it means something more like "collisions cannot be orchestrated more efficiently than brute force using known mathematical techniques". Orchestrated is the key word: you don't need a cryptographic hash except to deal specifically with adversarial input, which is why HashMap is generic and allows any kind of hash function (it's reasonable for the same program to need both kinds of hashes).

(Note that the default hasher for HashMap is actually not cryptographic, just DOS-resistant, which is a weaker guarantee.)

Using ahash doesn't mean collisions will be more frequent, or that the consequences of a collision will be worse, than with blake3 - when not dealing with adversarial input, that's much more dependent on how you're using the hashes. It sounds to me like your application (large data sets with possible birthday problems) makes collisions something you need to deal with, not just ignore. Switching to a cryptographic hash doesn't solve that problem at all; you need to either increase the number of bits in the hash until the risk of collision is tolerable, or change your algorithm so that it is still correct even with some collisions (see the sketch below).
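For instance, one way to stay correct with any hash function (a sketch of my own, not your library's actual code) is to use the hash only as a bucket key and confirm candidate matches on the raw line content:

```rust
use std::collections::HashMap;
use std::hash::BuildHasher;

// Sketch of a collision-tolerant design: the 64-bit hash only selects a
// bucket; equality is always confirmed on the raw line content, so a
// collision costs a little time, never correctness.
fn lines_in_both<'a>(
    left: &[&'a str],
    right: &[&'a str],
    s: &ahash::RandomState,
) -> Vec<&'a str> {
    let mut buckets: HashMap<u64, Vec<&'a str>> = HashMap::new();
    for &line in left {
        buckets.entry(s.hash_one(line)).or_default().push(line);
    }
    let mut both = Vec::new();
    for &line in right {
        if let Some(candidates) = buckets.get(&s.hash_one(line)) {
            // The hash match is only a hint; this comparison keeps the
            // result correct even if two different lines collide.
            if candidates.contains(&line) {
                both.push(line);
            }
        }
    }
    both
}

fn main() {
    let s = ahash::RandomState::new();
    let common = lines_in_both(&["1,a", "2,b"], &["2,b", "3,c"], &s);
    assert_eq!(common, vec!["2,b"]);
}
```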

8 Likes

Note that this is only true for random inputs. For non-cryptographic hashes, it's usually the case that adversarial input can easily be crafted to collide -- that's what makes them non-cryptographic in the first place.

A CRC, for example, is based on a generator polynomial and is affine, so it's trivial to use math to create collisions: https://en.wikipedia.org/wiki/Cyclic_redundancy_check#Data_integrity

EDIT: Sorry, I re-read your post and you do address this.

3 Likes

Oh, ok, thanks, I've interpreted the other answers differently.

But this holds true only under the assumption that both hash functions produce hash values of equal bit-length, which is not the case here: ahash produces 64-bit hash values, while blake3 produces 256-bit values. It follows that blake3 can produce far more hash values before a collision becomes likely.

Yes, this is what it comes down to. Sorry, I should have stated earlier that ahash produces 64-bit values while blake3 produces 256-bit values (see my point above).

In conclusion

blake3, which produces 256-bit hash values, can be used for equality comparison (especially for showing that two values are equal), because the number of values it would take to make a collision likely is astronomically high and couldn't be hashed in the lifetime of the universe.
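For the record, the per-line fingerprinting I have in mind is just this (a minimal sketch with the blake3 crate):

```rust
// 256-bit blake3 fingerprint of a single csv line.
fn line_fingerprint(line: &str) -> [u8; 32] {
    *blake3::hash(line.as_bytes()).as_bytes()
}

fn main() {
    // Equal fingerprints are treated as equal lines; at 256 bits the risk
    // of an accidental collision is negligible for any realistic input.
    assert_eq!(line_fingerprint("a,b,c"), line_fingerprint("a,b,c"));
    assert_ne!(line_fingerprint("a,b,c"), line_fingerprint("a,b,d"));
}
```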


Thank you all for this interesting, civilized and helpful discussion. :smiling_face_with_three_hearts: I think this is quite an important and delicate topic. :sparkles:

1 Like

That's not necessarily true.

In fact, very simple hash functions can give you the minimum possible probability of collision for any set of inputs (even adversarial).

For instance, if your keys are integers, the classic Carter-Wegman hash function family gets you there:
hash(x) = ((a * x + b) mod p) mod m
where p is a prime >= max(x+1, m), a is random from 1 to p-1, and b is random from 0 to p-1.

This gives at most a 1/m collision probability for any pair of distinct inputs.
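A minimal sketch of this family in Rust, assuming 32-bit keys and the Mersenne prime p = 2^61 - 1 (the parameters below are placeholders; draw a and b from a real RNG and keep them secret):

```rust
// Carter-Wegman multiply-add family: hash(x) = ((a*x + b) mod p) mod m.
const P: u128 = (1 << 61) - 1; // Mersenne prime, >= max(x + 1, m) for u32 keys

struct CwHash {
    a: u128, // random in [1, p-1]
    b: u128, // random in [0, p-1]
    m: u128, // number of possible hash values
}

impl CwHash {
    fn hash(&self, x: u32) -> u64 {
        // a * x + b < 2^93 + 2^61, so the u128 arithmetic cannot overflow.
        (((self.a * x as u128 + self.b) % P) % self.m) as u64
    }
}

fn main() {
    // Placeholder parameters for the example.
    let h = CwHash { a: 0x1234_5678_9abc_def1, b: 42, m: 1 << 32 };
    println!("{}", h.hash(12345));
}
```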

If all you care about is minimizing the expected numbers of collisions for some set of keys, this is an appropriate tool for the job -- as long as you never publish the random parameters or the generated hashes for the attackers to know.

And yes, you can combine two random 32-bit hash functions from such a family to get 2^-64 probability of collision.

For more, see: How to pick a hash function.

2 Likes

That hash function generally is not good enough. It is true that, given any two (possibly adversarially chosen) x and y that are not equal, that hash function will guarantee that P[h(x) = h(y)] ≤ 1/m when the seed is chosen uniformly at random, but the hash function is only 2-independent, so the guarantee does not generalize to when you have more than two values. This means that given four values x, y, z, w, the probability P[h(x) = h(y) AND h(z) = h(w)] can be greater than 1/m². An adversary would definitely be able to give you a long list of elements with way more collisions than they could with a perfect hash function.

For example, when a hashmap uses linear probing, the hash function needs to be 5-independent to guarantee constant-time operations for arbitrary inputs, including adversarial ones (source). The hash map would not have constant time operations given your hash function.

4 Likes

Yes. But this is not a property that is relevant for the use case described in the OP.

If you have 2^32 numbers and use this twice, with 2 × 64 = 128 bits in total (as the OP wanted), you get a guaranteed upper bound on the probability of there being any collision at all: 2^-65.
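The arithmetic is just a union bound over all pairs:

```latex
\Pr[\text{any collision}]
  \le \binom{2^{32}}{2} \cdot 2^{-128}
  \approx 2^{63} \cdot 2^{-128}
  = 2^{-65}
```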

Note: this is better than with SHA-256, because it is true no matter what computational power the attacker has.

1 Like

Wow, this is super interesting, @tczajka! :hugs: I've never heard of universal hash functions before. :exploding_head: I have to look into this! Thank you for the link (I think it is very valuable to link sources).

Although, I have to admit, I'm slowly doubting my ability to judge which solution is now right for my problem. :thinking:

Yes, this is exactly what I want!
Although, I'm a bit bewildered by all the other answers now. :woozy_face: I have to look further into this.

But please keep discussing. This is very insightful and I'm sure we can get to an optimal solution for my use case! :nerd_face:

@janriemer Since it sounds like you're hashing large files, see notes about hashing blocks of data without modular arithmetic and hashing variable-length data.

To go for maximum performance, you probably want to split your files into blocks of constant length, hash those, and then apply the slower variable-length hash function to those sequences of hashes.
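Roughly like this (a hypothetical sketch; block_hash and outer_hash stand in for whatever fixed-length and variable-length hash functions you pick, and the block size is an arbitrary choice):

```rust
// Two-level scheme: hash fixed-size blocks first, then hash the
// concatenated block hashes with the slower variable-length function.
fn two_level_hash(
    data: &[u8],
    block_hash: impl Fn(&[u8]) -> u64,
    outer_hash: impl Fn(&[u8]) -> u64,
) -> u64 {
    const BLOCK: usize = 4096; // assumed block size
    let mut block_hashes = Vec::new();
    for chunk in data.chunks(BLOCK) {
        block_hashes.extend_from_slice(&block_hash(chunk).to_le_bytes());
    }
    outer_hash(&block_hashes)
}

fn main() {
    // Toy stand-in hash functions, just so the sketch runs on its own.
    let h = two_level_hash(
        b"some,csv,data",
        |b| b.len() as u64,
        |b| b.iter().map(|&x| x as u64).sum(),
    );
    println!("{h}");
}
```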

2 Likes

Thank you for the link to part 2 of the blog post. I still have to look into it, though.

Unfortunately, in my case this is not possible, because I need to hash and compare individual csv lines. Every line is assumed to have a sort of primary key (e.g. the first column), so I can identify that line in the other file, if present. I have to hash line by line, because the csv lines could be in a different order between the two files.

But yes, you are right, if they were in the same order, your approach would be really fast. :slightly_smiling_face:
Thank you for the tips and links, I'll definitely look into them! :+1: So much to learn here. :nerd_face: