Are the First 32 Bits of an FNV Hash Good Enough in My Scenario?

I've got a simple scenario: I have a hashable ID that I want to store in a JavaScript number, which is essentially an f64. Since the result of hashing something is a u64, which can't fit losslessly into an f64, I'm wondering if I can get away with just taking the first 32 bits of the hash, which, if I understand correctly, can be stored losslessly in an f64.

I will probably have fewer than 1000 objects that I'm hashing; I just need them not to collide.

I guess that this depends on the hasher, so assuming I'm using fnv, do you think that I should be safe? I feel like I should be more than safe, but I just wanted to make sure.

Need or want?

1000 ≈ 2^10, so the birthday bound says you need about 20 bits for a 50% chance of collision.

Thus 32 bits is probably fine for a hash table, but if you need a no-collision guarantee then it's definitely insufficient.
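
For reference, the usual birthday-problem approximation behind that estimate, with n items and a b-bit hash:

p(n, b) ≈ 1 − e^(−n(n−1)/2^(b+1))

Setting p = 1/2 gives n ≈ sqrt(2^(b+1) · ln 2) ≈ 1.18 · sqrt(2^b), which for b = 20 comes out to roughly 1200 items, right around the 1000-object range above.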

2 Likes

It's safe even if it collides, but collisions might cause surprising bugs. It's for a video game, and I'm using the ids to identify assets.

According to this doc it looks like up to 52 bits can safely fit into a number, so maybe I just take the u64 and null out 12 bits to make it fit safely.

Does that sound reasonable? And would this effectively do that:

let hash: u64 = todo!();
let safe_hash = hash >> 12;
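
For context, a minimal sketch of the whole conversion; the explicit safe-integer check and the f64 cast here are just illustrative, and the hash value is a placeholder:

fn main() {
    let hash: u64 = 0x9e37_79b9_7f4a_7c15; // placeholder hash value
    let safe_hash = hash >> 12; // keep only the top 52 bits

    // 2^53 - 1 is the largest integer an f64 (a JS number) represents exactly,
    // so a 52-bit value is comfortably inside that range.
    assert!(safe_hash < (1u64 << 53));
    let js_number = safe_hash as f64;
    assert_eq!(js_number as u64, safe_hash); // round-trips losslessly
}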

So you're trying to dedup textures, or something? Can you not just use a HashSet that will manage the collisions properly?

Also, depending exactly what you mean by "assets", generating incrementing numbers with an AtomicU32 is a pretty common approach for game ids.
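
For example, a minimal sketch of that pattern (the names here are made up):

use std::sync::atomic::{AtomicU32, Ordering};

// Hypothetical global counter: each call hands out the next unused id.
static NEXT_ASSET_ID: AtomicU32 = AtomicU32::new(0);

fn new_asset_id() -> u32 {
    NEXT_ASSET_ID.fetch_add(1, Ordering::Relaxed)
}

fn main() {
    let a = new_asset_id();
    let b = new_asset_id();
    assert_ne!(a, b); // ids never repeat as long as the counter doesn't wrap
}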

1 Like

Yeah, I thought of that, but that would require a registration step to make sure they're all unique, and then state to store the ID. Which is possible, but less convenient in my scenario.


To explain more specifically, I've got JavaScript scripts loaded as Bevy assets. Each script, being a Bevy asset, already has a unique HandleId, and I want to represent that handle ID as a unique number in the JS runtime.

I could use a string, but strings have a conversion cost when passing over the FFI because of the UTF-8 to UTF-16 conversion. I could also use an object with a pair of numbers representing the two 32-bit halves of the u64, but that means I can't natively compare them in JavaScript with ===.

So I was trying to find the best way to fit a hash into a JS number.

Had you considered using bigint instead of number? The bigint primitive is an arbitrary precision integer, and should be supported by most modern browsers.

There are also From implementations converting all the integer types to a js_sys::BigInt, so it should be pretty easy to convert your hash into a bigint.
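
For instance, a minimal sketch of the Rust side, using one of those From impls (the function name is made up, and this only compiles for a wasm target with js-sys available):

use js_sys::BigInt;

// All 64 bits of the hash survive the trip into a JS bigint.
fn handle_to_js_id(hash: u64) -> BigInt {
    BigInt::from(hash)
}

On the JS side the returned values are ordinary bigint primitives, so id1 === id2 compares them by value.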

The benefits are that bigint is a proper JavaScript primitive, so === will work like normal. It also means we leave all 64 bits of the hash untouched, which feels more correct even if @scottmcm's back-of-the-envelope calcs show that 32 bits is more than enough to avoid collisions in practice.

2 Likes

I'll look into that, thanks for the suggestion. I think the biggest question is whether serde_wasm_bindgen and serde_v8 both support deserializing from bigints.

Even with a perfect hashing algorithm, collisions are a matter of probability. Before answering that question, you would need to decide which probability of a collision you are willing to accept.

For the probability to be practically zero (for any practical number of stored objects and any repetition count of the procedure), you should use at least 160 bit hashes (better 256 or 512 bits). Another option is to cope with collisions and handle them properly.

See Birthday_attack#mathematics on Wikipedia. According to that table, an (ideal) 32-bit hash would collide with a probability of 0.1% if 2900 elements are inserted. Thus in one out of a thousand runs you would have a collision.

Note that the more often you run the program (with different input), the higher the chance that a collision happens during one of those runs. If your program is used by 1000 people in 1000 scenarios, then the probability that a collision occurs in at least one of them will be roughly 1,000,000 times higher than calculated (as long as the result stays small). Hence I would be very cautious with short hash lengths and either

  • handle collisions properly, or
  • use 160 bits or more.
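
As a quick sanity check of those figures, a small sketch of the same birthday approximation (the inputs are just the numbers from the table above):

// Approximate probability of at least one collision when inserting
// `n` random values into a space of 2^`bits` possible hashes.
fn collision_probability(n: f64, bits: u32) -> f64 {
    let space = 2f64.powi(bits as i32);
    1.0 - (-n * (n - 1.0) / (2.0 * space)).exp()
}

fn main() {
    let p = collision_probability(2900.0, 32);
    println!("{p:.4}"); // ~0.0010, i.e. the 0.1% figure for a 32-bit hash
    // Across a million independent runs, the chance that at least one of
    // them sees a collision approaches 1 - (1 - p)^1_000_000, i.e. ~1.
    println!("{:.4}", 1.0 - (1.0 - p).powi(1_000_000));
}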
1 Like

The hash is derived from the asset filepath, so the value doesn't change between users, which helps cut down on the number of iterations.

I'm a little confused, though, how a HashMap can get away with a 64-bit hash, and a UUID can get away with only 128 bits, if 160 bits is required to avoid collisions.

Either way, I don't think I need more than 64 bits, because even if I used more bits, I'd still be in just as much trouble, because Bevy uses 64-bit hashes for its handles.


Unfortunately, it seems that serde_v8 doesn't support deserializing from BigInts, so that might not be an option for me either.


I think for now I'll just use a base64-encoded string to represent the hash in JS; that way I'll be as collision resistant as Bevy, and hopefully I don't suffer too much from the string conversion cost. If profiling reveals it to be an issue, I'll revisit it.
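
A minimal sketch of that encoding, assuming the base64 crate's Engine API (0.21+); the exact engine choice and function name are arbitrary:

use base64::{engine::general_purpose::STANDARD_NO_PAD, Engine as _};

// Encode the full 64-bit handle hash as a short ASCII string for JS.
fn hash_to_js_string(hash: u64) -> String {
    STANDARD_NO_PAD.encode(hash.to_le_bytes())
}

fn main() {
    // 8 bytes become an 11-character string, so the UTF-8 -> UTF-16 copy stays tiny.
    let s = hash_to_js_string(0xdead_beef_cafe_f00d);
    assert_eq!(s.len(), 11);
    println!("{s}");
}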

HashMap tolerates collisions. It never assumes two keys are equal based on hash; it always compares with Eq. If entries in the map have the same hash, then they are stored in the same "bucket", and a lookup is a linear search of that bucket. This is why HashMap keys must implement Eq as well as Hash.

This is what the Hash trait is meant for. If it were meant for collision-free hashing, its output would be bigger than 64 bits.
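
As a small illustration of that Eq fallback, here is a sketch with a deliberately degenerate (hypothetical) key type whose hashes always collide:

use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Every key hashes to the same value, so all entries share one bucket.
#[derive(PartialEq, Eq)]
struct AlwaysCollides(u32);

impl Hash for AlwaysCollides {
    fn hash<H: Hasher>(&self, state: &mut H) {
        state.write_u64(0);
    }
}

fn main() {
    let mut map = HashMap::new();
    map.insert(AlwaysCollides(1), "first");
    map.insert(AlwaysCollides(2), "second");
    // Both entries survive: equality, not the hash, decides key identity.
    assert_eq!(map[&AlwaysCollides(1)], "first");
    assert_eq!(map[&AlwaysCollides(2)], "second");
}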

5 Likes

That said, issue #10389 is still open: std::any::TypeId is a 64-bit value calculated using a hash algorithm, and collisions are not handled. But that's something that's planned to be fixed, as I understand it.

Yeah, I've seen that one. It seems like one of the great hidden soundness bugs of Rust, given how many people probably rely on the feature, possibly without even knowing it.

But so far it hasn't seemed to bite too many people, as far as I can tell.

Definitely hope that gets fixed eventually, and yeah, I'm pretty sure they've put some work into trying to resolve it.

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.