Checksum Rust complex types (struct, enum, tuple, etc)

Hello everyone

I have a design problem I am trying to solve using Rust that I have previously solved for relational databases using SQL for many years now (example: SQL Server checksum, checksum_agg, binary_checksum functions, or Apache Spark SQL built-in functions).

I am aware of Rust std::hash, but I dont know if this is best or even applicable.

Consider an in-memory relational-like table (using SQL terms) which has a bunch of rows each row having a number of columns. Let us say each row is a Rust struct or enum or tuple and a group of such rows is represented in-memory as a Rust BTreeMap (to ensure same sort order) where K is the BtreeMap Key and V is a Rust struct/enum/tuple with data columns.

I would like to be able to compute a checksum across all data-fields in V so I can store checksum as part of the V itself (like a hidden-signature/fingerprint column). I want to use checksum to detect if any of V fields is changed because recomputing the checksum would mean a new checksum will not match the stored original checksum.

So I want to use the checksum only for detecting data changes in any field value of the struct/enum/tuple and to be able to simply compare two rows for equality/not-equality (I dont need > greater or < less support, only need == and != support).

My second (related) problem is ability to thus checksum the entire table (entire BTreeMap) with the above Keys and Values(structs) so I can quickly compare the entire table to detect any data changes in any of the rows or to detect if some rows were removed or some new rows were inserted. I want to quickly determine it by comparing two tables "total-checksums", even of row count may match (a delete followed by insert but new row contents are different).

I am not planning to use this for security/crypto purposes, my only use is for data equality and for data-changed detection purposes. The checksum function needs to be inexpensive/fast with very low probability of checksum collisions so that I dont end up with undetected data change (i.e. producing same checksum value for different struct or rows of structs).

Essentially (as a former C-programmer) it is a matter of taking any struct/tuple/enum in its entirety represented as a slice of unsigned bytes [u8], computing a checksum to compare against another instance of same type.

Or perhaps there are other good Rust based solution that you may suggest.

I prefer to use existing checksum-library-crate rather than writing one myself because bit-twiddling is no longer my specialty. LOL.

Many thanks for any guidance.

Sounds like the standard Hash is almost exactly what you're looking for. Only question is whether the 64 bit hash is sufficient to keep collisions at a level you're comfortable with.

1 Like

thanks,

if std::hash turns out, by consensus, to be the best approach, then I will use that.
I guess that would have to implement checksum using std::hash by-hand for any Rust struct/enum/tuple type that I create.

I think this depends greatly on what's ok for collisions.

If you're depending on the hash changing to detect that you need to do an update, then Hash isn't enough, and you'll need a full cryptographic hash.

If it's just an optimization to fast-path things, then sure, Hash is probably great.

1 Like

Thank you. Would you hazard a rough guess of probability of such collisions (undetected changes but checksums do match) using standard Rust library hasher, in some order of percentage points? Is this 1 collision in how many? If its 1 in 100 "wrong" (i.e. 1% collisions) then indeed I would be aiming for a full cryptographic hash (any suggestions, please?) but if it's 1 in a >1 Million then I may be able to live with standard hasher.

They're probably in the awkward middle ground where you'd plausibly never see them in testing, but you'd get a few weird issues a day in the wild if you got popular.

TBH, if you use a non-cryptographic hash, you should probably have a mode for testing where the hash is intentionally bad (like where you %100 the hash to force more collisions) so you can make sure they're handled appropriately.

Note that, for hash table use, occasionally having two things map to the same bucket isn't a problem, as the critical thing is to avoid lots of things mapping to the same bucket. So it's possible that something which works great for hash tables might turn out to be unexpectedly bad for your use case.

But only you can know how bad collisions are for you, and whether your particular use is subject to birthday problems that make problems far more likely.

Any of the ones that have been finalists in the big competitions with plenty of eyes on them are fine. Personally I'm fond of SHA-3 - Wikipedia, the winner of NIST's latest competition, but there are plenty of good ones.

(Mostly it's a political choice: some won't trust some of the competition venues, though of late the entrants are typically all global and open.)

1 Like

Much obliged!
What specific Rust crates can I use for a full cryptographic hash such, as SHA-3 ?
There is on crates.io a sha3 crate but has not been updated for a year. There is also a blake3 (not sha-3) that appears to be suitable and a bunch of others. Its hard to choose.

The blake3 hash is cryptographically secure. All of the hashes in this repository were made by the Rust crypto org, and are hash algorithms written in pure rust. There is also crates like openssl and ring which do not use pure rust hashes but hashes either written in assembly or C/C++. Selecting which one to use is purely your preference (as long as it hasn't been broken). The hashes I'd recommend, in order of preference and speed, however, are blake3, blake2, sha-2, or sha-3. The vast majority of other hashes in the Rust Crypto hashes repository are lesser-known algorithms and whether they're secure or not isn't entirely known, but they are secure as of the time that table was constructed (that is, we don't know of any attacks that've been conducted on them yet).

1 Like

thank you, I have started using the blake3 crate.

To my mind the whole point of using the standard Hash is that you don't need to do that, you can just derive it.

1 Like

If you use a universal hashing Hasher, then you could guarantee 2^-64 collision probability for any pair of inputs, i.e. n * (n-1) / 2^65 upper bound for collision probability of any n inputs.

In a sense that's even better than a crypto-hash, which only makes a heuristic claim, while universal hashing gives a guaranteed upper bound for probability. And it's faster than crypto.

Unfortunately the std Hasher isn't quite a universal hashing Hasher. Not sure if there is a nice crate that implements one. It should be doable to implement a universal hasher that works with any type that implements std::Hash.

1 Like

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.