Possible to memoize hash values?

I have a situation where I'm inserting a large, immutable value into multiple HashSet's, and it would be extremely helpful if I could avoid repeatedly computing the value's hash on each insertion. Is there a good way of doing this?

I've been thinking about building a wrapper that implements hash::Hash by hashing the underlying resource once, caching it, then calling write_u64 with the resulting hash each time. Something along these lines (almost certainly broken code):

struct HashMemoizer<V> {
    cached_hash: u64,
    value: V,
}
impl<V> Hash for HashMemoizer<V> {
    fn hash<H>(&self, state: &mut H) where H: Hasher + Default {
        if !self.cached_hash {
            let child_state = <H as Default>::default();
            self.value.hash(child_state);
            self.cached_hash = child_state.finish();
        }
        state.write_u64(self.cached_hash);
    }
}

Another possiblity I considered was building a version of HashSet that accepted explicit hash values, but that seemed like a lot more work.

-Ryan

3 Likes

The more I think about this, the more I think building a custom map/set implementation might actually be the way to go. Something that takes a u64 hash and a value, allowing you to "memoize" the has hash outside the map/set implementation:

let key = vec![1,2,3,4,5];
let mut state = SipHash::default();
key.hash(state);
let hash = state.finish();

let mut set = HashValueSet::new();
set.insert(hash.clone(), value);
set.contains(&hash);
set.remove(&hash);

let mut map = HashValueMap::new();
map.insert(hash.clone(), "value");
map.get(&hash);
map.remove(&hash);

I have a similar situation with HashMap, and in my specific case I don't actually need the map to retain the key value, so a custom implementation could be helpful there too.

Yeah this is definitely an interesting/tricky usecase. It would be very easy to fork std's HashMap to expose this functionality (hashing is done and then basically trusted/passed internally).

I'm not sure if it's the best idea to expose in the std interface, though. iirc it's impossible to cause anything unsafe to happen with a "fake" hash, but it's a big foot-gun, and would introduce a huge "mirror api" of *_with_hash.

Hashmaps are created with a RandomHasher wich generates different hashes for the maps.
You could encapsulate the object in a wrapper that has a value and a precomputed hash and use it in performance relevant cases.

Overriding this is already exposed through the Hasher type argument, though. For instance rustc opts into a faster deterministic hasher for its usecase.

Apologies for reviving an old thread, how would one create a wrapper and precompute the hash? Say if my struct already caches its own hash value, and implements fn hash(&self, &mut Hasher), we still wouldn't be able to make the Hasher finish with the desired value.

In my case it's not because of speed, but because the values that I compute the hash from are thrown away and collisions are acceptable.

You can create a HashMap with a custom BuildHasher that returns a Hasher that doesn't do anything extra in finish().

1 Like

Thanks. It wasn't very intuitive but turns out the custom Hasher needs to derive from Default if you want to use BuildHasherDefault

Curious if you can identify what was unintuitive about it? Is it the way rustdocs are structured? Or the compiler error message?

Note that you don't have to use BuildHasherDefault. You can define your own that doesn't require Default, create an instance of it and use it with, e.g., HashMap in std::collections - Rust.

this is the error message I get if I don't derive Default:

error: no associated item named `default` found for type `std::collections::HashSet<std::boxed::Box<evicter::EvictItem>, std::hash::BuildHasherDefault<evicter::IdentU64Hasher>>` in the current scope
   --> src/evicter/mod.rs:180:32
    |
180 |             let mut item_set = EvictItemSet::default();
    |                                ^^^^^^^^^^^^^^^^^^^^^
    |
    = note: the method `default` exists but the following trait bounds were not satisfied: `std::hash::BuildHasherDefault<evicter::IdentU64Hasher> : std::hash::BuildHasher`

It is mentioned in a sentence in BuildHasherDefault in std::hash - Rust (specifically, "BuildHasherDefault can be used when a type H implements Hasher and Default"), but I wish it's in a more prominent place because I imagine most people just skim over the doc, especially when it comes to auxiliary classes like BuildHasherDefault. For example, maybe at the top where it says:

pub struct BuildHasherDefault<H>(_);

Maybe it should spell out that H needs to be Default + Hasher there.

for reference, this is the analogous Java error:

Gen.java:12: error: type argument Gen.MyHasher is not within bounds of type-variable H
        BuildHasherDefault<MyHasher> b = null;
                           ^
  where H is a type-variable:
    H extends Hasher,Default declared in class Gen.BuildHasherDefault
1 error

for:

class Gen {

    interface Hasher {}
    interface Default{}

    interface BuildHasher {}
    class BuildHasherDefault<H extends Hasher & Default> {}

    class MyHasher implements Hasher {}

    public static void main(String[] args) {
        BuildHasherDefault<MyHasher> b = null;
    }
}

In general, I found that javadoc to be better than rustdoc for skimming (mostly because there's a condensed list of methods near the top vs scattered throughout the page). In this case there are differences because how of the trait bound is declared but the javadoc version is much more readable:

public class Gen.BuildHasherDefault<H extends Gen.Hasher & Gen.Default>

I agree that could be better. In particular, the "note: ..." line does not mention the "missing" Default at all.

To be fair, it's the first sentence in the paragraph, and comes right after the struct declaration :slight_smile:.

The issue is that H doesn't have to implement anything, really - there are several traits implemented for BuildHasherDefault that have no requirements on H. Now, thinking about this, I'm not sure why it was designed that way. To me, it seems like the struct should be pub struct BuildHasherDefault<H: Hasher + Default>(std::marker::PhantomData<H>) because that's pretty much its sole purpose, AFAICT. But I suspect there was some reason it wasn't done that way. Perhaps someone familiar with it can comment.

I don't totally disagree with that, except that's probably the first use case for people just skimming code snippets on rustdoc :stuck_out_tongue:

Yeah, and as I mentioned upthread, I'm not even sure why it was designed that way :confused: