Building a hash map is cheaper than doing lookups?

I published a quick hashmap benchmark here:

https://github.com/matklad/hashbench

fn main() {
    let n = 50_000_000;
    let hm1 = {
        let _t = timeit("build std::collections::HashSet");
        numbers(n).collect::<std::collections::HashSet<_>>()
    };
    let hm2 = {
        let _t = timeit("build rustc_hash::FxHashSet");
        numbers(n).collect::<rustc_hash::FxHashSet<_>>()
    };
    let hm3 = {
        let _t = timeit("build ahash::HashSet");
        numbers(n).collect::<ahash::AHashSet<_>>()
    };
    assert!(hm1.len() == hm2.len());
    assert!(hm2.len() == hm3.len());
    eprintln!();

    {
        let _t = timeit("lookup std::collections::HasSet");
        assert!(numbers(n).all(|it| hm1.contains(&it)))
    }
    {
        let _t = timeit("lookup rustc_hash::FxHashSet");
        assert!(numbers(n).all(|it| hm2.contains(&it)))
    }
    {
        let _t = timeit("lookup ahash::AHashSet");
        assert!(numbers(n).all(|it| hm3.contains(&it)))
    }
}
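
numbers and timeit are small helpers from the linked repo that aren't shown here: timeit is reproduced verbatim in the minimized example further down, and numbers maps each i to (i*i) ^ i (see the discussion of varying it below). A minimal sketch of numbers consistent with that, using wrapping_mul so it also runs cleanly in debug builds:

fn numbers(n: u32) -> impl Iterator<Item = u32> {
    // (i*i) ^ i, per the later discussion; wrapping_mul avoids overflow panics in debug builds
    (0..n).map(|i| i.wrapping_mul(i) ^ i)
}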

Here's what the output looks like (checked on two CPUs):

build std::collections::HashSet   4.35s *
build rustc_hash::FxHashSet       3.76s
build ahash::HashSet              3.92s

lookup std::collections::HashSet  5.33s *
lookup rustc_hash::FxHashSet      1.89s
lookup ahash::AHashSet            2.78s

Can anyone explain the * entries? How come lookup is costlier than construction?

I just added

[profile.release]
lto = "fat"
codegen-units = 1

to the Cargo.toml, which changes the benchmark results completely:

build std::collections::HashSet: 7.19s
build rustc_hash::FxHashSet:     7.44s
build ahash::HashSet:            7.61s

lookup std::collections::HashSet: 6.26s
lookup rustc_hash::FxHashSet:    3.52s
lookup ahash::AHashSet:          3.69s

Maybe code from std gets inlined differently than code from regular dependencies.

I got different results, even without modifying the Cargo.toml:

build std::collections::HashSet   7.52s
build rustc_hash::FxHashSet       5.13s
build ahash::HashSet              4.85s

lookup std::collections::HashSet  6.80s
lookup rustc_hash::FxHashSet      2.77s
lookup ahash::AHashSet            4.38s

For completeness, my results with LTO:

build std::collections::HashSet   3.50s
build rustc_hash::FxHashSet       1.95s
build ahash::HashSet              2.11s

lookup std::collections::HashSet  5.11s
lookup rustc_hash::FxHashSet      1.63s
lookup ahash::AHashSet            1.91s

Leaving out codegen-units = 1 changes the benchmarks again for me:

build std::collections::HashSet: 7.83s
build rustc_hash::FxHashSet:     7.64s
build ahash::HashSet:            7.65s

lookup std::collections::HashSet: 9.06s
lookup rustc_hash::FxHashSet:    3.43s
lookup ahash::AHashSet:          3.73s

So apparently codegen-units = 1 is important on my machine to make the benchmarks behave as expected.

> How come lookup is costlier than construction?

Cache locality? While the table grows, half of the time you are operating on a table that is at most half its final size. When doing lookups, you only ever operate on the full-size table.
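
A quick way to see the growth steps this hypothesis has in mind (a sketch using plain inserts; note that collect, as discussed below, reserves the full capacity up front, so this only illustrates the incremental-insert case):

use std::collections::HashSet;

fn main() {
    let mut set = HashSet::new();
    let mut last_cap = 0;
    for i in 0..50_000_000u32 {
        set.insert(i);
        if set.capacity() != last_cap {
            // report each resize: most inserts land in tables much smaller than the final one
            last_cap = set.capacity();
            eprintln!("{:>9} elements -> capacity {}", set.len(), last_cap);
        }
    }
}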

collect should allocate the full capacity up front, so I don't think that's the explanation. But it might be that operating on a mostly-empty HashSet is faster in some cases than operating on a mostly-full one.

You still need to insert all the elements into the final-size table anyway, so cache locality can't be the explanation.

This is plausible in theory, but it doesn't seem to be the case: if I pre-allocate the hash maps with capacity 4*n, so that they stay mostly empty, I still see the effect.
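
Roughly like this (a sketch for the std set; the exact code in the repo may differ, and the Fx/ahash variants would use with_capacity_and_hasher):

use std::collections::HashSet;

// Keep the table mostly empty by reserving ~4*n slots before inserting the n elements.
fn build_prealloc(n: u32, input: impl Iterator<Item = u32>) -> HashSet<u32> {
    let mut set = HashSet::with_capacity(4 * n as usize);
    set.extend(input);
    set
}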

They are all the same underlying hash table (std's HashSet, which uses hashbrown), just with different hash functions -- is that right? FWIW, indexmap 1.6 follows std HashSet's trend and indexmap 1.4 (the last version not based on hashbrown) does not.

In this case (hashbrown) it actually doesn't: extend only reserves half of the iterator's lower size hint, betting the cost of a half-size relocation on at least half the entries being duplicates (see map.rs.html in the hashbrown source).

But that still doesn't answer the OP :grinning_face_with_smiling_eyes:

When starting from an empty set, as HashSet::from_iter does, it reserves the full capacity, not half.

[EDITED; somehow I was blind]

You're right, my bad

I imagine that inserting into a mostly empty hash set is pretty cheap, whereas looking up always goes into a pretty full hash map, which will require more comparisons, right?

I did a little profiling, and found that contains_key has two clear hotspots. Here's the std hasher:

       │     hashbrown::raw::RawTable<T>::probe_seq:
       │       and       %rdx,%rcx
       │     <hashbrown::raw::ProbeSeq as core::iter::traits::iterator::Iterator>::next:
       │       lea       0x10(%rcx),%rdi
  0.00 │       and       %rdx,%rdi
       │     core::intrinsics::copy_nonoverlapping:
  0.03 │       movdqu    (%r11,%rcx,1),%xmm1
       │     core::core_arch::simd::i8x16::new:
 28.65 │       movd      %eax,%xmm0
...
       │     core::cmp::impls::<impl core::cmp::PartialEq for u32>::eq:
  0.01 │       cmp       -0x4(%rsi),%r10d
       │     hashbrown::raw::RawTable<T>::find:
 65.40 │     → je        8e5a <hashbrown::map::HashMap<K,V,S>::contains_key+0x27a>

Ryzen counters are imprecise, allowing skid before the perf sampling interrupt, so I think the real blame is probably on the instructions before, movdqu (%r11,%rcx,1),%xmm1 and cmp -0x4(%rsi),%r10d, which both access memory.

Regardless, the first part corresponds to the SIMD search for the partial hash (7 bits), and the second is the actual item comparison after that partial hash has been matched.

The fx and ahash versions have the same hotspots in different proportions:

       │     hashbrown::raw::RawTable<T>::probe_seq:                                                
       │       and       %rbx,%rcx                                                                  
       │     <hashbrown::raw::ProbeSeq as core::iter::traits::iterator::Iterator>::next:            
  0.47 │       lea       0x10(%rcx),%rdi                                                            
       │       and       %rbx,%rdi                                                                  
       │     core::intrinsics::copy_nonoverlapping:                                                 
  0.01 │       movdqu    (%r11,%rcx,1),%xmm1                                                        
       │     core::core_arch::simd::i8x16::new:                                                     
  9.20 │       movd      %eax,%xmm0                                                                 
...
       │     core::cmp::impls::<impl core::cmp::PartialEq for u32>::eq:                             
       │       cmp       -0x4(%rdx),%r8d                                                            
       │     hashbrown::raw::RawTable<T>::find:                                                     
 87.26 │     → jne       8f2b <hashbrown::map::HashMap<K,V,S>::contains_key+0xbb>                   
       │     hashbrown::raw::RawTable<T>::probe_seq:                                                
       │       mov       %rcx,%rsi                                                                  
       │       and       %rax,%rsi                                                                  
       │     hashbrown::raw::h2:                                                                    
       │       shr       $0x39,%rax                                                                 
       │     <hashbrown::raw::ProbeSeq as core::iter::traits::iterator::Iterator>::next:            
       │       lea       0x10(%rsi),%rdi                                                            
       │       and       %rcx,%rdi                                                                  
       │     core::intrinsics::copy_nonoverlapping:                                                 
       │       movdqu    (%r10,%rsi,1),%xmm1                                                        
       │     core::core_arch::simd::i8x16::new:                                                     
  6.98 │       movd      %eax,%xmm0                                                                 
...
       │     core::cmp::impls::<impl core::cmp::PartialEq for u32>::eq:                             
  0.02 │       cmp       -0x4(%rdx),%r12d                                                           
       │     hashbrown::raw::RawTable<T>::find:                                                     
 89.30 │     → je        90fc <hashbrown::map::HashMap<K,V,S>::contains_key+0x15c>                  

Here are the raw sample numbers for each hasher on the two hotspots (the SIMD probe first, then the key comparison):

  • std: 5991 and 13684
  • fx: 781 and 7413
  • ahash: 758 and 9701

Wild guess: maybe the std hasher is not actually distributing the hash bits as well as the others? I'm not sure if there are any hooks to get real statistics on this.

Note that the hotspots are not on the construction of the hash value, which is where we probably would have expected to see the difference between hashers.

I think that's true -- early insertions go into a mostly-empty table and so hit fewer collisions than later ones (over the whole build the load factor averages roughly half its final value), whereas every lookup probes the table at its full, final load factor. However, that effect should be about the same for every hasher, if they're all well distributed.

That's interesting. It might make sense to vary the numbers function. It's (i*i) ^ i in my bench; using i*i*i gives something similar, but just i is very curious:

build rustc_hash::FxHashSet       2.22s
build std::collections::HashSet   3.98s
build ahash::HashSet              2.52s

lookup rustc_hash::FxHashSet      5.52s
lookup std::collections::HashSet  5.68s
lookup ahash::AHashSet            5.58s

To test the hash bits, I tried this function with the hasher() from each map:

use std::hash::{BuildHasher, Hash, Hasher};

fn print_7bit_histogram(hash_builder: &impl BuildHasher, input: impl Iterator<Item = u32>) {
    let mut hist = [0usize; 128];
    for val in input {
        let mut state = hash_builder.build_hasher();
        val.hash(&mut state);
        let hash = state.finish();
        hist[(hash >> (64 - 7)) as usize] += 1;
    }
    eprintln!("{:?}", hist);
}
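
Called roughly like this (hm1/hm2/hm3 and n as in the original post; the exact invocation may differ):

print_7bit_histogram(hm1.hasher(), numbers(n));
print_7bit_histogram(hm2.hasher(), numbers(n));
print_7bit_histogram(hm3.hasher(), numbers(n));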

All the hashers look fairly uniform, around 390k in each bin (the expected value is 50M / 128 = 390,625).

However, I also noticed that if I change that to use the lower 7 bits, hash & 127, fx looks bad -- exactly 781,250 in every even slot and 0 in the odds. (Note that (i*i)^i is always even too!) This would affect the initial bin used for probing the hash map, but this doesn't seem to have hurt its performance overall.
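
As a standalone check of just the low bit (the histogram above implies it is always 0 for these inputs, at least with the rustc_hash version used here):

use std::hash::{BuildHasher, BuildHasherDefault, Hash, Hasher};

fn main() {
    let bh = BuildHasherDefault::<rustc_hash::FxHasher>::default();
    let all_even = (0..1_000_000u32)
        .map(|i| i.wrapping_mul(i) ^ i) // the bench's (i*i) ^ i -- always even
        .all(|v| {
            let mut state = bh.build_hasher();
            v.hash(&mut state);
            state.finish() % 2 == 0 // low bit of the FxHash output
        });
    eprintln!("all FxHash outputs even: {}", all_even);
}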

Minimized the example to just FxHash:

use std::time::Instant;

const N: u32 = 50_000_000;

#[inline(never)]
fn build() -> rustc_hash::FxHashSet<u32> {
    numbers(N).collect()
}

#[inline(never)]
fn lookup(hm: &rustc_hash::FxHashSet<u32>) {
    assert!(numbers(N).all(|it| hm.contains(&it)))
}

fn main() {
    let hm = {
        let _t = timeit("build rustc_hash::FxHashSet");
        build()
    };

    {
        let _t = timeit("lookup rustc_hash::FxHashSet");
        lookup(&hm)
    }
}

fn numbers(n: u32) -> impl Iterator<Item = u32> {
    (0..n)
        // uncomment rev to make lookup slow
        // .rev()
        .map(|i| i) // identity map; the original bench used (i*i) ^ i here
}

fn timeit(label: &'static str) -> impl Drop {
    struct Timer(&'static str, Instant);
    impl Drop for Timer {
        fn drop(&mut self) {
            eprintln!("{:<33} {:.2?}", self.0, self.1.elapsed());
        }
    }
    Timer(label, Instant::now())
}

The perf stat output for this is interesting:

λ perf stat ./target/release/hashbench
build rustc_hash::FxHashSet       1.77s
lookup rustc_hash::FxHashSet      5.25s

 Performance counter stats for './target/release/hashbench':

          7,023.46 msec task-clock:u              #    1.000 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
            81,992      page-faults:u             #    0.012 M/sec                  
    33,287,301,585      cycles:u                  #    4.739 GHz                    
         8,799,824      stalled-cycles-frontend:u #    0.03% frontend cycles idle   
            38,151      stalled-cycles-backend:u  #    0.00% backend cycles idle    
     5,150,625,526      instructions:u            #    0.15  insn per cycle         
                                                  #    0.00  stalled cycles per insn
       600,200,738      branches:u                #   85.457 M/sec                  
            95,657      branch-misses:u           #    0.02% of all branches        

       7.023845859 seconds time elapsed

       6.963747000 seconds user
       0.059997000 seconds sys

vs

λ perf stat ./target/release/hashbench
build rustc_hash::FxHashSet       1.73s
lookup rustc_hash::FxHashSet      1.43s

 Performance counter stats for './target/release/hashbench':

          3,162.77 msec task-clock:u              #    1.000 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
            81,990      page-faults:u             #    0.026 M/sec                  
    15,406,399,075      cycles:u                  #    4.871 GHz                    
         8,110,604      stalled-cycles-frontend:u #    0.05% frontend cycles idle   
           124,331      stalled-cycles-backend:u  #    0.00% backend cycles idle    
     5,150,621,328      instructions:u            #    0.33  insn per cycle         
                                                  #    0.00  stalled cycles per insn
       600,196,750      branches:u                #  189.769 M/sec                  
            91,812      branch-misses:u           #    0.02% of all branches        

       3.163312268 seconds time elapsed

       3.102104000 seconds user
       0.061002000 seconds sys

The only metrics that differ are cycles and elapsed time; the number of executed instructions is the same. So we are just executing the same instructions, only more slowly (0.15 vs 0.33 instructions per cycle). Does the CPU fail to parallelize consecutive iterations of the loop?

Seems to be this: collecting into an intermediate Vec fixes rev-sensitivity:

fn numbers(n: u32) -> impl Iterator<Item = u32> {
    (0..n)
        // uncomment rev to make lookup slow
        // .rev()
        .map(|i| i)
        .collect::<Vec<u32>>()
        .into_iter()
}

Published the repro here.

I still don't understand what's going on; let's try summoning @Amanieu :slight_smile: