Building a hash map is cheaper than doing lookups?

I published a quick hashmap benchmark here:

https://github.com/matklad/hashbench

fn main() {
    let n = 50_000_000;
    let hm1 = {
        let _t = timeit("build std::collections::HashSet");
        numbers(n).collect::<std::collections::HashSet<_>>()
    };
    let hm2 = {
        let _t = timeit("build rustc_hash::FxHashSet");
        numbers(n).collect::<rustc_hash::FxHashSet<_>>()
    };
    let hm3 = {
        let _t = timeit("build ahash::HashSet");
        numbers(n).collect::<ahash::AHashSet<_>>()
    };
    assert!(hm1.len() == hm2.len());
    assert!(hm2.len() == hm3.len());
    eprintln!();

    {
        let _t = timeit("lookup std::collections::HasSet");
        assert!(numbers(n).all(|it| hm1.contains(&it)))
    }
    {
        let _t = timeit("lookup rustc_hash::FxHashSet");
        assert!(numbers(n).all(|it| hm2.contains(&it)))
    }
    {
        let _t = timeit("lookup ahash::AHashSet");
        assert!(numbers(n).all(|it| hm3.contains(&it)))
    }
}
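
numbers and timeit are small helpers from the linked repo that aren't shown here: timeit is reproduced verbatim in the minimized example further down, and numbers maps each i to (i*i) ^ i (see the discussion of varying it below). A minimal sketch of numbers consistent with that, using wrapping_mul so it also runs cleanly in debug builds:

fn numbers(n: u32) -> impl Iterator<Item = u32> {
    // (i*i) ^ i, per the later discussion; wrapping_mul avoids overflow panics in debug builds
    (0..n).map(|i| i.wrapping_mul(i) ^ i)
}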

Here's what the output looks like (checked on two CPUs):

build std::collections::HashSet   4.35s *
build rustc_hash::FxHashSet       3.76s
build ahash::HashSet              3.92s

lookup std::collections::HashSet  5.33s *
lookup rustc_hash::FxHashSet      1.89s
lookup ahash::AHashSet            2.78s

Can anyone explain the * entries? How come lookup is costlier than construction?

I just added

[profile.release]
lto = "fat"
codegen-units = 1

to the Cargo.toml, which changes the benchmark results completely:

build std::collections::HashSet: 7.19s
build rustc_hash::FxHashSet:     7.44s
build ahash::HashSet:            7.61s

lookup std::collections::HashSet: 6.26s
lookup rustc_hash::FxHashSet:    3.52s
lookup ahash::AHashSet:          3.69s

Maybe code from std gets inlined differently than code from regular dependencies.

I got different results, even without modifying the Cargo.toml:

build std::collections::HashSet   7.52s
build rustc_hash::FxHashSet       5.13s
build ahash::HashSet              4.85s

lookup std::collections::HashSet  6.80s
lookup rustc_hash::FxHashSet      2.77s
lookup ahash::AHashSet            4.38s

For completeness, my results with LTO:

build std::collections::HashSet   3.50s
build rustc_hash::FxHashSet       1.95s
build ahash::HashSet              2.11s

lookup std::collections::HashSet  5.11s
lookup rustc_hash::FxHashSet      1.63s
lookup ahash::AHashSet            1.91s

Leaving out codegen-units = 1 changes the benchmarks again for me:

build std::collections::HashSet: 7.83s
build rustc_hash::FxHashSet:     7.64s
build ahash::HashSet:            7.65s

lookup std::collections::HashSet: 9.06s
lookup rustc_hash::FxHashSet:    3.43s
lookup ahash::AHashSet:          3.73s

So apparently codegen-units = 1 is important on my machine to make the benchmarks behave as expected.

> How come lookup is costlier than construction?

Cache locality? While the table grows, half of the time you are operating on a table that is at most half its final size. When doing lookups, you only ever operate on the full-size table.
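
A quick way to see the growth steps this hypothesis has in mind (a sketch using plain inserts; note that collect, as discussed below, reserves the full capacity up front, so this only illustrates the incremental-insert case):

use std::collections::HashSet;

fn main() {
    let mut set = HashSet::new();
    let mut last_cap = 0;
    for i in 0..50_000_000u32 {
        set.insert(i);
        if set.capacity() != last_cap {
            // report each resize: most inserts land in tables much smaller than the final one
            last_cap = set.capacity();
            eprintln!("{:>9} elements -> capacity {}", set.len(), last_cap);
        }
    }
}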

collect should allocate the full capacity up front, so I don't think that's the explanation. But it might be that operating on a mostly-empty HashSet is faster in some cases than operating on a mostly-full one.

You still need to insert all the elements into the final-size table anyway, so cache locality can't be the explanation.

This is plausible in theory, but it doesn't seem to be the case: if I pre-allocate the hash maps with capacity 4*n, so that they stay mostly empty, I still see the effect.
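
Roughly like this (a sketch for the std set; the exact code in the repo may differ, and the Fx/ahash variants would use with_capacity_and_hasher):

use std::collections::HashSet;

// Keep the table mostly empty by reserving ~4*n slots before inserting the n elements.
fn build_prealloc(n: u32, input: impl Iterator<Item = u32>) -> HashSet<u32> {
    let mut set = HashSet::with_capacity(4 * n as usize);
    set.extend(input);
    set
}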

They are all the same underlying hash table (std's HashSet, which uses hashbrown), just with different hash functions -- is that right? FWIW, indexmap 1.6 follows std HashSet's trend and indexmap 1.4 (the last version not based on hashbrown) does not.

In this case (hashbrown) it actually doesn't: extend only reserves half of the iterator's lower size hint, betting the cost of a half-size relocation on at least half the entries being duplicates (see map.rs.html in the hashbrown source).

But that still doesn't answer the OP :grinning_face_with_smiling_eyes:

When starting from an empty set, as HashSet::from_iter does, it reserves the full capacity, not half.

[EDITED; somehow I was blind]

You're right, my bad

I imagine that inserting into a mostly empty hash set is pretty cheap, whereas looking up always goes into a pretty full hash map, which will require more comparisons, right?

I did a little profiling, and found that contains_key has two clear hotspots. Here's the std hasher:

       │     hashbrown::raw::RawTable<T>::probe_seq:
       │       and       %rdx,%rcx
       │     <hashbrown::raw::ProbeSeq as core::iter::traits::iterator::Iterator>::next:
       │       lea       0x10(%rcx),%rdi
  0.00 │       and       %rdx,%rdi
       │     core::intrinsics::copy_nonoverlapping:
  0.03 │       movdqu    (%r11,%rcx,1),%xmm1
       │     core::core_arch::simd::i8x16::new:
 28.65 │       movd      %eax,%xmm0
...
       │     core::cmp::impls::<impl core::cmp::PartialEq for u32>::eq:
  0.01 │       cmp       -0x4(%rsi),%r10d
       │     hashbrown::raw::RawTable<T>::find:
 65.40 │     → je        8e5a <hashbrown::map::HashMap<K,V,S>::contains_key+0x27a>

Ryzen counters are imprecise, allowing skid before the perf sampling interrupt, so I think the real blame is probably on the instructions before, movdqu (%r11,%rcx,1),%xmm1 and cmp -0x4(%rsi),%r10d, which both access memory.

Regardless, the first part corresponds to the SIMD search for the partial hash (7 bits), and the second is the actual item comparison after that partial hash has been matched.

The fx and ahash versions have the same hotspots in different proportions:

       │     hashbrown::raw::RawTable<T>::probe_seq:                                                
       │       and       %rbx,%rcx                                                                  
       │     <hashbrown::raw::ProbeSeq as core::iter::traits::iterator::Iterator>::next:            
  0.47 │       lea       0x10(%rcx),%rdi                                                            
       │       and       %rbx,%rdi                                                                  
       │     core::intrinsics::copy_nonoverlapping:                                                 
  0.01 │       movdqu    (%r11,%rcx,1),%xmm1                                                        
       │     core::core_arch::simd::i8x16::new:                                                     
  9.20 │       movd      %eax,%xmm0                                                                 
...
       │     core::cmp::impls::<impl core::cmp::PartialEq for u32>::eq:                             
       │       cmp       -0x4(%rdx),%r8d                                                            
       │     hashbrown::raw::RawTable<T>::find:                                                     
 87.26 │     → jne       8f2b <hashbrown::map::HashMap<K,V,S>::contains_key+0xbb>                   
       │     hashbrown::raw::RawTable<T>::probe_seq:                                                
       │       mov       %rcx,%rsi                                                                  
       │       and       %rax,%rsi                                                                  
       │     hashbrown::raw::h2:                                                                    
       │       shr       $0x39,%rax                                                                 
       │     <hashbrown::raw::ProbeSeq as core::iter::traits::iterator::Iterator>::next:            
       │       lea       0x10(%rsi),%rdi                                                            
       │       and       %rcx,%rdi                                                                  
       │     core::intrinsics::copy_nonoverlapping:                                                 
       │       movdqu    (%r10,%rsi,1),%xmm1                                                        
       │     core::core_arch::simd::i8x16::new:                                                     
  6.98 │       movd      %eax,%xmm0                                                                 
...
       │     core::cmp::impls::<impl core::cmp::PartialEq for u32>::eq:                             
  0.02 │       cmp       -0x4(%rdx),%r12d                                                           
       │     hashbrown::raw::RawTable<T>::find:                                                     
 89.30 │     → je        90fc <hashbrown::map::HashMap<K,V,S>::contains_key+0x15c>                  

Here are the raw sample numbers for each hasher on the two hotspots (the SIMD probe first, then the key comparison):

  • std: 5991 and 13684
  • fx: 781 and 7413
  • ahash: 758 and 9701

Wild guess: maybe the std hasher is not actually distributing the hash bits as well as the others? I'm not sure if there are any hooks to get real statistics on this.

Note that the hotspots are not on the construction of the hash value, which is where we probably would have expected to see the difference between hashers.

I think that's true -- early insertions go into a mostly-empty table and so hit fewer collisions than later ones (over the whole build the load factor averages roughly half its final value), whereas every lookup probes the table at its full, final load factor. However, that effect should be about the same for every hasher, if they're all well distributed.

That's interesting. It might make sense to vary the numbers function. It's (i*i) ^ i in my bench; using i*i*i gives something similar, but just i is very curious:

build rustc_hash::FxHashSet       2.22s
build std::collections::HashSet   3.98s
build ahash::HashSet              2.52s

lookup rustc_hash::FxHashSet      5.52s
lookup std::collections::HashSet  5.68s
lookup ahash::AHashSet            5.58s

To test the hash bits, I tried this function with the hasher() from each map:

use std::hash::{BuildHasher, Hash, Hasher};

fn print_7bit_histogram(hash_builder: &impl BuildHasher, input: impl Iterator<Item = u32>) {
    let mut hist = [0usize; 128];
    for val in input {
        let mut state = hash_builder.build_hasher();
        val.hash(&mut state);
        let hash = state.finish();
        hist[(hash >> (64 - 7)) as usize] += 1;
    }
    eprintln!("{:?}", hist);
}
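
Called roughly like this (hm1/hm2/hm3 and n as in the original post; the exact invocation may differ):

print_7bit_histogram(hm1.hasher(), numbers(n));
print_7bit_histogram(hm2.hasher(), numbers(n));
print_7bit_histogram(hm3.hasher(), numbers(n));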

All the hashers look fairly uniform, around 390k in each bin (the expected value is 50M / 128 = 390,625).

However, I also noticed that if I change that to use the lower 7 bits, hash & 127, fx looks bad -- exactly 781,250 in every even slot and 0 in the odds. (Note that (i*i)^i is always even too!) This would affect the initial bin used for probing the hash map, but this doesn't seem to have hurt its performance overall.
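
As a standalone check of just the low bit (the histogram above implies it is always 0 for these inputs, at least with the rustc_hash version used here):

use std::hash::{BuildHasher, BuildHasherDefault, Hash, Hasher};

fn main() {
    let bh = BuildHasherDefault::<rustc_hash::FxHasher>::default();
    let all_even = (0..1_000_000u32)
        .map(|i| i.wrapping_mul(i) ^ i) // the bench's (i*i) ^ i -- always even
        .all(|v| {
            let mut state = bh.build_hasher();
            v.hash(&mut state);
            state.finish() % 2 == 0 // low bit of the FxHash output
        });
    eprintln!("all FxHash outputs even: {}", all_even);
}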

Minimized the example to just FxHash:

use std::time::Instant;

const N: u32 = 50_000_000;

#[inline(never)]
fn build() -> rustc_hash::FxHashSet<u32> {
    numbers(N).collect()
}

#[inline(never)]
fn lookup(hm: &rustc_hash::FxHashSet<u32>) {
    assert!(numbers(N).all(|it| hm.contains(&it)))
}

fn main() {
    let hm = {
        let _t = timeit("build rustc_hash::FxHashSet");
        build()
    };

    {
        let _t = timeit("lookup rustc_hash::FxHashSet");
        lookup(&hm)
    }
}

fn numbers(n: u32) -> impl Iterator<Item = u32> {
    (0..n)
        // uncomment rev to make lookup slow
        // .rev()
        .map(|i| i) // identity map; the original bench used (i*i) ^ i here
}

fn timeit(label: &'static str) -> impl Drop {
    struct Timer(&'static str, Instant);
    impl Drop for Timer {
        fn drop(&mut self) {
            eprintln!("{:<33} {:.2?}", self.0, self.1.elapsed());
        }
    }
    Timer(label, Instant::now())
}

The perf stat output for this is interesting:

λ perf stat ./target/release/hashbench
build rustc_hash::FxHashSet       1.77s
lookup rustc_hash::FxHashSet      5.25s

 Performance counter stats for './target/release/hashbench':

          7,023.46 msec task-clock:u              #    1.000 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
            81,992      page-faults:u             #    0.012 M/sec                  
    33,287,301,585      cycles:u                  #    4.739 GHz                    
         8,799,824      stalled-cycles-frontend:u #    0.03% frontend cycles idle   
            38,151      stalled-cycles-backend:u  #    0.00% backend cycles idle    
     5,150,625,526      instructions:u            #    0.15  insn per cycle         
                                                  #    0.00  stalled cycles per insn
       600,200,738      branches:u                #   85.457 M/sec                  
            95,657      branch-misses:u           #    0.02% of all branches        

       7.023845859 seconds time elapsed

       6.963747000 seconds user
       0.059997000 seconds sys

vs

λ perf stat ./target/release/hashbench
build rustc_hash::FxHashSet       1.73s
lookup rustc_hash::FxHashSet      1.43s

 Performance counter stats for './target/release/hashbench':

          3,162.77 msec task-clock:u              #    1.000 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
            81,990      page-faults:u             #    0.026 M/sec                  
    15,406,399,075      cycles:u                  #    4.871 GHz                    
         8,110,604      stalled-cycles-frontend:u #    0.05% frontend cycles idle   
           124,331      stalled-cycles-backend:u  #    0.00% backend cycles idle    
     5,150,621,328      instructions:u            #    0.33  insn per cycle         
                                                  #    0.00  stalled cycles per insn
       600,196,750      branches:u                #  189.769 M/sec                  
            91,812      branch-misses:u           #    0.02% of all branches        

       3.163312268 seconds time elapsed

       3.102104000 seconds user
       0.061002000 seconds sys

The only metrics that differ are cycles and elapsed time; the number of executed instructions is the same. So we are just executing the same instructions, only more slowly (0.15 vs 0.33 instructions per cycle). Does the CPU fail to parallelize consecutive iterations of the loop?

Seems to be this: collecting into an intermediate Vec fixes rev-sensitivity:

fn numbers(n: u32) -> impl Iterator<Item = u32> {
    (0..n)
        // uncomment rev to make lookup slow
        // .rev()
        .map(|i| i)
        .collect::<Vec<u32>>()
        .into_iter()
}

Published the repro here.

I still don't understand what's going on; let's try summoning @Amanieu :slight_smile: