What's the fastest way to store 300 million unique values in a HashSet?

Hey guys,

I have 300+ million unique strings (totaling 8+ GB) and I am storing them in a HashSet for fast contains checks, and it's working fine.

But the inserts are a little slow. When I run the program, it reads all those strings from a file and inserts them into the HashSet, and it takes quite a while to do all the inserts.
My end goal is to have the fastest possible way to check whether a string exists in the set or not. I run millions of contains checks every hour.

I was wondering if there's a way to speed up the HashSet insertion, or maybe another data structure that would fit this use case better.

Currently, it takes around 15-20 minutes to read the file and insert all 300 million strings into the HashSet.

Please note: I am not a professional Rust programmer yet; I'm quite new and still learning haha :slight_smile:

I would appreciate it if anybody could help me.
Thanks.

This sounds like trying to move house using a Peel P50, and asking how to make it go faster. The problem isn't that the P50 is slow, it's that you're using a P50 when you should be using a truck. :stuck_out_tongue:

How are these strings stored, and can you change that? My first instinct for "quick, easy, and likely to work" would be to transfer all those strings into a SQLite database, then use that to perform queries directly. That avoids having to load the entire thing into memory, and SQLite has been rigorously optimised for... (checks) holy cow, it's over 20 years old now!

If you really need to avoid disk access at runtime, you could try loading the whole database into memory first (which will be bounded by how fast your file IO is, which you can't really do anything about), then "open" it in place.
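
Roughly, a lookup against such a database could look like this (a minimal sketch assuming the rusqlite crate and a table strings(value TEXT PRIMARY KEY); the table and file names are just placeholders, not from your setup):

use rusqlite::{params, Connection};

fn main() -> rusqlite::Result<()> {
    // "strings.db" is a placeholder path.
    let conn = Connection::open("strings.db")?;
    // The primary key (or any index) on `value` is what makes this lookup fast.
    let mut stmt = conn.prepare("SELECT 1 FROM strings WHERE value = ?1 LIMIT 1")?;
    let found = stmt.exists(params!["some string to check"])?;
    println!("found: {found}");
    Ok(())
}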

If you can't change the input format, try using HashSet::with_capacity to avoid re-allocating as the set grows. I had a program that needed to ingest gigabytes of XML, and it turned out the overwhelming majority of the load time was down to a single Vec repeatedly re-allocating to grow as the data was read in.
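
For 300 million entries that would look something like the sketch below (note that this one allocation is already several gigabytes on its own):

use std::collections::HashSet;

fn main() {
    // Pre-sizing the set avoids repeated re-allocation and re-hashing while loading.
    let mut set: HashSet<String> = HashSet::with_capacity(300_000_000);
    set.insert("example".to_string());
    println!("{}", set.contains("example"));
}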

As for different data structures, that depends significantly on exactly what the data is, what form it is in, how it is distributed, how you're using it, etc. Like I said, I'd try SQLite first.

5 Likes

If you stick with HashSet, another possibility is to use something like hashbrown's RawTable and to parallelize the hashing of your 8 GB of data, sending each value + hash to a builder thread which does the actual (serial) insertion.

Another possibility is using some stable hasher and storing the hashes (and again using something like RawTable in the building stage).
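
Here's a rough sketch of just the "store only the hashes" half of that idea, using a plain HashSet<u64> instead of RawTable to keep it short. A 64-bit hash per string uses far less memory than the strings themselves, at the cost of a tiny false-positive chance from collisions:

use std::collections::hash_map::RandomState;
use std::collections::HashSet;
use std::hash::BuildHasher;

fn main() {
    // RandomState is only stable within a single run; a hasher with fixed seeds
    // (e.g. from a crate) would also let you precompute the hashes offline.
    let state = RandomState::new();
    let mut hashes: HashSet<u64> = HashSet::with_capacity(1_000);

    for s in ["alpha", "beta", "gamma"] {
        hashes.insert(state.hash_one(s));
    }

    println!("{}", hashes.contains(&state.hash_one("beta")));
}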

5 Likes

Thanks, I will give SQLite a try.
The main reason I want to keep everything in memory is to have fast checks. I can have up to a million string checks in each API call.

Currently, it can perform 1 million checks in less than 5 minutes (single-threaded), which is great. I doubt a database would be able to give similar performance.
But I will run a test with SQLite + using memory and see where it goes.

Again thanks :smiley:

that Peel P50 line tho :joy: :ok_hand:

Another thought I just had if it really is just about existence checks: if many of the checks end up failing, you might want to look at using a bloom filter to probabilistically discard those tests early before moving on to a full check.
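
To make the idea concrete, here's a hand-rolled sketch (a real crate would be more robust; all names here are made up). A Bloom filter can only answer "definitely not present" or "maybe present", so it's useful as a cheap pre-check in front of the real set:

use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hash};

struct Bloom {
    bits: Vec<u64>,
    num_bits: u64,
    num_hashes: u64,
    state: RandomState,
}

impl Bloom {
    /// Roughly 10 bits per expected element gives ~1% false positives.
    fn new(num_bits: u64, num_hashes: u64) -> Self {
        Bloom {
            bits: vec![0; ((num_bits + 63) / 64) as usize],
            num_bits,
            num_hashes,
            state: RandomState::new(),
        }
    }

    // Derive k bit positions from a single 64-bit hash (double hashing).
    fn positions<T: Hash + ?Sized>(&self, item: &T) -> Vec<u64> {
        let h = self.state.hash_one(item);
        let (h1, h2) = (h & 0xffff_ffff, (h >> 32) | 1);
        (0..self.num_hashes)
            .map(|i| h1.wrapping_add(i.wrapping_mul(h2)) % self.num_bits)
            .collect()
    }

    fn insert<T: Hash + ?Sized>(&mut self, item: &T) {
        for p in self.positions(item) {
            self.bits[(p / 64) as usize] |= 1 << (p % 64);
        }
    }

    /// `false` = definitely absent; `true` = possibly present, so check the real set.
    fn maybe_contains<T: Hash + ?Sized>(&self, item: &T) -> bool {
        self.positions(item)
            .into_iter()
            .all(|p| (self.bits[(p / 64) as usize] & (1 << (p % 64))) != 0)
    }
}

fn main() {
    let mut filter = Bloom::new(10 * 1_000, 7); // sized for roughly 1_000 items
    filter.insert("hello");
    assert!(filter.maybe_contains("hello"));
    println!("maybe contains \"goodbye\": {}", filter.maybe_contains("goodbye"));
}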

Oh, and another potential avenue: depending on how you're currently storing the strings, and if you want to keep using HashSet, you might (very big emphasis on might) get faster load performance if you use something like serde with a compact binary serialiser like bincode.
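
Something along these lines (a sketch assuming serde plus the bincode 1.x API; "strings.bin" is just a placeholder file name). The hope is that a compact binary dump parses back faster than the original text file, but as said, you'd have to measure:

use std::collections::HashSet;
use std::fs::File;
use std::io::{BufReader, BufWriter};

fn save(set: &HashSet<String>) -> Result<(), Box<dyn std::error::Error>> {
    // One-off: dump the fully built set to a compact binary file.
    let writer = BufWriter::new(File::create("strings.bin")?);
    bincode::serialize_into(writer, set)?;
    Ok(())
}

fn load() -> Result<HashSet<String>, Box<dyn std::error::Error>> {
    // On start-up: read the set back without re-parsing the original text.
    let reader = BufReader::new(File::open("strings.bin")?);
    Ok(bincode::deserialize_from(reader)?)
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let set: HashSet<String> = ["a", "b", "c"].iter().map(|s| s.to_string()).collect();
    save(&set)?;
    println!("reloaded {} strings", load()?.len());
    Ok(())
}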

5 Likes

Interesting.
I haven't looked at RawTable yet.

Let me check it.

Thanks!

Nice. Bloom filter could do the trick for checks.

I will look into serde. :love_you_gesture:

Thanks a lot, man. :raised_hands:

Sounds like you are running in debug mode. 8 GB for 300 million items means around 25-30 bytes per item.

In this Playground, I insert 1 million 32-byte items in 0.12 s. 300 times that would easily be under a minute. Taking 20+ times longer than that means you are probably not turning on optimizations (i.e. not building with --release).
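
(Not the Playground code itself, just a rough re-creation of the idea; the gap between cargo run and cargo run --release on something like this can be dramatic.)

use std::collections::HashSet;
use std::time::Instant;

fn main() {
    let start = Instant::now();
    let mut set: HashSet<String> = HashSet::with_capacity(1_000_000);
    for i in 0..1_000_000u64 {
        // 32-byte items: zero-padded numbers, just to have something unique.
        set.insert(format!("{i:032}"));
    }
    println!("inserted {} items in {:?}", set.len(), start.elapsed());
}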

7 Likes

I’ve played around with various ways of constructing a large HashSet of random strings to see how performance is influenced. I didn’t write down any results, but if anyone is interested, here are the things I’ve tried out

(warning: with the 300_000_000 value for N, running this code may take up over 20GB of RAM – if you don’t have much RAM, be ready to kill it in case it slows down your machine too much)

/*

[dependencies]
hashbrown = { version = "0.14.0", features = ["raw"] }
rand = "0.8.5"
rayon = "1.7.0"
thread_local = "1.1.7"
typed-arena = "2.0.2"

*/

#![allow(unused)]

const N: usize = 300_000_000; // <- smaller sizes are nicer for testing though…
macro_rules! new {
    ($($t:tt)*) => {
        // choose one:

        $($t)*::with_capacity(N)
        // $($t)*::new()
    }
}

fn main() {
    // choose one:

    // main1();
    // main2();
    // main3();
    // main4();
    main5();
}

use rand::{
    distributions::{Alphanumeric, DistString},
    prelude::*,
};
use rayon::prelude::{IntoParallelIterator, ParallelIterator};
use std::{collections::HashSet, hash::BuildHasher};
use thread_local::ThreadLocal;
use typed_arena::Arena;

fn main1() {
    let mut map = new!(HashSet);
    for i in 0..N {
        if i % 1_000_000 == 0 {
            println!("{i}");
        }
        map.insert(string1());
    }
    dbg!(map.len());
}

fn string1() -> String {
    let mut g = rand::thread_rng();
    let len = g.gen_range(5..50);
    Alphanumeric.sample_string(&mut g, len)
}
fn string2(s: &mut String) {
    s.clear();
    let mut g = rand::thread_rng();
    let len = g.gen_range(5..50);
    Alphanumeric.append_string(&mut g, s, len)
}
fn string3<'a>(a: &'a Arena<u8>, s: &mut String) -> &'a str {
    s.clear();
    let mut g = rand::thread_rng();
    let len = g.gen_range(5..50);
    Alphanumeric.append_string(&mut g, s, len);
    a.alloc_str(s)
}

fn main2() {
    let arena = Arena::new();
    let mut map = new!(HashSet);
    let mut buf = String::new();
    for i in 0..N {
        if i % 1_000_000 == 0 {
            println!("{i}");
        }
        map.insert(string3(&arena, &mut buf));
    }
    dbg!(map.len());
}

fn main3() {
    let mut map = new!(HashSet);
    let (tx, rx) = std::sync::mpsc::channel();

    std::thread::spawn(move || {
        (0..N).into_par_iter().for_each(|_| {
            tx.send(string1()).unwrap();
        })
    });
    for (i, s) in rx.into_iter().enumerate() {
        if i % 1_000_000 == 0 {
            println!("{i}");
        }
        map.insert(s);
    }

    dbg!(map.len());
}

fn main4() {
    let mut map = new!(hashbrown::HashMap);
    let (tx, rx) = std::sync::mpsc::channel();
    let hasher = map.hasher().clone();
    std::thread::scope(|scope| {
        scope.spawn(move || {
            (0..N).into_par_iter().for_each(|_| {
                let s = string1();
                tx.send((hasher.hash_one(&s), s)).unwrap();
            })
        });
        for (i, (h, s)) in rx.into_iter().enumerate() {
            if i % 1_000_000 == 0 {
                println!("{i}");
            }
            let table = map
                .raw_entry_mut()
                .from_key_hashed_nocheck(h, &s)
                .insert(s, ());
        }
    });
    let map: hashbrown::HashSet<_> = map.into();

    dbg!(map.len());
}

fn main5() {
    let arena = &ThreadLocal::new();
    let mut map = new!(hashbrown::HashMap);
    let (tx, rx) = std::sync::mpsc::channel();
    let hasher = map.hasher().clone();
    std::thread::scope(|scope| {
        scope.spawn(move || {
            (0..N)
                .into_par_iter()
                .for_each_with(String::new(), |buf, _| {
                    let s = string3(arena.get_or_default(), buf);
                    tx.send((hasher.hash_one(&s), s)).unwrap();
                })
        });
        for (i, (h, s)) in rx.into_iter().enumerate() {
            if i % 1_000_000 == 0 {
                println!("{i}");
            }
            let table = map
                .raw_entry_mut()
                .from_key_hashed_nocheck(h, &s)
                .insert(s, ());
        }
    });
    let map: hashbrown::HashSet<_> = map.into();

    dbg!(map.len());
}

A relevant take-away is that, yes, separating out and parallelizing the hashing can make a difference, but creating the strings is also a significant portion of the effort, so don’t expect too-crazy improvements. (E.g. you could benchmark, for your use case, how long the program takes if instead of inserting into a map you simply throw away the values, to get a lower bound on how long everything besides building the map takes.)

Also, on my laptop, the times never came close to even 10 minutes; it was more like 3 minutes for the worst-case naive approach, and 1 minute for the best case I had. So, as @H2CO3 just pointed out, too, if you haven’t already, make sure that optimizations are actually turned on when you measure your program’s run time :slight_smile: Also keep an eye on your RAM usage; if the program starts swapping, that might slow it down a lot, too.

13 Likes

Your numbers are probably way off. Here's an SQLite database doing 1 million checks (a 50/50 mix of hits and misses) in 3 seconds.

7 Likes

If the strings are read from a file on disk, you could skip most of the overhead using a memmap:

use std::collections::HashSet;
use std::fs::File;
use memmap2::Mmap;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("custom_1988_2020.csv")?;
    let memmap = unsafe { Mmap::map(&file)? };
    let mut set: HashSet<&[u8]> = HashSet::with_capacity(200_000_000);
    memmap.split(|c| *c == b'\n')
        .for_each(|v| {
            set.insert(v);
        });

    println!("length: {}", set.len());
    Ok(())
}

I was surprised my maximum insert rate was only about 250k entries per second on an Apple M1, though.

If you only want the cardinality, or if a low false-positive rate is acceptable when checking whether a string is already in the set, HyperLogLog is probably a much better data structure to use.

It’s kind of like a bloom filter, but is designed for statistically accurate cardinality estimations with very large sets using a tiny amount of memory at runtime. You can serialize the state to disk in a few hundred KiB, so it can load instantly on start up.

1 Like

Might also be worth considering Redis or a memcached cluster, particularly if the data will get bigger than the memory of a single machine.

Just to pile on one more idea (because I've had success with this on a slightly smaller number of entries): if creating your data structure doesn't need to be fast, maybe you could put it in a sorted Vec? Then you could look up via binary search, which can be surprisingly fast.
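
A minimal sketch of that idea (the small literal list just stands in for whatever you'd read from the file):

fn contains(sorted: &[String], needle: &str) -> bool {
    // Binary search over a sorted slice: O(log n) with good cache behaviour.
    sorted.binary_search_by(|s| s.as_str().cmp(needle)).is_ok()
}

fn main() {
    let mut strings: Vec<String> =
        ["banana", "apple", "cherry"].iter().map(|s| s.to_string()).collect();

    // Build once: sort, and dedup since set semantics require uniqueness.
    strings.sort_unstable();
    strings.dedup();

    println!("{}", contains(&strings, "apple"));  // true
    println!("{}", contains(&strings, "durian")); // false
}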

2 Likes

I thought HyperLogLog only gave you an estimate of how many total unique items there are. I've never heard of it being used in lieu of a bloom filter before...

The insert operation is conditional (update the bucket only if the value is greater than what already exists). This condition can be exposed as a new operation that can answer whether an item has never been seen before. The answer may have false positives, i.e. “this item could have been seen before”.

I don’t think I’ve ever seen an analysis of the error rate for false positives. It might be too high to be practical.

But anyway, this should all sound a lot like a bloom filter.

Reading this thread, I remembered an older blog post that has been mentioned here before. It's about modeling string sets as FSTs: Index 1,600,000,000 Keys with Automata and Rust - Andrew Gallant's Blog

Maybe this could also be used for your problem

1 Like

This would be O(n log n) to build the Vec and O(log n) for each lookup, while a hash map is O(n) to build and has constant complexity for lookups. The only case where the Vec might be faster is for sorted lookups, because of CPU caching.

Maybe this is a use case for RocksDB or sled.

Well, sometimes big O isn't the only thing that factors in; constants can matter (as might locality, as you said). As mentioned, I had such a case where, for several million entries, the Vec was just faster, by a serious amount.