Reading a 2 GB file into memory, and fast lookups

Any micro-optimization will do; I need this to be as fast as possible. It's a simple application. Is HashSet good for lookups? I'm searching a giant file.

let file_content = fs::read_to_string(PATH)?;

// this takes too long, 2 GB
let items: HashSet<String> = file_content
    .lines()
    .filter(|x| !x.is_empty())
    .map(|x| x.trim_end_matches('\r').to_string())
    .collect();


// This runs inside a thread pool: I'm generating items and comparing them
// here (100% CPU usage). I need this to be optimized.

// strg is a String; it can be borrowed as a &str via .as_str()
if items.contains(&strg) {
    println!("{strg} valid match");
}

HashSet uses a somewhat expensive DoS-resistant hash algorithm by default. Consider using a faster hash like ahash or rustc-hash, especially if you don’t need to worry about malicious input when building your HashSet.
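A self-contained sketch of what swapping the hasher looks like, here with a hand-rolled FNV-1a for illustration only (in practice you'd pull in rustc-hash or ahash rather than writing your own):

```rust
use std::collections::HashSet;
use std::hash::{BuildHasherDefault, Hasher};

// Minimal FNV-1a hasher: far cheaper per byte than the default SipHash,
// but with no hash-flooding (DoS) resistance.
struct Fnv1a(u64);

impl Default for Fnv1a {
    fn default() -> Self {
        Fnv1a(0xcbf2_9ce4_8422_2325) // FNV-1a 64-bit offset basis
    }
}

impl Hasher for Fnv1a {
    fn finish(&self) -> u64 {
        self.0
    }
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= u64::from(b);
            self.0 = self.0.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
        }
    }
}

type FastSet<T> = HashSet<T, BuildHasherDefault<Fnv1a>>;

fn main() {
    let mut items: FastSet<String> = FastSet::default();
    items.insert("1BoatSLRHtKNngkdXEeobR76b53LETtpyT".to_string());
    assert!(items.contains("1BoatSLRHtKNngkdXEeobR76b53LETtpyT"));
    assert!(!items.contains("missing"));
    println!("ok");
}
```

The only change from the std default is the second type parameter on HashSet; insert/contains code stays identical.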

1 Like

You're reading in the entire 2 GB file as one String, then allocating another 2 GB in the form of a String per line. Try just reading the input as a String per line to begin with.
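For instance, streaming with BufRead::lines allocates one String per line without ever holding the whole 2 GB input as a second copy (the file path below is just for the demo):

```rust
use std::collections::HashSet;
use std::fs::File;
use std::io::{BufRead, BufReader, Write};

fn load_lines(path: &str) -> std::io::Result<HashSet<String>> {
    let reader = BufReader::new(File::open(path)?);
    let mut items = HashSet::new();
    for line in reader.lines() {
        let line = line?; // \n and \r\n are already stripped by lines()
        if !line.is_empty() {
            items.insert(line);
        }
    }
    Ok(items)
}

fn main() -> std::io::Result<()> {
    // Demo with a small temp file; the real input would be the 2 GB file.
    let path = std::env::temp_dir().join("demo_addresses.txt");
    File::create(&path)?.write_all(b"abc\r\ndef\n\nabc\n")?;
    let items = load_lines(path.to_str().unwrap())?;
    assert_eq!(items.len(), 2); // duplicates and empty lines dropped
    assert!(items.contains("abc"));
    println!("ok");
    Ok(())
}
```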

Incidentally, both lines methods trim both \n and \r\n so your trim only matters if you have a \r\r\n. (And if you do have those, you'd want to do the trimming before checking for emptiness too.)

Preallocating the HashSet[1] and using extend instead of collect may avoid some reallocations.

Write some benchmarks if you haven't already.


  1. whichever one you use ↩︎
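A sketch of the preallocation idea with std's HashSet (the same applies to whichever set type you use). The capacity figure is an assumption you'd derive from file size divided by average line length; it only needs to be in the right ballpark:

```rust
use std::collections::HashSet;

fn main() {
    // Stand-in for the 2 GB file_content String.
    let file_content = "line1\nline2\n\nline3\n";

    // Hypothetical estimate: file size / average bytes per line.
    let estimated_lines = 4;

    let mut items: HashSet<String> = HashSet::with_capacity(estimated_lines);
    items.extend(
        file_content
            .lines()
            .filter(|l| !l.is_empty())
            .map(str::to_owned),
    );

    assert_eq!(items.len(), 3);
    println!("ok");
}
```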

2 Likes

Try just reading the input as a String per line

You might also want to try the opposite: make items borrow &strs from file_content instead of owning individual Strings. That will reduce the allocator overhead (time and space) per string and make the HashSet elements smaller by one usize, but more importantly for lookups, it'll keep all the strings closer to each other in memory, improving the cache locality. (Assuming the strings are short.)

(As always, benchmark to see what difference it makes, rather than assuming any advice is good.)
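Concretely, keeping the big String alive and collecting a HashSet<&str> of slices borrowed from it:

```rust
use std::collections::HashSet;

fn main() {
    // Stand-in for fs::read_to_string(PATH)?. The set borrows from this
    // String, so it must outlive `items`.
    let file_content = String::from("addr1\naddr2\n\naddr3\n");

    let items: HashSet<&str> = file_content
        .lines()
        .filter(|l| !l.is_empty())
        .collect();

    let strg = String::from("addr2"); // a generated candidate
    assert!(items.contains(strg.as_str())); // lookup borrows, no allocation
    assert!(!items.contains("addr4"));
    println!("ok");
}
```

The borrow checker will enforce that file_content is not dropped or moved while items exists.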

4 Likes

Yeah, I'm doing this. I'm unsure about getting the pubkey into a str — is that the library's problem?

// BTC puzzle
(STARTING_NUMBER..=ENDING_NUMBER)
    .into_par_iter()
    .for_each(|num| {
        let key = KeyPair::new(num).unwrap();

        // Address from the bitcoin crate
        let address = Address::p2pkh(key.pk, Network::Bitcoin);
        let address_str = address.to_string();

        if addresses.contains(&address_str) {
            println!("{address_str} found");
        }
    });


impl KeyPair {
    pub fn new(seed: u64) -> Result<Self> {
        let secp = Secp256k1::new();

        let mut rng = rand::rngs::StdRng::seed_from_u64(seed);

        let sk: SecretKey = SecretKey::new(&mut rng);
        let pk: PublicKey = PublicKey::new(sk.public_key(&secp));

        Ok(KeyPair { pk, sk })
    }
}

You'd have to write your own lines equivalent,[1] but another possibility is reading the file as bytes and using items.contains(strg.as_bytes()) (to avoid checking the 2 GB for UTF-8 correctness).


  1. handling \r\n is the annoying part ↩︎
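A sketch of the bytes approach, splitting on b'\n' by hand and trimming a trailing \r:

```rust
use std::collections::HashSet;

fn main() {
    // Stand-in for fs::read(PATH)?: raw bytes, no UTF-8 validation pass.
    let data: Vec<u8> = b"abc\r\ndef\n\nghi".to_vec();

    let items: HashSet<&[u8]> = data
        .split(|&b| b == b'\n')
        .map(|line| line.strip_suffix(b"\r").unwrap_or(line)) // handle \r\n
        .filter(|line| !line.is_empty())
        .collect();

    let strg = String::from("def"); // a generated candidate
    assert!(items.contains(strg.as_bytes()));
    assert_eq!(items.len(), 3);
    println!("ok");
}
```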

2 Likes

And if you want to go more in-depth, perhaps a data structure other than a HashSet is in order.
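One such alternative (my interpretation, not spelled out in the post): a sorted Vec with binary search, which keeps the entries in one contiguous allocation and avoids hashing entirely:

```rust
fn main() {
    let mut items: Vec<String> = ["addr_b", "addr_a", "addr_c"]
        .iter()
        .map(|s| s.to_string())
        .collect();

    // Sort once up front; each lookup is then O(log n) comparisons.
    items.sort_unstable();

    let hit = items
        .binary_search_by(|probe| probe.as_str().cmp("addr_b"))
        .is_ok();
    assert!(hit);

    let miss = items
        .binary_search_by(|probe| probe.as_str().cmp("addr_z"))
        .is_ok();
    assert!(!miss);
    println!("ok");
}
```

Whether this beats a hash set depends on key length and lookup volume, so benchmark it against the HashSet version.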

2 Likes

You're allocating a string for every lookup. If you care about lookup performance, that’s the first thing you should fix.

Compare the address data directly, instead. Is this the Address type you're using? It implements Eq and Hash so you can put it in the hash table directly. That will be much more efficient than using strings at all.

You'll need to convert the strings in the file to addresses, but that should be much faster overall unless most of the file is never needed (and perhaps even then).

4 Likes

If you are just looking up lines, maybe the Aho–Corasick algorithm is what you want.

1 Like

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.