any micro optimization will do, i need this to be as fast as possible, its a simple application. is hashset good for lookups? im searching a giant file
let file_content = fs::read_to_string(PATH)?;
//this takes too long, 2gb
let items: HashSet<String> = file_content
.lines()
.filter(|x| !x.is_empty())
.map(|x| x.trim_end_matches('\r').to_string())
.collect();
// this is contained in a threadpool, im generating items and comparing them here, (100% cpu usage) i need this to be optimized
//strg is = to a String, and is to be casted into .as_str()
if items.contains(&strg) {
println!("{strg} valid match");
}
HashSet uses a somewhat expensive DoS-resistant hash algorithm by default. Consider using a faster hash like ahash or rustc-hash, especially if you don’t need to worry about malicious input when building your HashSet.
You're reading in the entire 2G file as one String, then allocating another 2G in the form a String per line. Try just reading the input as a String per line to begin with.
Incidentally, both lines methods trim both \n and \r\n so your trim only matters if you have a \r\r\n. (And if you do have those, you'd want to do the trimming before checking for emptiness too.)
You might also want to try the opposite: make items borrow &strs from file_content instead of owning individual Strings. That will reduce the allocator overhead (time and space) per string and make the HashMap elements smaller by one usize, but more importantly for lookups, it'll keep all the strings closer to each other in memory, improving the cache locality. (Assuming the strings are short.)
(As always, benchmark to see what difference it makes, rather than assuming any advice is good.)
You'd have to write your own lines equivalent,[1] but another possibility is inputting the file as bytes and using map.contains(strg.as_bytes()) (to avoid checking the 2G for UTF8 correctness).
You're allocating a string for every lookup. If you care about lookup performance, that’s the first thing you should fix.
Compare the address data directly, instead. Is this the Address type you're using? It implements Eq and Hash so you can put it in the hash table directly. That will be much more efficient than using strings at all.
You'll need to convert the strings in the file to addresses, but that should be much faster overall unless most of the file is never needed (and perhaps even then).