Fast Key Lookup - HashMap with Regex?

Hello,

I have a vector of custom structs and one of the fields is a key in fixed string format (by fixed I mean no wildcards etc). There are about 50,000 structs or elements in the vector.

struct Item {
    field1: String,
    field2: String,
    key: String,
}

let mut items: Vec<Item> = Vec::with_capacity(50_000);
//then push each Item into items

In my program I am reading in a table from a CSV file, and I am wondering what the most efficient way is of taking the key field from each of the Item structs above and finding a match in my CSV table. The CSV table also has a string key column, but this time the string key contains wildcards.

I was thinking of reading the CSV table into a HashMap, with the key field as the HashMap key and the value being a struct composed of the fields of each row in the CSV file:

struct Row {
    field1: String,
    field2: String,
    key: String, //this string field has regex characters and wildcards etc.
}

let mut rows: HashMap<String, Row> = HashMap::with_capacity(80_000);
//then push each Row into rows

The reason I was thinking of using a HashMap is speed of lookup. The only problem is that, as I understand it, a HashMap looks for an exact match on the key.

If I take an example of where the key of a particular Item is "abc123" and the key of a row in the HashMap is "ab[cd]123", I would like this to be a match. But I am not sure how I can get the HashMap to match on regexes.

for item in &items {
    println!("{:?}", rows.get(&item.key)); // "abc123" will not match the string literal "ab[cd]123"
}

Does anyone have any thoughts or ideas? I am not too sure how I can build regex into the HashMap key or perhaps there is another/better way of searching?

I am trying to avoid a brute force for loop where I take each item in a vector of 50_000 items and then search through a vector of 80_000 rows to find a match. That seems very inefficient!

I hope this question makes sense!

What about using a Finite State Transducer? It's a specialised data structure, similar to a trie, that represents the collection as one big finite state machine, with each character of a key representing a state. There's a really high-quality Rust implementation that supports searching via a regex or by string similarity:

The author, @BurntSushi, also wrote an article explaining how it works and the problem they were trying to solve.

Don't forget that computers are fast! Doing a linear scan through 50,000 rows may only take a couple milliseconds.

It'd be even faster if you replaced the String fields with &str slices that borrow from the original text (i.e. after you read the entire thing into memory with std::fs::read_to_string()). That way the data you are searching is laid out contiguously in memory (good for the cache and the prefetcher), which gives you a massive performance improvement.
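A minimal sketch of that linear-scan idea, with &str fields borrowing from one backing String. The names (Item, Row, matches_key, find_match) are illustrative, and matches_key is a plain equality check standing in for a real wildcard/regex match:

```rust
// Item and Row borrow their keys from one backing String instead of
// owning separate allocations, keeping the searched data contiguous.
struct Item<'a> {
    key: &'a str,
}

struct Row<'a> {
    key: &'a str, // in the real program, a pattern with wildcards
}

// Placeholder predicate; the real program would run a regex match here.
fn matches_key(pattern: &str, key: &str) -> bool {
    pattern == key
}

// Linear scan: check every row until one matches the item's key.
fn find_match<'a>(rows: &'a [Row<'a>], item: &Item<'_>) -> Option<&'a Row<'a>> {
    rows.iter().find(|row| matches_key(row.key, item.key))
}

fn main() {
    // e.g. the result of std::fs::read_to_string(...)
    let backing = String::from("abc123\nxyz789");
    let rows: Vec<Row> = backing.lines().map(|line| Row { key: line }).collect();

    let item = Item { key: "abc123" };
    println!("match found: {}", find_match(&rows, &item).is_some());
}
```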

1 Like

The Hash in HashMap means that a hash is generated for each key, an integer value (a u64, IIRC), which allows fast comparison.

If you need to search for several keys, you either have to check all possible keys one by one, or you need something specialised for your use case. If you only need the HashMap to search for regexes, then instead of storing HashMap<K, V>, store HashMap<R, Vec<V>> where R is the regex, i.e. you hash the regex instead of the regular key. If the same value must be stored under several different regexes, you may need HashMap<R, Vec<Rc<V>>> instead, unless copying/cloning V is an option.

If you need the HashMap for both regular keys and regexes, then the solution is to keep two HashMaps, one for each type of lookup. You definitely want to keep both maps in the same struct and write convenient methods for insertion, then.

You can also use a HashMap<R, Vec<I>> plus a separate Vec<V>, where I is an index into that vector, to get around using Rc.
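A small sketch of that index-based layout. Since a compiled regex isn't hashable, the pattern's source string stands in for R here; Row and its fields are made-up names:

```rust
use std::collections::HashMap;

// All rows live in one Vec; the map stores only indices into it, so a
// pattern can point at many rows without Rc or cloning.
struct Row {
    field1: String,
}

fn main() {
    let rows: Vec<Row> = vec![
        Row { field1: "first".into() },
        Row { field1: "second".into() },
    ];

    // Pattern string -> indices of the rows stored under that pattern.
    let mut by_pattern: HashMap<String, Vec<usize>> = HashMap::new();
    by_pattern.entry("ab[cd]123".to_string()).or_default().push(0);
    by_pattern.entry("ab[cd]123".to_string()).or_default().push(1);

    if let Some(indices) = by_pattern.get("ab[cd]123") {
        for &i in indices {
            println!("{}", rows[i].field1);
        }
    }
}
```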

Pick whatever suits your needs.

1 Like

I would echo @Michael-F-Bryan's advice: try a linear search first. That might be fast enough for you.

Otherwise, I'm still trying to understand your problem. It sounds like you have a bunch of regexes and you want to find which of them matches a particular string. That is pretty much exactly what a RegexSet is. I think the main problem is that it might not do so well with so many regexes, but it's worth a try. It should hopefully be faster than a linear scan.

If you had the inverse problem, e.g., a set of strings that you wanted to match against a single regex, then an FST might be appropriate. An FST can store a set of strings, and since it is itself an automaton, you can run a regex against it. Here's an example of that: https://github.com/BurntSushi/regex-automata/blob/master/examples/fst.rs

2 Likes

That’s also pretty much the description of a traditional lexer, which compiles all of the regexes into a single NFA or DFA with multiple accept types. If that’s how RegexSet is implemented, I can’t think of any approach that would be faster (discounting compilation time).

It is, yes. Given the size, it will use an NFA. It could be quite slow. Eventually, regex-automata will support this using traditional DFAs.

1 Like

Hi Everyone,
Thank you all for providing such helpful suggestions. All the links and tips have set out a lot of reading material for me to learn about new techniques and methods.

There are probably lots of ways of approaching this problem but here is what I have been testing...

As suggested by Phlopsi, I attempted to create a HashMap of <Regex, Values>, but Rust complained about Regex not satisfying the Hash trait. I tried to derive a default implementation, but the compiler complained that I could not derive a trait for a type from an external crate (or something along those lines). My current level of expertise meant I had to stop there! :slight_smile:

Next I came across a really useful crate called MultiMap. It is basically like a HashMap, but it lets you store multiple values in a Vec bucket when the keys are the same.
So I took each of my keys and split it in half: one half that is guaranteed to be a literal string (I call this the hash_key) and the other half that contains the wildcards (I call this the regex_key).

I then pushed the fixed hash_keys into a MultiMap, with custom structs (containing the regex_keys and other fields) as the values. Because I shortened my keys, there were more hash collisions (as expected), but I found that each bucket contained a vector of at most 30 entries, which is not too many to iterate through to find an exact match.

So my table lookup technique now consists of two steps:

Step 1) Super-fast hash-key lookup --> returns a vector bucket of values
Step 2) vec.iter().find(|x| ...) etc. to obtain the struct I want, using the "regex_key"
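The two steps above can be sketched with std's HashMap using Vec buckets (essentially what MultiMap wraps). Row, regex_matches, and the keys are illustrative, and regex_matches is a plain equality check standing in for the real wildcard match:

```rust
use std::collections::HashMap;

// Each bucket holds every Row whose fixed hash_key half collided.
struct Row {
    regex_key: String,
    field1: String,
}

// Placeholder for the real regex/wildcard match on the key's second half.
fn regex_matches(pattern: &str, rest: &str) -> bool {
    pattern == rest
}

fn main() {
    let mut rows: HashMap<String, Vec<Row>> = HashMap::new();
    // "abc" is the fixed half of the key; "123" is the wildcard half.
    rows.entry("abc".to_string()).or_default().push(Row {
        regex_key: "123".to_string(),
        field1: "payload".to_string(),
    });

    // Step 1: O(1) lookup on the fixed half of the key.
    let bucket = rows.get("abc").expect("no bucket for this hash_key");
    // Step 2: linear scan of the small bucket (~30 entries at most).
    let hit = bucket.iter().find(|r| regex_matches(&r.regex_key, "123"));
    println!("{}", hit.unwrap().field1); // prints "payload"
}
```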

When I run the code in debug mode it takes about 100 seconds, but in release mode the speed is very acceptable to me at approx 7 seconds!

I hope this ramble makes sense and someone else gets something useful from it and from all the excellent replies and links from the guys here!

Thanks.