I am new to Rust and started learning by doing small projects. Now I am unsure what the best data structure (and the most Rust-idiomatic way of implementing it) would be to achieve the following:
I retrieve data via a REST API (a JSON object). Each entry contains a "description" field. I would like to create a config.toml file that is read at runtime and in which a user can specify keywords and categories (potentially lots of them). For example:
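A minimal sketch of what I have in mind (the keyword and category names are just placeholders):

```toml
# Each key is a keyword to search for in the description,
# the value is the category it maps to.
[categories]
keyword_a = "category_1"
keyword_b = "category_1"
keyword_c = "category_2"
```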
The entries of the JSON object shall be categorized based on the specified categories. My first idea was to store the categories in a HashMap that would look like this:
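Something like this (a sketch; the keyword and category names are placeholders):

```rust
use std::collections::HashMap;

// Build the keyword -> category map. In the real program this would
// be filled from the parsed config.toml; these entries are placeholders.
fn build_categories() -> HashMap<String, String> {
    let mut m = HashMap::new();
    m.insert("keyword_a".to_string(), "category_1".to_string());
    m.insert("keyword_b".to_string(), "category_1".to_string());
    m.insert("keyword_c".to_string(), "category_2".to_string());
    m
}

fn main() {
    let categories = build_categories();
    // Exact-match lookup is O(1):
    assert_eq!(
        categories.get("keyword_a").map(String::as_str),
        Some("category_1")
    );
}
```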
With this, the lookup for the category would be efficient. However, this only works if the description field contains exactly the specified keyword. Unfortunately, there may be text before and/or after the keyword. It is guaranteed that a keyword, if it occurs in the description, appears only once in this field and that no other keyword is present, e.g. "description": "text text keyword_a text text" but not "description": "keyword_a text text keyword_b text keyword_a text".
As far as I can tell, the only way to determine each data entry's category is to loop over all keys in the HashMap and do something like
for (key, value) in &categories {
    if description.contains(key.as_str()) {
        category = value;
    }
}
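Put together, something like this (a sketch with placeholder names):

```rust
use std::collections::HashMap;

// Find the category whose keyword occurs somewhere in the description.
// This is linear in the number of keywords, since every key has to be
// tried with contains().
fn categorize<'a>(
    categories: &'a HashMap<String, String>,
    description: &str,
) -> Option<&'a str> {
    categories
        .iter()
        .find(|(keyword, _)| description.contains(keyword.as_str()))
        .map(|(_, category)| category.as_str())
}

fn main() {
    let mut categories = HashMap::new();
    categories.insert("keyword_a".to_string(), "category_1".to_string());

    assert_eq!(
        categorize(&categories, "text text keyword_a text text"),
        Some("category_1")
    );
    assert_eq!(categorize(&categories, "no match here"), None);
}
```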
What you can do depends on how freeform the description and keywords are.
One option is to split the description into words and then look up each word in the keyword map. However, this will fail to match if the word definition you use for splitting doesn't match the keywords (for example, what if a keyword contains punctuation?). Depending on the nature of your data, this may already be well-defined for you, or it may not be. You can also use several different splitting strategies and check all the resulting words. The unicode-segmentation crate would be one place to start for a robust implementation of one definition of "word", though your application might only need to split on spaces.
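A sketch of the word-splitting approach using only std and plain whitespace splitting (a more robust word definition would swap in unicode-segmentation here):

```rust
use std::collections::HashMap;

// Look up each whitespace-separated word in the keyword map.
// O(words in description) with an O(1) hash lookup per word, but it
// misses keywords that are glued to punctuation or contain spaces.
fn categorize_by_words<'a>(
    categories: &'a HashMap<String, String>,
    description: &str,
) -> Option<&'a str> {
    description
        .split_whitespace()
        .find_map(|word| categories.get(word))
        .map(String::as_str)
}

fn main() {
    let mut categories = HashMap::new();
    categories.insert("keyword_a".to_string(), "category_1".to_string());

    assert_eq!(
        categorize_by_words(&categories, "text text keyword_a text"),
        Some("category_1")
    );
    // Fails when punctuation is attached to the keyword:
    assert_eq!(categorize_by_words(&categories, "text keyword_a, text"), None);
}
```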
If you want to do exactly what contains() does, but faster, you can use a search algorithm that precomputes an efficient way to search for many strings at once; this is what the aho-corasick crate does.
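A sketch of that approach (requires the aho-corasick crate as a dependency; the keyword and category names are placeholders):

```rust
use aho_corasick::AhoCorasick;

fn main() {
    // Keywords and their categories at matching indices.
    let keywords = ["keyword_a", "keyword_b"];
    let categories = ["category_1", "category_2"];

    // Build the automaton once, then reuse it for every description.
    let ac = AhoCorasick::new(&keywords).unwrap();

    let description = "text text keyword_a text text";
    if let Some(mat) = ac.find(description) {
        // mat.pattern() is the index of the keyword that matched.
        let category = categories[mat.pattern().as_usize()];
        println!("matched category: {category}");
    }
}
```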
What is "lots of them"? Hundreds? Thousand? Millions?
Networking is extremely slow compared to CPU speed.
Is every entry fetched separately with a request/response pair? Or does one response contain many entries? How many? Tens? Hundreds? More?
Yeah, there are a lot of text-searching algorithms out there. But I suspect that in the context of REST, the Internet, and JSON, the bottleneck will almost always be the network and only rarely the CPU. So simple loops and iterations will often be no problem at all.
Yes, of course, but since this is more of a toy project dedicated to learning, I am willing to take the "over-engineered" route. From a practical point of view, a simple loop would suffice, but that is not the point.
The number of keywords may be anywhere between 10 and 100. It is one REST call that receives 1 to 1000 data sets at once, so networking is not the limiting factor here.