Unicode, string segmentation, normalization, matching and tries

I've got some vague questions about how to deal with fuzzy and prefix matching on Unicode strings, and I'd like to ask them before launching into it blindly:

The context

After not finding any kind of reasonable offline dictionary app to use on Linux (maybe there is one that I haven't found), I thought I should write one myself. The project, empty for now, is here, and I've already got someone helping me out. My plan is more or less:

  • It should essentially start as an entry box with no results yet.
  • At each keypress, an updated list of matches is provided.
  • Candidates are listed by some combination of prefix and Levenshtein-distance ordering.
  • One can furthermore restrict the input and output languages to taste.
  • The program uses dictd database files, at least to start with.

I've got a vague plan here with more details, but the short of it is: I want something with as little friction as possible, with good defaults, offline, and nothing too intelligent. Ideally both a TUI (think the fzf interactive interface) and a GNOME UI.

Now for

My questions

To match queried terms against a dictionary and get a list of matches, I believe the "right" data structure is a trie:

  • If I want to look for all words of which "foo" is, up to Levenshtein distance N, a prefix whose corresponding suffix is of length at most K, I just have to walk the trie along all words at Levenshtein distance at most N from "foo", and then collect all of their descendants at most K further levels down (see the toy sketch after this list).
  • Given matches for foo, I can probably get matches for foob without recomputing everything.
  • I can also match against regexes quite easily.
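For concreteness, here is a toy sketch of the "collect everything at most K further levels down" step. The Levenshtein walk is left out, and the names (Node, words_within, etc.) are made up for illustration; a real implementation would likely build on an existing crate:

```rust
use std::collections::BTreeMap;

// A toy trie over codepoints, just to make the walk concrete.
#[derive(Default)]
struct Node {
    children: BTreeMap<char, Node>,
    is_word: bool,
}

impl Node {
    fn insert(&mut self, word: &str) {
        let mut node = self;
        for c in word.chars() {
            node = node.children.entry(c).or_default();
        }
        node.is_word = true;
    }

    // Follow the path spelled by `prefix`, if it exists.
    fn walk(&self, prefix: &str) -> Option<&Node> {
        let mut node = self;
        for c in prefix.chars() {
            node = node.children.get(&c)?;
        }
        Some(node)
    }

    // Collect all words hanging under `self` whose remaining suffix has
    // length at most `k` (the "at most further length K" step).
    fn words_within(&self, k: usize, prefix: &mut String, out: &mut Vec<String>) {
        if self.is_word {
            out.push(prefix.clone());
        }
        if k == 0 {
            return;
        }
        for (&c, child) in &self.children {
            prefix.push(c);
            child.words_within(k - 1, prefix, out);
            prefix.pop();
        }
    }
}

fn main() {
    let mut root = Node::default();
    for w in ["foo", "foobar", "food", "fox"] {
        root.insert(w);
    }
    let mut out = Vec::new();
    if let Some(node) = root.walk("foo") {
        node.words_within(1, &mut String::from("foo"), &mut out);
    }
    // "foobar" is excluded: its suffix after "foo" is 3 levels deep.
    assert_eq!(out, vec!["foo", "food"]);
}
```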

But then I'm faced with multiple questions:

  1. Is what I'm saying above reasonable?
  2. What's the proper way to segment a Unicode string to this end?
  3. How and when should I normalize my strings (should I normalize to ascii even)?
  4. How about the German ß ("sharp s") matching "ss", or the Nordic Æ and the French œ?
  5. What's the best way to attack this?
  6. If foo has bar as its only suffix, yielding the word foobar, should I have a shortcut edge from foo to foobar labeled bar?
  7. Is there some crate that would do all the heavy lifting for me without constraining me to preexisting design choices?

My thinking is that I should just:

  • Normalize
  • Call str.chars() to segment (hence keep all diacritics); see the sketch after this list.
  • Depending on the language and settings, walk the trie in such a way that, say, é and e are considered equal, or that ss and the German ß are too.
    But this would be done with some kind of language-specific setting, and would essentially be equivalent to matching against a regex.
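A minimal sketch of the first two steps, assuming the unicode-normalization crate for the NFD pass:

```rust
use unicode_normalization::UnicodeNormalization;

fn main() {
    // After NFD, "é" decomposes into "e" + U+0301, so chars() yields
    // the base letter and the combining accent as separate codepoints,
    // keeping the diacritic around for language-aware comparison.
    let segmented: Vec<char> = "café".nfd().collect();
    assert_eq!(segmented, vec!['c', 'a', 'f', 'e', '\u{301}']);
}
```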

In conclusion

Since the Rust community already has lots of experience with all these text-wrangling questions (ripgrep, tantivy, skim, etc.), I thought I'd ask before hacking up dumb solutions and reinventing the wheel while Ferraris drive around :slight_smile:

To be honest, these are IMO difficult questions. From my read of your problem description, I actually think you should just go out and do what you think is best. It seems like you have the requisite awareness of the thorniness of Unicode, and that's probably the most important bit. Basically, I'd advise you to just go out and build something based on some reasonable set of assumptions, and then iterate based on user feedback.

There are a lot of Rust projects out there that can help you with the nuts and bolts (fst for the trie and Levenshtein queries, regex-automata for searching the trie with a regex), but they aren't going to tell you how to assemble everything into a coherent user experience.
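As a minimal sketch of the Levenshtein side, assuming fst 0.4 with its levenshtein feature enabled:

```rust
use fst::automaton::Levenshtein;
use fst::{IntoStreamer, Set};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Keys must be inserted in lexicographic order.
    let set = Set::from_iter(vec!["fob", "foo", "food", "foul"])?;

    // All keys within edit distance 1 of "foo".
    let lev = Levenshtein::new("foo", 1)?;
    let matches = set.search(lev).into_stream().into_strs()?;

    // "foul" is two edits away, so it is excluded.
    assert_eq!(matches, vec!["fob", "foo", "food"]);
    Ok(())
}
```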

Take Levenshtein distance, for example. What are the units of edit distance? Bytes? Codepoints? Graphemes? Bytes are easy to do, codepoints are a little trickier, and I'm not sure how one would do graphemes efficiently. (The Levenshtein support in fst does it by codepoint.) Maybe codepoints are "good enough." Maybe not. It really depends on how you want your user experience to work.
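To make the "units" question concrete, here is the classic dynamic-programming edit distance computed over codepoints (what str::chars() yields); the same loop over bytes() would count the two-byte UTF-8 encoding of "é" as two edits away from "e". Illustrative only:

```rust
// Edit distance over codepoints rather than bytes.
fn levenshtein_codepoints(a: &str, b: &str) -> usize {
    let b: Vec<char> = b.chars().collect();
    // prev[j] = distance between the current prefix of `a` and b[..j].
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.chars().enumerate() {
        let mut cur = vec![i + 1];
        for (j, &cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            // substitution/match, deletion, insertion
            cur.push((prev[j] + cost).min(prev[j + 1] + 1).min(cur[j] + 1));
        }
        prev = cur;
    }
    prev[b.len()]
}

fn main() {
    // One edit at the codepoint level; a byte-level version would say 2.
    assert_eq!(levenshtein_codepoints("é", "e"), 1);
    assert_eq!(levenshtein_codepoints("foo", "food"), 1);
}
```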

Normalization is another tricky one. I might start by stuffing NFKD forms into the trie, and then converting all queries to NFKD too. But maybe that's not the right approach; and if you use NFKD, you might also want to store the NFD form so that you can round-trip what the user put into your trie. You can't with NFKD, because it is lossy by design.
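A small sketch of the difference, assuming the unicode-normalization crate:

```rust
use unicode_normalization::UnicodeNormalization;

fn main() {
    // NFKD folds compatibility characters: the "ﬁ" ligature becomes
    // plain "fi". Useful for matching, but lossy by design.
    assert_eq!("ﬁ".nfkd().collect::<String>(), "fi");

    // NFD is canonical-only and round-trips through NFC.
    let nfd = "é".nfd().collect::<String>();
    assert_eq!(nfd, "e\u{301}");
    assert_eq!(nfd.nfc().collect::<String>(), "é");
}
```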

Overall, if you want to do this sort of thing the "right" way with respect to Unicode, then I personally would say that this is quite a sizeable project. I say that only to make sure you know what you're getting yourself into. :slight_smile:

6 Likes

Thanks a lot for the reply! I learned about your fst crate just after submitting the question.
I don't care that much for doing unicode the "right" way as long as the user experience is good enough.
As far as I can tell, this article covers everything I need about querying the trie.
Does it expose enough API to be able to get all the strings that match, say:

  1. First, anything at edit distance at most two from "foo"
  2. then something matching the regex [bp](ar|az)
  3. followed by any string of length at most 3?

(It's a bit of an artificial example, but I'm wondering how much API the crate exposes.)
Since you can match against regexes and Levenshtein automata, and you can manually follow arrows, this looks doable, but maybe combining them is more complicated.
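For instance, here is a minimal sketch of one such combination, assuming fst 0.4, whose Automaton trait provides combinators like starts_with:

```rust
use fst::automaton::Levenshtein;
use fst::{Automaton, IntoStreamer, Set};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let set = Set::from_iter(vec!["foobar", "foobaz", "fozbar", "oobar"])?;

    // Anything whose prefix is within edit distance 1 of "foo",
    // built by combining the Levenshtein automaton with starts_with.
    let lev_prefix = Levenshtein::new("foo", 1)?.starts_with();
    let matches = set.search(lev_prefix).into_stream().into_strs()?;

    // "fozbar" and "oobar" match via the prefixes "foz" and "oo".
    assert_eq!(matches, vec!["foobar", "foobaz", "fozbar", "oobar"]);
    Ok(())
}
```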
It also looks like I can manually do the ss versus German ß trick and similar hacks by working with the arrows themselves.

Yes, that all sounds doable with fst's structure off the top of my head.

1 Like

I tried using your crate, and it's a pleasure to use: everything works as smoothly as my skills allow.
As a potentially very important simplification, we might use the deunicode crate for mapping into the trie, and only filter the results we get from it afterwards.
This avoids having to worry about ss vs. the German ß, œ, and friends to start with.
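A minimal sketch of that idea, using deunicode's top-level deunicode function:

```rust
use deunicode::deunicode;

fn main() {
    // deunicode transliterates to ASCII, folding the tricky cases
    // up front before the string ever reaches the trie.
    assert_eq!(deunicode("straße"), "strasse");
    assert_eq!(deunicode("œuvre"), "oeuvre");
    assert_eq!(deunicode("Ålesund"), "Alesund");
}
```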

1 Like

Mmh, I've got a few questions about that crate:
I would like to be able to:

  1. Order returned matches by length: say I query with a StartsWith automaton; I would like it to return exact matches first, then matches that have one more letter, then two, etc. I can't find anything in the API telling me that this will be the case. Is it?
  2. Have an and-then operation on automata, to be able to match strings that are concatenations of matches from two automata. Does that exist somewhere already?

I believe both of these could be achieved using primitive operations, but if they're hidden behind a nice API, that would be better. If they're not already present, would they fit in fst in theory?

I think if we're going to start getting into the specifics of a particular crate, then it would be better to take it to that crate's issue tracker... But quickly:

  1. No, you can't control the order. The order is determined by the automaton traversal, which is lexicographic. So you'll need to find all matches and then re-order them (see the sketch after this list).
  2. Not sure. I'd need to see the code you're using.
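For what it's worth, that re-ordering can be a plain sort after collecting; a minimal sketch (sort_by_key is stable, so equal-length matches keep their lexicographic order):

```rust
// Re-order lexicographically sorted matches by length, so exact-length
// matches come first, then one-letter-longer ones, and so on.
fn order_by_length(mut matches: Vec<String>) -> Vec<String> {
    matches.sort_by_key(|m| m.chars().count());
    matches
}
```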

Personally, I would suggest that you focus on getting your project working. Once it's working, if there is perhaps some room for simplifying aspects of the interaction with the fst crate, then it might make sense to swing back around and see about API changes to fst. At that point, you'll have a nice solid use case that presumably we could take a look at in more depth.

1 Like
