Unicode, string segmentation, normalization, matching and tries

To be honest, these are IMO difficult questions. From my read of your problem description, I actually think you should just go out and do what you think is best. It seems like you have the requisite awareness of the thorniness of Unicode, and that's probably the most important bit. Basically, I'd advise you to build something based on some reasonable set of assumptions, and then iterate based on user feedback.

There are a lot of Rust projects out there that can help you with the nuts & bolts (fst for the trie and levenshtein queries, regex-automata for searching the trie with a regex), but they aren't going to tell you how to assemble everything into a coherent user experience.
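To make the trie side concrete, here's a minimal sketch with fst (the keys are made up for illustration; fst requires them to be provided in lexicographic byte order):

```rust
use fst::Set;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Keys must already be sorted in lexicographic (byte) order.
    let set = Set::from_iter(vec!["focus", "foo", "food", "foul"])?;

    // Exact membership lookup against the trie.
    assert!(set.contains("foo"));
    assert!(!set.contains("fool"));
    Ok(())
}
```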

Take Levenshtein distance, for example. What are the units of edit distance? Bytes? Codepoints? Graphemes? Bytes are easy to do, codepoints are a little trickier, and I'm not sure how one would do graphemes efficiently. (The Levenshtein support in fst does it by codepoint.) Maybe codepoints are "good enough." Maybe not. It really depends on how you want your user experience to work.
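For concreteness, here's roughly what a codepoint-level fuzzy lookup looks like with fst's Levenshtein automaton (this assumes the crate's `levenshtein` feature is enabled; the keys are just illustrative):

```rust
use fst::automaton::Levenshtein;
use fst::{IntoStreamer, Set};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Illustrative key set, pre-sorted in byte order.
    let set = Set::from_iter(vec!["fa", "fo", "fob", "focus", "foo", "food", "foul"])?;

    // Match all keys within codepoint edit distance 1 of "foo".
    let lev = Levenshtein::new("foo", 1)?;
    let keys = set.search(lev).into_stream().into_strs()?;
    assert_eq!(keys, vec!["fo", "fob", "foo", "food"]);
    Ok(())
}
```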

Normalization is another tricky one. I might start by stuffing NFKD forms into the trie, and then converting all queries to NFKD too. But maybe that's not the right approach, and if you use NFKD, you might also want to store the NFD form so that you can round-trip what the user put into your trie. You can't do that with NFKD alone because it is lossy by design.
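A sketch of that normalization step, using the unicode-normalization crate (my choice here, not something you're bound to), to show how NFKD folds away information that NFD keeps:

```rust
use unicode_normalization::UnicodeNormalization;

fn main() {
    // NFKD applies compatibility decompositions: the "fi" ligature
    // (U+FB01) becomes the two letters "fi", which is lossy.
    let key = "ﬁle";
    let nfkd: String = key.nfkd().collect();
    assert_eq!(nfkd, "file");

    // NFD only applies canonical decompositions, so the ligature survives.
    let nfd: String = key.nfd().collect();
    assert_eq!(nfd, "ﬁle");
}
```

So a query normalized to NFKD will match more loosely, but you can't recover the original spelling from the NFKD key alone.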

Overall, if you want to do this sort of thing the "right" way with respect to Unicode, then I personally would say that this is quite a sizeable project. I say that only to make sure you know what you're getting yourself into. :slight_smile:
