How to create a union of variable number of items via an interface that only supports union of 2 items?

I am using the excellent FST library by BurntSushi to implement custom search for my app and I do this in order to search for every word in the search term:

        term_list.for_each(|term| {
                let prefix_matcher = fst::automaton::Str::new(term).starts_with();
                let union = fst::automaton::Levenshtein::new(term, 2)
                    .expect("build automaton")
                    .union(prefix_matcher);
                let mut stream = self.fst.search(union).into_stream();
                while let Some((_, v)) = stream.next() {
                    let (_, locs) = self.by_word_vec.get(v as usize).unwrap();
                    pre_filtered.extend(locs);
                }
        });

Now, the library provides a way to do unions between two search automatons (as demonstrated above and documented here), but I don't see how to do one full union between all the search terms (in both prefix and proximity search modes), so that I don't have to iterate over the words (which is a new search in the FST and makes it O(n) times slower when I have more words). I did experiment with hardcoding some functions that do unions between a fixed length list of words and it does seem to be faster than O(n).

My question is what would be the right solution here to create a "generic" automaton from a variable-length list of words (nothing crazy, can be capped at something like n<10)

I think the most flexible answer is to use a OpBuilder. There's an example of such in the crate docs: fst - Rust. That should let build prefix and Levenshtein automata for each term, add them all to one search and then compute a union from all of them.

I don't believe that approach yields an order complexity improvement from your existing approach, but it should let you execute a single search and iterate over one set of results. That might wind up being faster than executing N searches from constant factors alone.

Now if you want to have a fixed set of automata (e.g., one prefix automata and one Levenshtein automata), then that is going to be a bit harder. For prefix automata, probably the simplest approach is using the regex-automata crate with the transducer feature enabled, and then writing a regex to express your prefix automaton with all of your terms. But the Levenshtein automaton will be harder. It doesn't support building an automaton from multiple terms. So you'll need to either go with the OpBuilder approach, or you'll have to implement your own automaton that supports fuzzing finding for multiple terms.

2 Likes

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.