Modify fst to generate a persistent data structure?

#1

Would it be possible in principle to modify fst to generate a persistent data structure - ie, give the builder an existing fst, and construct a new one where as much state is shared with the original as possible.

The use case I have in mind using fst to index versioned documents where the changes between the documents are much smaller than the documents themselves.

(cc @BurntSushi)

#2

The vast majority of the complexity in the fst crate is in reading and writing its serialization format. There are a number of compression related tricks that take advantage of the structure of the FST and the way in which it is encoded. For example, states with exactly one transition that are encoded right next to the state pointed to by that transition are exceptionally common. As a result, they are encoded with a single byte. The “next” state identifier isn’t encoded at all; instead, it’s inferred based on the location of the state in the FST’s serialized form. There are many such optimizations in the serialization format, and they seem incompatible with building a persistent data structure, insomuch as I can see anyway.

Are FSTs themselves fundamentally incompatible with a persistent representation? I’m not sure. The way to implement such a data structure isn’t immediately obvious to me. So I think if someone really needs this, you might start by creating a new crate (perhaps by forking fst) and seeing if it can be done.

You might have better luck than me at this, but you might try looking into Elasticsearch/Lucene. They also use FSTs internally and support document updates. My understanding of how it works is that the document is deleted and re-indexed completely. New FSTs are thus generated. Querying then requires visiting all of the FSTs. When the number of FSTs gets too large, you start merging some of them.

1 Like
#3

New FSTs are thus generated. Querying then requires visiting all of the FSTs. When the number of FSTs gets too large, you start merging some of them.

I do something similar (in an internal tool). I keep one „big“ FST that I change less often and one „diff“ FST against that one. To build the diff one, I do need to traverse the big one ‒ so building the diff is still almost as expensive as generating the big one. However, distributing the diff to all the machines is much faster and cheaper than replacing the big one.

2 Likes