ANN: fst 0.4 released

I've finally had some time to devote to the fst crate. I've added a few features, did some performance improvements, made some small quality of life improvements to the API and eliminated all required dependencies. Its MSRV has been bumped to Rust 1.40.

For those that don't know, the fst crate provides a data structure for building very large sets (or maps) of keys, where all keys must be &[u8] (and all values must be u64, in the case of maps). Once a set is built, it cannot be changed. Internally, the set uses a finite state machine which will compress both prefixes and suffixes of the keys in a way that permits the set to be searched very quickly without decompressing the entire thing. One of its marquee features is that sets can be both built and searched without loading all of the keys into memory. A common way to use these sets is to memory map them.

A common use case for FSTs is to represent parts of an inverted index. For example, Lucene uses FSTs for fast token lookups. FSTs can be used for a number of other things, for example, it powers fuzzy search in RLS and rust-analyzer.

Here is a high level list of things that come in the 0.4.0 release:

  • The Fst, Set, and Map data types are now generic over anything that implements AsRef<[u8]>. This resolves some outstanding bugs where folks needed to search FSTs that were stored in a manner that older versions of the crate couldn't construct.
  • Because of the aforementioned change, there is no longer any reason for fst to depend on the memmap crate directly. Using memory maps is of course still supported.
  • The byteorder dependency was also removed. The MSRV bump permits the crate to use std's APIs for converting between integers and bytes. This removal brings fst down to zero required dependencies (and zero dependencies by default).
  • The fst-levenshtein crate has been deprecated. Instead, one should enable the levenshtein feature and import it via fst::automaton::Levenshtein.
  • The fst-regex crate has been deprecated. Instead, one should use the regex-automata with the transducer feature enabled to search FSTs with regexes. This permits using more kinds of regexes and also brings massive multiple-orders-of-magnitude speed improvements. (See below.)
  • In some cases, one can now go in the reverse direction and find the key from the value in a map.
  • FSTs generated with fst 0.4.0 or newer can now be verified using a checksum. This verification is not performed by default and must be done explicitly.

For an example of how much faster regex-automata is than the old fst-regex crate:

$ time fst-0.4.0 grep /m/sets/texts/wikipedia/titles.fst 'Bruce \w{12}'
Bruce Bairnsfather
Bruce Charlesworth
Bruce Herschensohn

real    0.026
user    0.022
sys     0.003
maxmem  10 MB
faults  0

$ time fst-0.3.5 grep /m/sets/texts/wikipedia/titles.fst 'Bruce \w{12}'
Bruce Bairnsfather
Bruce Charlesworth
Bruce Herschensohn

real    2.160
user    2.152
sys     0.007
maxmem  27 MB
faults  0

(For the fst-0.3.5 case above, the code had to be manually patched to increase the regex size limit.)

The speed difference is a combination of the fact that fst-regex was always a naive prototype and an effort on my part to improve the compilation time of large Unicode character classes. Note that in the slow case above, almost all of the time is being spent compiling the regex. Searching was and still is fast.

FSTs generated by older versions of the fst crate can still be read by fst 0.4. However, FSTs generated by fst 0.4 cannot be read by older versions of the crate. That is, so far, the FST format is backwards compatible but it is not forwards compatible. Attempting to read an FST generated by fst 0.4 with an older version of the crate will result in an error.

15 Likes

Wow. Congrats and thank you. Really nice.

2 Likes

Is it possible to read in 0.4 and convert to the new format?

Sure. Iterate over all keys/values and build a new FST.

1 Like

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.