I've finally had some time to devote to the fst
crate. I've added a few features, did some performance improvements, made some small quality of life improvements to the API and eliminated all required dependencies. Its MSRV has been bumped to Rust 1.40.
For those that don't know, the fst
crate provides a data structure for building very large sets (or maps) of keys, where all keys must be &[u8]
(and all values must be u64
, in the case of maps). Once a set is built, it cannot be changed. Internally, the set uses a finite state machine which will compress both prefixes and suffixes of the keys in a way that permits the set to be searched very quickly without decompressing the entire thing. One of its marquee features is that sets can be both built and searched without loading all of the keys into memory. A common way to use these sets is to memory map them.
A common use case for FSTs is to represent parts of an inverted index. For example, Lucene uses FSTs for fast token lookups. FSTs can be used for a number of other things, for example, it powers fuzzy search in RLS and rust-analyzer.
Here is a high level list of things that come in the 0.4.0
release:
- The
Fst
,Set
, andMap
data types are now generic over anything that implementsAsRef<[u8]>
. This resolves some outstanding bugs where folks needed to search FSTs that were stored in a manner that older versions of the crate couldn't construct. - Because of the aforementioned change, there is no longer any reason for
fst
to depend on thememmap
crate directly. Using memory maps is of course still supported. - The
byteorder
dependency was also removed. The MSRV bump permits the crate to use std's APIs for converting between integers and bytes. This removal bringsfst
down to zero required dependencies (and zero dependencies by default). - The
fst-levenshtein
crate has been deprecated. Instead, one should enable thelevenshtein
feature and import it viafst::automaton::Levenshtein
. - The
fst-regex
crate has been deprecated. Instead, one should use theregex-automata
with thetransducer
feature enabled to search FSTs with regexes. This permits using more kinds of regexes and also brings massive multiple-orders-of-magnitude speed improvements. (See below.) - In some cases, one can now go in the reverse direction and find the key from the value in a map.
- FSTs generated with
fst 0.4.0
or newer can now be verified using a checksum. This verification is not performed by default and must be done explicitly.
For an example of how much faster regex-automata
is than the old fst-regex
crate:
$ time fst-0.4.0 grep /m/sets/texts/wikipedia/titles.fst 'Bruce \w{12}'
Bruce Bairnsfather
Bruce Charlesworth
Bruce Herschensohn
real 0.026
user 0.022
sys 0.003
maxmem 10 MB
faults 0
$ time fst-0.3.5 grep /m/sets/texts/wikipedia/titles.fst 'Bruce \w{12}'
Bruce Bairnsfather
Bruce Charlesworth
Bruce Herschensohn
real 2.160
user 2.152
sys 0.007
maxmem 27 MB
faults 0
(For the fst-0.3.5
case above, the code had to be manually patched to increase the regex size limit.)
The speed difference is a combination of the fact that fst-regex
was always a naive prototype and an effort on my part to improve the compilation time of large Unicode character classes. Note that in the slow case above, almost all of the time is being spent compiling the regex. Searching was and still is fast.
FSTs generated by older versions of the fst
crate can still be read by fst 0.4
. However, FSTs generated by fst 0.4
cannot be read by older versions of the crate. That is, so far, the FST format is backwards compatible but it is not forwards compatible. Attempting to read an FST generated by fst 0.4
with an older version of the crate will result in an error.