I’d like to nominate @BurntSushi’s FST crate .
This provides a very compact in-memory representation of maps, in the same ballpark as gzip, but queryable!
The compact representation makes analyses possible that would otherwise explode the RAM of even the biggest compute nodes.
In Andrew’s typical style, the crate, the theory behind it, and the design trade-offs are extremely well documented via blogpost. After reading it, I felt like I had learned something new about theoretical computer science, and had a good feeling for when to use the crate.
My department is currently using the crate to analyse genome regulation, by identifying the counts of all possible 16-mers in the human genome, so 4^16 = 4.294.967.296 keys. Our algorithm was impossibly slow before we encountered FST, producing hundreds of gigabytes of temp-files that had to be processed line-by-line.
Thanks to FST we can do unbiased analysis into what makes humans tick: how our genomes are regulated.
I’ll add that we’re using
fst via it’s python bindings, by Johannes Baiter. Johannes was supportive and quickly responded to our questions. (He works at the digitisation centre of Bavarian State Library, so I’m very curious what use-case he found for
I include this tidbit to demonstrate that there is already a (minute) user-community around
fst, and that it is (apparently) broadly applicable.
All in all:
fst is a crate that, through its great implementation, excellent documentation and great API, makes high-performance data structures available to a new audience, enabling real research and bringing humanity forward.
Andrew Gallant / @BurntSushi
(p.s. also a little for @ehiggs, who introduced me to it)