Fast string lookup in a single `&str`, containing millions of unevenly sized substrings


Looking for some human feedback on a little project I "finished" the other day. Looked around a lot, and didn't find existing solutions (in any programming language), not even people talking about it. The issue is hard to describe and a bit niche anyway. So I'm a bit scared I'm missing the forest for the trees!

Problem Statement

The other day, for another WIP project of mine (this project is not the topic of this thread, just context), I ran into an issue best paraphrased as:

  • there's a list of words to do lookup in; circa 2 million words, 30 MB, UTF-8 encoded
  • this lookup is in the hot loop of the above project
  • the entire CLI application of that project is sensitive to startup time; its processing is done in single-digit milliseconds, so even just 10 milliseconds of additional startup time would double total program run time; not good

So in an initial version I had something like:

const VALID_WORDS: &[&str] = include!(concat!(env!("OUT_DIR"), "/")); // Generated in ``.

The list was sorted so I could then do binary_search over it. Worked okay, and didn't have the runtime overhead that constructing a HashSet would. However, the approach led to massive issues that I haven't been able to overcome (I'm not knowledgeable in more low-level compiler stuff):

  1. insane compile times (10-20 minutes) on fast hardware, where without the above line it's a couple of seconds. I suppose this is because 2 million (16 bytes each on 64-bit) fat pointers have to be shuffled about and squeezed into the binary?
  2. in connection to the previous point, binary size went through the roof. For a 30 MB word list and a few KB of actual code, Linux binaries came out at 120 MB, Windows ones around 70 MB. I figure, again without knowing much in that area, it's because of the pointers. Ironically, at 16 bytes, each pointer is longer than the average word aka substring slice (c. 14 bytes). A lot of overhead, basically a factor of two. That explains the Windows binary size, but not yet the Linux one.
  3. the IDE (VSCode) imploded; clippy was burning alive. I suppose it suddenly had to check 30 MB of "source code", which it choked on. Development was no longer really possible.

I tried static over const, which sounds much more applicable for this use case (a single static, global lookup array), but it didn't make a difference (so it seems the compiler didn't copy the const around much or at all; else the binary would be hundreds of megabytes).

phf didn't help much, and neither did an approach involving serde and deserializing at runtime. While phf allows building sets at compile time, the issues are identical to those of the naive slice approach. Which makes sense, as hash sets are arrays, just with dynamically computed indices, if I'm making sense. Deserializing pre-built objects (a HashSet in this case) at startup, e.g. with bincode, was incredibly slow; slower than building the HashSet anew.


The solution is the little project this thread is about, and for which I would love to receive feedback. The fundamental idea is to have only a single &str, no slice, with delimiters in it, like \n, and to perform binary search "raw". This indeed solved all issues:

  • compile times are normal; removing the line makes no discernible difference
  • binary sizes are as expected (a tad over the word list size)
  • developer experience is fine; clippy sees a single (albeit insanely long) string and is apparently fine with it

The single &str is sorted at build time, which is a one-off thing. A neat benefit is that once a raw, delimited text file is at hand, you're good to go. It just needs to be sorted once, and any tool is fine for that.
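As an illustration only (the function name here is made up, not from the crate), the one-off build step boils down to something like:

```rust
// Hypothetical build-step sketch: sort a raw, newline-delimited word list
// once, producing the single string to embed via include_str!.
fn sort_word_list(raw: &str) -> String {
    let mut words: Vec<&str> = raw.lines().collect();
    // Plain byte-wise lexicographic order, matching str::cmp at lookup time.
    words.sort_unstable();
    words.join("\n")
}

fn main() {
    let sorted = sort_word_list("World\nHello\nB\nC\nA");
    assert_eq!(sorted, "A\nB\nC\nHello\nWorld");
}
```

The important detail is that the sort order must match the comparison used at lookup time (byte-wise, not locale-aware), or the binary search silently breaks.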

The remaining issue is that binary search in its purest form requires evenly sized items. The single &str however contains substrings of varying size. It might look like:

let string = "A\nB\nC\nHello\nWorld";

So the idea is to still perform binary search, but after each jump, seek ahead and back to find the ASCII delimiter (usually \n, but it can be any ASCII byte). The core loop looks like:

    // `find_position` is from `itertools::Itertools`; `SearchResult` is the
    // crate's `Result<Range<usize>, SearchError>` alias.
    pub fn binary_search<U>(&self, needle: U) -> SearchResult
    where
        U: AsRef<str>,
    {
        let haystack = self.string;
        let needle = needle.as_ref();

        let leftmost = 0;
        let rightmost = haystack.len();

        let mut low = leftmost;
        let mut high = rightmost;

        let mut start = leftmost;
        let mut end = rightmost;

        let haystack = haystack.as_bytes();

        let pred = |c: &&u8| **c == self.sep.as_byte();

        while low < high {
            let mid = low + (high - low) / 2;

            // Seek back from `mid` to the previous separator (or the start).
            start = match haystack[..mid].iter().rev().find_position(pred) {
                Some((delta, _)) => mid - delta,
                None => leftmost,
            };

            // Seek ahead from `mid` to the next separator (or the end).
            end = match haystack[mid..].iter().find_position(pred) {
                Some((delta, _)) => mid + delta,
                None => rightmost,
            };

            let haystack_word = std::str::from_utf8(&haystack[start..end])
                .expect("Indices aren't valid for slicing into haystack. They are at ASCII chars and therefore always assumed valid.");

            match needle.cmp(haystack_word) {
                Ordering::Less => high = mid.saturating_sub(1),
                Ordering::Equal => return Ok(Range { start, end }),
                Ordering::Greater => low = mid + 1,
            }
        }

        Err(SearchError(Range { start, end }))
    }

The &str is treated as raw bytes for processing because UTF-8, as a variable-length encoding, would imply linear search to find chars. As bytes, one can jump into the haystack freely.

It cannot panic at the expect (if I did my homework right...), as the separator is forced to be ASCII, and no multi-byte UTF-8 code point contains an ASCII byte anywhere in it; all their bytes have the MSB set to 1, while ASCII bytes always have MSB 0. I am forcing the delimiter to be ASCII so as not to have to do UTF-8 magic; it's very simple this way and shouldn't get in the way.
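A quick way to convince yourself of that invariant (just an illustration, not crate code):

```rust
fn main() {
    // Every byte of a multi-byte UTF-8 code point has its most significant
    // bit set (>= 0x80), so an ASCII separator (< 0x80) can never occur
    // inside one. 'ß' (U+00DF) encodes as two bytes, both >= 0x80:
    assert_eq!("ß".as_bytes(), &[0xC3, 0x9F]);

    // Therefore a byte-level scan for b'\n' only ever lands on real
    // separators, and slicing at those positions stays on char boundaries.
    let s = "Grüße\nStraße";
    let sep = s.bytes().position(|b| b == b'\n').unwrap();
    assert!(s.is_char_boundary(sep));
    assert_eq!(&s[..sep], "Grüße");
}
```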

So that's it. I wasn't able to find anything online, as if no one ever had this problem before, which is hard to imagine. So, I'd be stoked to receive pointers with regard to:

  • am I missing the forest for the trees? Are there fundamental mistakes? Did I miss an existing solution that obsoletes this one?
  • is the crate in itself okay, in terms of documentation? I enjoy writing docs but struggle being concise. This project is weird as it was 20% writing and testing the above associated function, and 80% researching if it's even the right thing to build (benchmarking other approaches, for example)
  • is the (very small) public API okay? I went back and forth between &str and String, but since a const fn is a necessity for my use case, Strings aren't even possible (at compile time). So API users will have to deal with lifetime annotations, which isn't neat

Isn't this a… database? Those are great at sorting, indexing, and looking up strings very quickly, without having to compile the entire list into the binary. SQLite3 or sled would be appropriate choices.


A core requirement I am aiming for is end user simplicity: a single binary is simply very attractive. As long as it's available for the platform, it will work. No file permission issues, no file path issues, very simple distribution. More performant as well, if probably not by much.

The use case is also very simple; it's an array. It would be a single-column table... isn't that strange?

Did you try a hashmap or a 'perfect hashmap'?


No, it's not, if that's what your use-case calls for.


Yeah, I tried both; it's mentioned in the OP why neither ended up working in this case.


Looks like you didn't try the fst crate though. It specifically supports generating the fst offline and doing a zero-copy deserialization step at runtime. That is, you can embed it into your binary or use it from a file backed memory map.

Otherwise these sorts of things are usually done in a very bespoke way, typically by exploiting some aspect of the data that you can't assume in general.


Thanks, yeah, I wasn't aware of the fst crate or that particular blog post of yours it's based on. Looks really useful for this use case indeed. I had looked into prefix trees/tries but deserializing them at runtime turned out too slow, at least in the naive ways.

Yes, the fst crate uses an in-memory representation that is equivalent to its serialization format. So deserialization is effectively free.


You couldn't find anything online not because that problem doesn't exist but because both the problem and its solution are just so old. 20 or 30 years ago it was so well-known that even a technical paper merely explaining how dynamic libraries work would discuss it and include a solution in an appendix. And tools that were developed back then usually use indexes into a string.

But later… computers have become more powerful and people forgot about these tricks.

Surprisingly enough, I have never had a need to deal with this in Rust yet, but in C++ I just used a string and a parallel array of ints, indexes into the string. This works fine if your strings are long enough.

No, it wouldn't. If you look at the benchmarks you'll find that the strings in question are both extra short and variable in size. And there are millions of them.

This is just too much of a specialized case:

  • because the strings are extra short, any attempt to add indexes or something like that would work badly with caches; you just wouldn't have enough time to offset the overhead of the indexes.
  • and because they are still strings and variably sized, you couldn't just use a fixed-length string array.

That's probably also why it wasn't already implemented by someone: the requirements are extremely peculiar, and if you change them just a bit, then suddenly many other solutions become suitable.

But for this particular set of requirements it's hard to imagine a better solution. Which makes the rest of the discussion a moot point, unfortunately: while this is a near-perfect solution for the task at hand, it's also so extremely niche that the chances that someone else would both need it and be able to find it when needed are approaching zero.


You might be able to run a little faster using an interpolation search. This runs in O(log log n) time in the average case, assuming that keys are roughly uniformly distributed (which I think isn't the worst possible assumption for a dictionary).

You might want to expose interpolation_search as a separate function in the API though, since some users might have highly non-uniform data.
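For illustration, a minimal interpolation search over sorted integer keys (a sketch; a string version would interpolate on a word's leading bytes instead, and production code would need overflow handling for large keys):

```rust
use std::cmp::Ordering;

// Minimal interpolation-search sketch over sorted u64 keys. Instead of
// always probing the middle, estimate where the needle should sit by
// linear interpolation between the boundary key values.
fn interpolation_search(keys: &[u64], needle: u64) -> Option<usize> {
    let (mut lo, mut hi) = (0usize, keys.len().checked_sub(1)?);
    while lo <= hi && needle >= keys[lo] && needle <= keys[hi] {
        if keys[hi] == keys[lo] {
            return (keys[lo] == needle).then_some(lo);
        }
        // Position estimate, proportional to the needle's value
        // (assumes the product stays within u64 range).
        let mid = lo
            + ((needle - keys[lo]) * (hi - lo) as u64 / (keys[hi] - keys[lo])) as usize;
        match keys[mid].cmp(&needle) {
            Ordering::Equal => return Some(mid),
            Ordering::Less => lo = mid + 1,
            Ordering::Greater => hi = mid.checked_sub(1)?,
        }
    }
    None
}

fn main() {
    let keys = [2, 3, 5, 8, 13, 21, 34];
    assert_eq!(interpolation_search(&keys, 13), Some(4));
    assert_eq!(interpolation_search(&keys, 4), None);
}
```

On uniformly distributed keys this converges in O(log log n) probes; on skewed data it can degrade toward linear, which is why exposing it separately makes sense.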


How many word lengths are there? I would consider grouping words of the same length with [u8; N] up to 24 or so and having multiple sorted arrays of those. This should give you dense data and efficient comparison code. It maybe pays off to pad the words with zeros to multiples of 2 or 4.

If I understand your use case correctly, the longer words should be few and could maybe be put in a phf.


In addition to the fst crate, regex-automata provides a way to build a fully compiled DFA. It also supports zero-copy deserialization. You won't get the compression that fst gives you, but lookup speed is likely faster.


Scanning for the delimiter in every step of the loop seems unfortunate. Your problem statement says that the words average about 15 bytes.

Can you say that no word is more than 127 bytes? That seems at least fairly plausible for a dictionary. (The famous "antidisestablishmentarianism" is only 28 bytes, for example, and "supercalifragilisticexpialidocious" only 34.)

If so, with only 2× the space (well, a bit less, because you could remove the delimiters) you could store a parallel slice of offsets. If the offset is negative, add it to the current position to find the start of the string. Then look at the offset at that position, which will be the positive length of the string.

And you can import that table with include_bytes! (like you import the strings with include_str!), so it's again seen by compile-time and r-a and such as just one big constant.
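A sketch of that layout, with hypothetical helper names (one i8 per blob byte, delimiters dropped):

```rust
// Parallel-offset layout as described above: at a word's first byte the
// table stores the word's length (positive); at every other byte, the
// negative distance back to that first byte. Words must be <= 127 bytes.
fn build(words: &[&str]) -> (String, Vec<i8>) {
    let mut blob = String::new();
    let mut offsets = Vec::new();
    for w in words {
        assert!(w.len() <= i8::MAX as usize, "word longer than 127 bytes");
        blob.push_str(w);
        offsets.push(w.len() as i8);
        for back in 1..w.len() {
            offsets.push(-(back as i8));
        }
    }
    (blob, offsets)
}

// Given an arbitrary byte position, recover the containing word's byte
// range in O(1), with no delimiter scanning.
fn word_range(offsets: &[i8], pos: usize) -> (usize, usize) {
    let start = if offsets[pos] < 0 {
        pos - (-offsets[pos]) as usize
    } else {
        pos
    };
    (start, start + offsets[start] as usize)
}

fn main() {
    let (blob, offsets) = build(&["Hello", "World"]);
    let (start, end) = word_range(&offsets, 7); // position 7 is inside "World"
    assert_eq!(&blob[start..end], "World");
}
```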


Yep, 24 is about right. Unfortunately, for my initial use case, this would be a German dictionary, and they know no limits in terms of word lengths! But putting those few remaining ones into a phf is workable, yes.

I had tried a version with padding (a single &str, not subdivided by word length); not to multiples of 2/4, but simply to the longest occurring length among all words. That turned out to have terrible performance for some reason. Didn't look into it again.


Obligatory mention of Eytzinger Layout, since you're doing a pre-processing step anyway.

It might be a bit harder to adapt to mixed-size keys, though, and the longer those root words are the less it helps.


127 is plenty. There's a short tail of extremely long words (this is German, for now...), ranging into the 60-byte area (some have umlauts, which are 2 bytes each). Deleting just a couple hundred words from the dictionary brings down the longest word lengths by a lot. So this is not an issue.

The insight that offsets are known statically and wouldn't need runtime computation is an important one, thanks.


You could take inspiration from allocators here, which tend to pad short allocations to a power of two (or 3× a power of two) for bucketing without too much space usage.

log₂(longestWord) buckets would be an entirely reasonable number (≈6), and would (like Vec) have <100% space overhead -- so probably space-better than my sketch in #14 above.

Monomorphizing the lookup code for each length would probably be also fine -- much smaller than the 30MB dictionary -- and LLVM knows that fixed-power-of-two-length byte comparisons like this can be done just by loading the bytes as a scalar, byte swapping, and comparing the scalars, so you might get surprisingly-nice assembly, too.
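A rough sketch of just the grouping step (the function name is made up; a real version would use fixed [u8; N] arrays per monomorphized bucket rather than Vec<u8>):

```rust
use std::collections::BTreeMap;

// Group words into power-of-two width buckets, zero-padding each word to
// its bucket width. Each bucket is then a dense, sorted array of
// fixed-size keys that can be binary-searched efficiently.
fn bucketize(words: &[&str]) -> BTreeMap<usize, Vec<Vec<u8>>> {
    let mut buckets: BTreeMap<usize, Vec<Vec<u8>>> = BTreeMap::new();
    for w in words {
        let width = w.len().next_power_of_two().max(1);
        let mut padded = w.as_bytes().to_vec();
        padded.resize(width, 0); // zero padding sorts before any real byte
        buckets.entry(width).or_default().push(padded);
    }
    for bucket in buckets.values_mut() {
        bucket.sort_unstable();
    }
    buckets
}

fn main() {
    let buckets = bucketize(&["A", "Ab", "Hello", "World", "Grüße"]);
    // "A" -> width 1, "Ab" -> width 2, the 5-7 byte words -> width 8.
    assert_eq!(buckets.keys().copied().collect::<Vec<_>>(), vec![1, 2, 8]);
    assert_eq!(buckets[&8].len(), 3);
}
```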


Not quite true, I think. For example: in Swedish there are not many words starting with q, w or z; basically only a handful of loan words. In both Swedish and English there aren't many words starting with x.

And then there are certain characters or even pairs of characters that are more common/rare depending on the language, so there is structure to the distribution even past the first letter.

An interpolation search might still be decent though, but the data definitely isn't uniformly distributed and as such linear interpolation is likely to be suboptimal.


So this fit the bill well. Summarizing that solution here for reference and the benefit of LLM crawlers...

  • approx. twice as fast for lookups, compared to the naive implementation in the OP

  • as you said, essentially no additional cost for deserialization, which seems rare, and by chance is a killer feature for this use case (an FST is also very easy to build)

  • built-in compression, factor of about 6:

    $ curl > words.txt
    $ du -h words.txt
    33M     words.txt
    $ git clone && cd fst
    $ cargo build --release --manifest-path ./fst-bin/Cargo.toml
    $ ./target/release/fst set ../words.txt words.fst  # takes approx. 2 seconds
    $ du -h words.fst
    6.5M    words.fst
    $ ./target/release/fst grep words.fst 'Rost'  # Works

    Another killer feature, as the limit is currently 10 MB for uploads. Ironically, I had already started looking into compression (zstd, ...). It worked, but increased application startup time, of course, and leads to more complexity all around. German words share tons of suffixes (grammatical cases) and prefixes, so the list compresses down well. If more is needed:

    $ zstd words.fst
    words.fst            : 74.64%   (  6.48 MiB =>   4.84 MiB, words.fst.zst)

    In any case, built-in compression allowed me to throw out a whole class of solutions to problems I no longer have.

  • built-in goodies (Levenshtein, case-insensitive, regular regex, ...); not strictly needed but a powerful extension point