Hi!
Looking for some human feedback on a little project I "finished" the other day. Looked around a lot, and didn't find existing solutions (in any programming language), not even people talking about it. The issue is hard to describe and a bit niche anyway. So I'm a bit scared I'm missing the forest for the trees!
## Problem Statement
The other day, for another WIP project of mine (this project is not the topic of this thread, just context), I ran into an issue best paraphrased as:
- there's a list of words to do lookup in; circa 2 million words, 30 MB, UTF-8 encoded
- this lookup is in the hot loop of the above project
- the entire CLI application of that project is sensitive to startup time; its processing is done in single-digit milliseconds, so even just 10 milliseconds of additional startup time would double total program run time; not good
So in an initial version I had something like:
```rust
const VALID_WORDS: &[&str] = include!(concat!(env!("OUT_DIR"), "/words.in")); // Generated in `build.rs`.
```
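In sketch form, the `build.rs` side of this looked roughly like the following (file names and paths are illustrative, not the exact ones from the project):

```rust
// build.rs -- sketch of generating a sorted `&[&str]` literal for `include!`.
use std::{env, fs, path::Path};

/// Turn a word list into a Rust slice literal like `&["a", "b"]`.
fn to_slice_literal(words: &[&str]) -> String {
    let items: Vec<String> = words.iter().map(|w| format!("{w:?}")).collect();
    format!("&[{}]", items.join(", "))
}

fn main() {
    // `unwrap_or_default` only so the sketch doesn't panic outside a build.
    let raw = fs::read_to_string("words.txt").unwrap_or_default();
    let mut words: Vec<&str> = raw.lines().collect();
    words.sort_unstable(); // byte-wise order, matching `str`'s `Ord`

    if let Ok(out_dir) = env::var("OUT_DIR") {
        fs::write(Path::new(&out_dir).join("words.in"), to_slice_literal(&words)).unwrap();
    }
    println!("cargo:rerun-if-changed=words.txt");
}
```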
The list was sorted in `build.rs`. I could then do `binary_search` over it. Worked okay, and didn't have the runtime overhead constructing a `HashSet` would. However, the approach led to massive issues that I haven't been able to overcome (I'm not knowledgeable in more low-level compiler stuff):
- insane compile times (10-20 minutes) on fast hardware, where without the above line it's a couple of seconds. I suppose this is because 2 million fat pointers (16 bytes each on 64-bit) have to be shuffled about and squeezed into the binary?
- in connection with the previous point, binary size went through the roof: for a 30 MB word list and a few KB of actual code, Linux binaries came out at 120 MB, Windows around 70 MB. Again, I figure (without knowing much in that area) it's because of the pointers. Ironically, at 16 bytes, each pointer is larger than the average word (aka substring slice) it points to, which is about 14 bytes. A lot of overhead, basically a factor of two. That explains the Windows binary size, but not yet the Linux one.
- the IDE (VSCode) imploded; `clippy` was burning alive. I suppose it suddenly had to check 30 MB of "source code", which it choked on. Development was no longer really possible.
I tried `static` over `const`, which sounds much more applicable for this use case (a single `static`, global lookup array), but didn't find it to make a difference (so it seems like the compiler didn't copy the `const` around much or at all; else the binary could be hundreds of megabytes).
`phf` didn't help much, and neither did an approach involving `serde` and deserializing at runtime. While `phf` allows sets at compile time, the issues are identical to the naive slice approach. Which makes sense, as hash sets are arrays, just with dynamically computed indices, if I'm making sense. Deserializing pre-built objects (a hash set in this case) at startup, e.g. with `bincode`, was incredibly slow; slower than building a `HashSet` anew.
## Solution
The solution is the little project this thread is about, and for which I would love to receive feedback. The fundamental idea is to have only a single `&str`, no slice, with delimiters inside it, like `\n`, and to perform binary search "raw". This indeed solved all issues:
- compile times are normal; removing the line makes no discernible difference
- binary sizes are as expected (a tad over the word list size)
- developer experience is fine; `clippy` sees a single (albeit insanely long) string and is apparently fine with it
The single `&str` is sorted at build time, again via `build.rs`, which is a one-off thing. A neat benefit is that once a raw, delimited text file is at hand, you're good to go. It just needs to be sorted once, for which any tool is fine.
The remaining issue is that binary search in its purest form requires evenly sized items. The single `&str`, however, contains substrings of varying size. It might look like:

```rust
let string = "A\nB\nC\nHello\nWorld";
```
So the idea is to still perform binary search, but after each jump, seek back and ahead to find the ASCII delimiters (usually `\n`, but it can be any ASCII character) enclosing the current word. So the core loop looks like:
```rust
use std::cmp::Ordering;
use std::ops::Range;

use itertools::Itertools; // for `find_position`

// Method on the crate's wrapper type; `SearchResult` is its
// `Result<Range<usize>, SearchError>`.
pub fn binary_search<U>(&self, needle: U) -> SearchResult
where
    U: AsRef<str>,
{
    let haystack = self.string;
    let needle = needle.as_ref();

    let leftmost = 0;
    let rightmost = haystack.len();

    let mut low = leftmost;
    let mut high = rightmost;

    let mut start = leftmost;
    let mut end = rightmost;

    let haystack = haystack.as_bytes();
    let pred = |c: &&u8| **c == self.sep.as_byte();

    while low < high {
        let mid = low + (high - low) / 2;

        // Seek back from `mid` to the previous separator (or the very start).
        start = match haystack[..mid].iter().rev().find_position(pred) {
            Some((delta, _)) => mid - delta,
            None => leftmost,
        };

        // Seek ahead from `mid` to the next separator (or the very end).
        end = match haystack[mid..].iter().find_position(pred) {
            Some((delta, _)) => mid + delta,
            None => rightmost,
        };

        let haystack_word = std::str::from_utf8(&haystack[start..end])
            .expect("Indices aren't valid for slicing into haystack. They are at ASCII chars and therefore always assumed valid.");

        match needle.cmp(haystack_word) {
            Ordering::Less => high = mid.saturating_sub(1),
            Ordering::Equal => return Ok(Range { start, end }),
            Ordering::Greater => low = mid + 1,
        }
    }

    Err(SearchError(Range { start, end }))
}
```
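To make the idea concrete outside the crate, here is a simplified, standalone sketch of the same raw binary search: a free function with `\n` hard-coded and a plain `Option` return instead of the crate's `SearchResult`.

```rust
use std::ops::Range;

/// Binary-search a sorted, `\n`-delimited string for `needle`.
/// Returns the byte range of the matching word, if any.
fn raw_binary_search(haystack: &str, needle: &str) -> Option<Range<usize>> {
    let bytes = haystack.as_bytes();
    let (mut low, mut high) = (0usize, bytes.len());
    while low < high {
        let mid = low + (high - low) / 2;
        // Seek back/ahead from `mid` to the enclosing word's boundaries.
        let start = bytes[..mid]
            .iter()
            .rposition(|&b| b == b'\n')
            .map_or(0, |i| i + 1);
        let end = bytes[mid..]
            .iter()
            .position(|&b| b == b'\n')
            .map_or(bytes.len(), |i| mid + i);
        // Safe: `start` and `end` sit on ASCII `\n` bytes (or string ends),
        // which are always `char` boundaries.
        let word = &haystack[start..end];
        match needle.cmp(word) {
            std::cmp::Ordering::Equal => return Some(start..end),
            std::cmp::Ordering::Less => high = start.saturating_sub(1),
            std::cmp::Ordering::Greater => low = end + 1,
        }
    }
    None
}

fn main() {
    let words = "A\nB\nC\nHello\nWorld";
    assert_eq!(raw_binary_search(words, "C"), Some(4..5));
    assert_eq!(raw_binary_search(words, "nope"), None);
}
```

Unlike the crate's version, this one jumps straight past the enclosing word's boundaries (`high = start - 1`, `low = end + 1`), which avoids re-landing inside a word that was already compared.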
The `&str` is treated as raw bytes for processing because UTF-8, as a variable-length encoding, would imply linear search to find `char` boundaries. As bytes, one can jump into the haystack freely.
It cannot panic at the `expect` (if I did my homework right...), as the separator is forced to be ASCII, and no multi-byte UTF-8 code point contains an ASCII byte anywhere in it: every byte of a multi-byte sequence has its MSB set to `1`, while an ASCII byte's MSB is always `0`. I am forcing the delimiter to be ASCII so as not to have to do UTF-8 magic; it's very simple this way and shouldn't get in the way.
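That invariant can be sanity-checked directly, independent of the crate: scanning raw bytes for an ASCII delimiter can never land inside a multi-byte code point, so every found position is a valid slice boundary.

```rust
/// Positions of the ASCII `\n` delimiter, found by scanning raw bytes.
/// Safe because every byte of a multi-byte UTF-8 sequence is >= 0x80.
fn delimiter_positions(s: &str) -> Vec<usize> {
    s.bytes()
        .enumerate()
        .filter(|&(_, b)| b == b'\n')
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    // Multi-byte code points (2, 3, and 4 bytes) around ASCII `\n` delimiters.
    let s = "ä\n本\n🦀";
    let delims = delimiter_positions(s);
    // Every found position is a `char` boundary, so slicing there is valid.
    assert!(delims.iter().all(|&i| s.is_char_boundary(i)));
    assert_eq!(delims, vec![2, 6]);
}
```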
So that's it. I wasn't able to find anything online, as if no one ever had this problem before, which is hard to imagine. So, I'd be stoked to receive pointers regarding:
- am I missing the forest for the trees? Are there fundamental mistakes? Did I miss an existing solution that obsoletes the above one?
- is the crate in itself okay, in terms of documentation? I enjoy writing docs but struggle being concise. This project is weird in that it was 20% writing and testing the above associated function, and 80% researching whether it's even the right thing to build (benchmarking other approaches, for example)
- is the (very small) public API okay? I went back and forth between `&str` and `String`, but since a `const fn` is a necessity for my use case, `String`s aren't even possible (at compile time). So API users will have to deal with lifetime annotations, which isn't neat