This is perhaps a bit of scope creep, but I think it fits well. In my quest to reorganize regex internals, I wanted the substring search used by regex to be available a bit lower in the dependency DAG. This also lets others use the same fast substring search that regex uses without pulling in the regex crate itself.
I started on this quest a few months ago when the Cloudflare folks released the sliceslice crate, which is a near-direct port of the "Generic SIMD" algorithm. I noticed that it performed quite well, and did quite a bit better than regex's substring search in a lot of cases.
So what I worked on was taking the key secret sauce from regex (using a background distribution of byte frequencies to pick "rare" bytes for finding candidate matches) and combining it with the "Generic SIMD" algorithm (using SIMD to quickly look for aligned matches of two bytes), while maintaining constant space and additive O(m + n) time complexity guarantees. I also worked hard to reduce latency when searching small haystacks, since constructing some of the faster searchers can take a bit more time. (And the vectorized routines generally want a minimum haystack size anyway.)
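To make the rare-byte idea concrete, here's a minimal sketch of picking the two rarest bytes of a needle from a background frequency table. Everything here is illustrative: the `rank` table is a toy stand-in (a real implementation would use a 256-entry table tuned on a corpus of real-world text), and `rarest_two` is a hypothetical helper, not the actual internals.

```rust
// Toy background rank: lower = rarer. A real implementation would use a
// full 256-entry table derived from byte frequencies in real-world text.
fn rank(b: u8) -> u8 {
    match b {
        // Very common bytes in English text.
        b' ' | b'e' | b't' | b'a' | b'o' | b'i' | b'n' => 255,
        // Other alphanumerics: common, but less so.
        b'a'..=b'z' | b'A'..=b'Z' | b'0'..=b'9' => 200,
        // Everything else (punctuation, control bytes, ...): rare.
        _ => 10,
    }
}

// Return the offsets of the two rarest bytes in `needle` (hypothetical
// helper). These two offsets are what the vectorized candidate search
// would look for.
fn rarest_two(needle: &[u8]) -> (usize, usize) {
    assert!(!needle.is_empty());
    let mut idx: Vec<usize> = (0..needle.len()).collect();
    // Stable sort: ties keep their original (leftmost) order.
    idx.sort_by_key(|&i| rank(needle[i]));
    (idx[0], *idx.get(1).unwrap_or(&idx[0]))
}

fn main() {
    // '!' (rank 10) is rarest, at offset 8; 'S' is the leftmost of the
    // rank-200 ties, at offset 0.
    let (r1, r2) = rarest_two(b"Sherlock!");
    println!("rare offsets: {} {}", r1, r2);
}
```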
During development, I compared my implementation against glibc's
memmem implementation, along with the substring implementations found in
bstr. Generally speaking, the one I worked on does better than all of them in a very large number of cases. (Although, with sliceslice in particular, the comparison is less clear-cut. See this discussion for more details.) Of course, there are (and always will be) some cases where some implementations are better than others. You can take a look at the benchmarks here. The benchmarks specifically include pathological cases targeting my own implementation; such cases are ultimately unavoidable. The key is making sure the pathological cases still fit the time complexity guarantees and are, in practice, not too slow.
In anticipation of the question "why not put this in std?": the answer is the same as for the recently released simdutf8 crate. Rust's core library isn't quite set up to integrate SIMD yet. (This is also why a fast vectorized implementation of memchr hasn't made its way into core either, despite existing for a couple of years now.)
As a concrete example of where this improves things, we can use ripgrep, which now uses this new implementation on master. Here's one such case:
$ time rg-pre-memmem 'strengths' OpenSubtitles2018.raw.en -c
4297

real    3.602
user    3.115
sys     0.483
maxmem  12510 MB
faults  0

$ time rg 'strengths' OpenSubtitles2018.raw.en -c
4297

real    1.514
user    1.108
sys     0.403
maxmem  12510 MB
faults  0
The key here is that the literal strengths consists entirely of bytes that are pretty common in the corpus. This is where the previous algorithm used by the regex crate did poorly. Namely, it worked by feeding one of the bytes to memchr. Once a match was found, memchr returned and regex checked whether a second byte matched at the expected position. If it did, a full memcmp was run to confirm the match. The main problem is that the code spends a lot of time popping in and out of memchr. With this new implementation based on the "Generic SIMD" algorithm, the check on the second byte is part of the vectorized loop itself. The speedup therefore comes, very simply, from spending more time in the vectorized loop.
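For contrast, here is a scalar sketch of the old strategy just described: find a candidate via one rare byte (a plain loop stands in for the vectorized memchr), do the cheap second-byte check, and only then run the full comparison. All names and the rare-byte offsets are illustrative, not the regex crate's actual internals.

```rust
// Sketch of the pre-memmem strategy: `rare1` and `rare2` are offsets
// into `needle` of two presumed-rare bytes (both must be < needle.len()).
fn old_style_find(
    haystack: &[u8],
    needle: &[u8],
    rare1: usize,
    rare2: usize,
) -> Option<usize> {
    let b1 = needle[rare1];
    let b2 = needle[rare2];
    let mut at = 0;
    while at + needle.len() <= haystack.len() {
        // Stand-in for memchr: scan for the first rare byte. Every
        // return from this scan is the "popping in and out" overhead
        // described above.
        let found = haystack[at + rare1..haystack.len() - needle.len() + rare1 + 1]
            .iter()
            .position(|&b| b == b1)?;
        let start = at + found;
        // Cheap second-byte check, then a full memcmp-style comparison
        // to confirm the candidate.
        if haystack[start + rare2] == b2
            && &haystack[start..start + needle.len()] == needle
        {
            return Some(start);
        }
        at = start + 1;
    }
    None
}

fn main() {
    let hay = b"the strengths of strings";
    // Use 's' (offset 0) and 'r' (offset 2) as the two "rare" bytes.
    let pos = old_style_find(hay, b"strengths", 0, 2);
    println!("{:?}", pos); // the match starts at offset 4
}
```

The new implementation folds that second-byte check into the vectorized loop itself, so candidates are rejected without ever leaving it.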