memchr 2.4 now has an implementation of memmem

This is perhaps a bit of scope creep, but I think it fits well. In my quest to reorganize the regex internals, I wanted the substring search used by regex to be available a bit lower in the dependency DAG. This also lets others use the same fast substring search that regex uses without pulling in the regex crate itself.

See the new docs for memchr for more details. The memmem implementation is in the memmem submodule.

I started on this quest a few months ago when the Cloudflare folks released the sliceslice crate, which is a near-direct port of the "Generic SIMD" algorithm. I noticed that it performed quite well, and did quite a bit better than regex's substring search in a lot of cases.

So what I worked on was taking the key secret sauce from regex (using a background distribution of byte frequencies to pick "rare" bytes to look for candidate matches) and combining it with the "Generic SIMD" algorithm (using SIMD to quickly look for aligned matches of two bytes), while also maintaining constant space and additive, O(m + n), time complexity guarantees. I also worked hard to reduce latency when searching small haystacks, since constructing some of the faster searchers can take a bit more time. (And the vectorized routines generally want a minimum haystack size anyway.)
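A scalar sketch of that rare-byte candidate idea (the real implementation does the two-byte check inside a SIMD loop and has the O(m + n) guarantee; this toy version is quadratic in the worst case, and `toy_rank` is a made-up stand-in for the background byte-frequency table):

```rust
// Toy stand-in for a background byte-frequency table: pretend
// letters and spaces are common and everything else is rare.
fn toy_rank(b: u8) -> u8 {
    match b {
        b'a'..=b'z' | b'A'..=b'Z' | b' ' => 255,
        _ => b,
    }
}

/// Find `needle` in `haystack` by checking its two rarest bytes at
/// each position, and only running a full comparison on candidates.
fn find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        return Some(0);
    }
    if needle.len() > haystack.len() {
        return None;
    }
    // Pick the offsets of the two rarest bytes in the needle.
    let mut idx: Vec<usize> = (0..needle.len()).collect();
    idx.sort_by_key(|&i| toy_rank(needle[i]));
    let (r1, r2) = (idx[0], *idx.get(1).unwrap_or(&idx[0]));
    let (b1, b2) = (needle[r1], needle[r2]);
    for start in 0..=(haystack.len() - needle.len()) {
        // Cheap candidate check on the two rare bytes first.
        if haystack[start + r1] == b1 && haystack[start + r2] == b2 {
            // Only now do the full (potentially expensive) comparison.
            if &haystack[start..start + needle.len()] == needle {
                return Some(start);
            }
        }
    }
    None
}

fn main() {
    let hay = b"xyz strengths abc";
    assert_eq!(find(hay, b"strengths"), Some(4));
    assert_eq!(find(hay, b"absent"), None);
}
```

The point of ranking bytes by rarity is that candidate checks on rare bytes almost never fire spuriously, so the full comparison runs rarely.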

During development, I compared my implementation against glibc's memmem, along with the substring search implementations found in std, twoway, sliceslice, regex and bstr. Generally speaking, the one I worked on does better than all of them in a very large number of cases. (Although with sliceslice in particular, this is perhaps not so clear cut. See this discussion for details.) Of course, there are (and always will be) some cases where some implementations are better than others. You can take a look at the benchmarks here. The benchmarks specifically include pathological cases targeting my own implementation. These are ultimately unavoidable. The key is making sure the pathological cases still fit the time complexity guarantees and are, in practice, not too slow.

In anticipation of the question, "why not put this in std?": the answer is the same as for the recently released simdutf8 crate. Rust's core library isn't quite set up to integrate SIMD yet. (This is also why a fast vectorized implementation of memchr hasn't made its way into core, despite existing for a couple of years now.)

As a concrete example of where this improves things, we can use ripgrep, which now uses this new implementation on master. Here's one such case:

$ time rg-pre-memmem 'strengths' OpenSubtitles2018.raw.en -c

real    3.602
user    3.115
sys     0.483
maxmem  12510 MB
faults  0

$ time rg 'strengths' OpenSubtitles2018.raw.en -c

real    1.514
user    1.108
sys     0.403
maxmem  12510 MB
faults  0

The key here is that the literal strengths consists entirely of bytes that are pretty common in the corpus. This is where the previous algorithm used by the regex crate did poorly. Namely, it worked by feeding one of the bytes to memchr. Once memchr found a match, it returned and regex checked whether a second byte matched at the expected position. If it did, a full memcmp was run to confirm the match. The main problem here is that the code spends a lot of time popping in and out of memchr. With this new implementation based on the "Generic SIMD" algorithm, the check on the second byte is part of the vectorized loop itself. Therefore, the speedup is very simply due to spending more time in the vectorized loop.
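A rough scalar sketch of that older strategy, as described above (the `position` call stands in for the real vectorized memchr; `old_style_find` is a name invented here for illustration):

```rust
// Sketch of the pre-memmem strategy: scan for one needle byte
// (standing in for a memchr call), verify a second byte at the
// expected position, then confirm with a full comparison. When the
// chosen byte is common, this keeps bouncing in and out of the scan.
fn old_style_find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    let first = *needle.first()?;
    let mut pos = 0;
    while pos + needle.len() <= haystack.len() {
        // Stand-in for memchr: find the next occurrence of `first`.
        let off = haystack[pos..].iter().position(|&b| b == first)?;
        let start = pos + off;
        if start + needle.len() > haystack.len() {
            return None;
        }
        // Cheap second-byte check before the full comparison.
        if needle.len() == 1 || haystack[start + 1] == needle[1] {
            if &haystack[start..start + needle.len()] == needle {
                return Some(start);
            }
        }
        pos = start + 1;
    }
    None
}

fn main() {
    assert_eq!(old_style_find(b"aab strengths", b"strengths"), Some(4));
    assert_eq!(old_style_find(b"aab", b"ab"), Some(1));
    assert_eq!(old_style_find(b"abc", b"zzz"), None);
}
```

Every failed candidate here means leaving and re-entering the scan loop, which is exactly the overhead the new approach avoids by folding the second-byte check into the vectorized loop.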
