I'm in the process of trying to get the `regex` crate to 1.0. One thing that's important to me is that the API doesn't preclude us from searching text as quickly as possible. One way to test whether that's true is to compare the best performance I can get with something that is already known to be fast. My prototype is here: https://github.com/BurntSushi/xrep

There's a lot I could write about the comparison, but for now, let's skip right to the issue I care about at the moment: the overhead of matching, i.e., the sequence of steps we must go through for every search in the text. While there is quite a bit that contributes to the total overhead, I claim that a significant source of that overhead is currently due to `Regex` satisfying `Sync`. Let me explain.
Every matching engine has some fixed amount of mutable scratch space in memory. The size of this space is typically proportional to the size of the regular expression, so it is put on the heap. Since this scratch space is entirely an implementation detail, it is managed using interior mutability. Trivially, this can be done with a single `RefCell`. The problem with a `RefCell` is that it is not `Sync`, which means `Regex` would not be `Sync`.
`Regex` not being `Sync` means that the same `Regex` could not be used simultaneously from multiple threads. In effect, the caller would need to put it behind a `Mutex`. But if a `Regex` is behind a `Mutex`, then it is restricted to executing only a single search at a time, which probably makes this approach mostly useless. Therefore, the caller would need to clone a `Regex` for each thread it is used in. I claim that this is primarily an ergonomic loss, since the current implementation already clones the scratch space mentioned above when there is contention from multiple threads. Cloning a `Regex` is still more wasteful than the current implementation, however; just probably not that much more wasteful.
`Regex` not being `Sync` also means that it cannot be trivially used with `lazy_static!`, which is (IMO) a huge ergonomic win at the moment for ensuring that regexes are compiled exactly once.
For these reasons, it seems like dropping `Sync` is not a practical choice.
The problem is, keeping `Sync` doesn't seem practical either. Consider a simple benchmark on 2GB of text from Project Gutenberg. This particular benchmark measures how long it takes to match every line of text in the file:
```
$ wc -l all
44174911 all
$ time egrep -a -c '.*' all
44174911

real    0m1.468s
user    0m1.233s
sys     0m0.233s
$ time xrep -c '.*' all
44174911

real    0m3.002s
user    0m2.870s
sys     0m0.130s
```
The above is the current code. Watch what happens when I remove the `Sync` bound and use a simple `RefCell` to store the mutable scratch space:
```
$ time xrep -c '.*' all
44174911

real    0m2.350s
user    0m2.220s
sys     0m0.130s
```
In other words, there is significant overhead as a result of synchronization, even in cases like this where there is zero contention. Note that there is also some overhead associated with the underlying memory pool itself, although perhaps that could be eliminated. For example, here is the same benchmark, with the current code, except with synchronization removed (this is unsafe, but it should suffice to measure the overhead of the pool's put/get operations):
```
$ time xrep -c '.*' all
44174911

real    0m2.492s
user    0m2.363s
sys     0m0.127s
```
Given the current API, I claim it is impossible to achieve this level of performance because you cannot opt out of synchronization.
What should be done about this? Here is some food for thought:
- My knowledge of lock-free algorithms is poor, so some of the claims I've made above (or below) may be wrong. Please fix me.
- Synchronization is currently done inside the `mempool` crate. The short story is that it uses a spin lock. The motivation for using a spin lock is to aggressively optimize for the case when there is no contention. This is important because even if we perform worse when there is contention, the caller can elect to opt out of contention altogether by simply cloning the `Regex`.
- What does the API look like if we provide an escape hatch that lets the caller get a hold of a regex value that isn't `Sync`? I can think of some hacks with deref, and I can think of things that involve duplicating the API, but neither of those seems like a good idea to me.
- One could imagine that the caller could somehow obtain ownership over the scratch space and then pass it down to each method in the public API. One could also imagine how clunky this is.
- It is possible to reduce the overhead to near zero for `find_iter` and `captures_iter`, because we could stick the scratch space in the iterator itself. Then it could be returned to the underlying pool when the iterator is dropped. The problem here is that this doesn't help methods like `is_match` or `find` or `shortest_match`, which are critical to writing a `grep`-like tool. (`find_iter` doesn't really make sense here, because you need fine-grained control over which regions of the text you're searching.)
- While a 1.0 release allows breaking changes, and I do have some planned already, I would like to avoid breaking a large number of uses of the `regex` crate. My suspicion is that this constraint paints us into a corner...
Any guidance would be most appreciated! (I note that this may be more an API design question than anything else.)