I’m in the process of trying to get the
regex crate to 1.0. One thing that’s important to me is that the API doesn’t preclude us from searching text as quickly as possible. One way to test whether that’s true or not is to compare the best performance I can get with something that is already known to be fast. My prototype is here: https://github.com/BurntSushi/xrep
There’s a lot I could write about the comparison, but for now, let’s just skip right to the issue I care about at the moment: the overhead of matching. The overhead of matching is the sequence of steps we must go through for every search in the text. While there is quite a bit that contributes to the total overhead, I claim that a significant source of that overhead is currently due to
Sync. Let me explain.
Every matching engine has some fixed amount of mutable scratch space in memory. The size of this space is typically proportional to the regular expression. It is therefore put on the heap. Since this scratch space is entirely an implementation detail, it is done using interior mutability. Trivially, this can be done with a single
RefCell. The problem with a
RefCell is that it is not
Sync, which means
Regex would not be
Regex not being
Sync means that the same
Regex could not be used simultaneously from multiple threads. In effect, the caller would need to put it behind a
Mutex. If a
Regex is behind a
Mutex, then it is restricted to executing only a single search at a time, which probably makes this approach mostly useless. Therefore, the caller would need to clone a
Regex for each thread it is used in. I claim that this is primarily a ergonomic loss, since the current implementation already clones the scratch space mentioned above when there is contention from multiple threads. Cloning a
Regex is still more wasteful than the current implementation, however. Just probably not that much more wasteful.
Regex not being
Sync also means that it cannot be trivially used with
lazy_static!, which is (IMO) a huge ergonomic win at the moment for ensuring that regexes are compiled exactly once.
For these reasons, it seems like dropping
Sync is not a practical choice.
The problem is, keeping
Sync doesn’t seem practical either. Consider a simple benchmark on 2GB of text from Project Gutenberg. This particular benchmark measures how long it takes to match every line of text in the file:
$ wc -l all 44174911 all $ time egrep -a -c '.*' all 44174911 real 0m1.468s user 0m1.233s sys 0m0.233s $ time xrep -c '.*' all 44174911 real 0m3.002s user 0m2.870s sys 0m0.130s
The above is the current code. Watch what happens when I remove the
Sync bound and use a simple
RefCell to store the mutable scratch space:
$ time xrep -c '.*' all 44174911 real 0m2.350s user 0m2.220s sys 0m0.130s
In other words, there is significant overhead as a result of synchronization, even in cases like this where there is zero contention. Note that there is also some overhead associated with the underlying memory pool itself, although perhaps that could be eliminated. For example, the same benchmark above, with the current code, except synchronization is removed (this is unsafe, but should do to measure the overhead of the pool put/get operations):
time xrep -c '.*' all 44174911 real 0m2.492s user 0m2.363s sys 0m0.127s
Given the current API, I claim it is impossible to achieve this level of performance because you cannot opt out of synchronization.
What should be done about this? Here is some food for thought:
- My knowledge of lock free algorithms is poor, therefore, some of the claims I’ve made above (or below) may be wrong. Please fix me.
- Synchronization is currently done inside the
mempoolcrate. The short story is that it uses a spin lock. The motivation for using a spin lock is to aggressively optimize for the case when there is no contention. This is important because even if we perform worse when there is contention, the caller can elect to choose to opt out of contention altogether by simply cloning
- What does the API look like if we provide an escape hatch that lets the caller get a hold of a regex value that isn’t
Sync? I can think of some hacks with deref, and I can think of things that involve duplicating the API, but neither of those seems like a good idea to me.
- One could imagine that the caller could somehow obtain ownership over the scratch space and then pass it down to each method in the public API. One could also imagine how clunky this is.
- It is possible to reduce the overhead to near zero for
captures_iter, because we could stick the scratch space in the iterator itself. Then it could be returned to the underlying pool when the iterator is dropped. The problem here is that this doesn’t help methods like
shortest_match, which are critical to writing a
grep-like tool. (
find_iterdoesn’t really make sense here, because you need fine grained control over which areas of the text you’re searching.)
- While a 1.0 release allows breaking changes, and I do have some planned already, I would like to avoid breaking a large number of uses of the
regexcrate. My suspicion is that this constraint paints us into a corner…
Any guidance would be most appreciated! (I note that this may be more an API design question than anything else.)