Help me reduce overhead of regex matching

I'm in the process of trying to get the regex crate to 1.0. One thing that's important to me is that the API doesn't preclude us from searching text as quickly as possible. One way to test whether that's true or not is to compare the best performance I can get with something that is already known to be fast. My prototype is here: https://github.com/BurntSushi/xrep

There's a lot I could write about the comparison, but for now, let's just skip right to the issue I care about at the moment: the overhead of matching. The overhead of matching is the sequence of steps we must go through for every search in the text. While there is quite a bit that contributes to the total overhead, I claim that a significant source of that overhead is currently due to Regex satisfying Sync. Let me explain.

Every matching engine has some fixed amount of mutable scratch space in memory. The size of this space is typically proportional to the regular expression. It is therefore put on the heap. Since this scratch space is entirely an implementation detail, it is done using interior mutability. Trivially, this can be done with a single RefCell. The problem with a RefCell is that it is not Sync, which means Regex would not be Sync.

Regex not being Sync means that the same Regex could not be used simultaneously from multiple threads. In effect, the caller would need to put it behind a Mutex. If a Regex is behind a Mutex, then it is restricted to executing only a single search at a time, which probably makes this approach mostly useless. Therefore, the caller would need to clone a Regex for each thread it is used in. I claim that this is primarily a ergonomic loss, since the current implementation already clones the scratch space mentioned above when there is contention from multiple threads. Cloning a Regex is still more wasteful than the current implementation, however. Just probably not that much more wasteful.

Regex not being Sync also means that it cannot be trivially used with lazy_static!, which is (IMO) a huge ergonomic win at the moment for ensuring that regexes are compiled exactly once.

For these reasons, it seems like dropping Sync is not a practical choice.

The problem is, keeping Sync doesn't seem practical either. Consider a simple benchmark on 2GB of text from Project Gutenberg. This particular benchmark measures how long it takes to match every line of text in the file:

$ wc -l all
44174911 all
$ time egrep -a -c '.*' all
44174911

real    0m1.468s
user    0m1.233s
sys     0m0.233s
$ time xrep -c '.*' all
44174911

real    0m3.002s
user    0m2.870s
sys     0m0.130s

The above is the current code. Watch what happens when I remove the Sync bound and use a simple RefCell to store the mutable scratch space:

$ time xrep -c '.*' all
44174911

real    0m2.350s
user    0m2.220s
sys     0m0.130s

In other words, there is significant overhead as a result of synchronization, even in cases like this where there is zero contention. Note that there is also some overhead associated with the underlying memory pool itself, although perhaps that could be eliminated. For example, the same benchmark above, with the current code, except synchronization is removed (this is unsafe, but should do to measure the overhead of the pool put/get operations):

time xrep -c '.*' all
44174911

real    0m2.492s
user    0m2.363s
sys     0m0.127s

Given the current API, I claim it is impossible to achieve this level of performance because you cannot opt out of synchronization.

What should be done about this? Here is some food for thought:

  • My knowledge of lock free algorithms is poor, therefore, some of the claims I've made above (or below) may be wrong. Please fix me.
  • Synchronization is currently done inside the mempool crate. The short story is that it uses a spin lock. The motivation for using a spin lock is to aggressively optimize for the case when there is no contention. This is important because even if we perform worse when there is contention, the caller can elect to choose to opt out of contention altogether by simply cloning Regex.
  • What does the API look like if we provide an escape hatch that lets the caller get a hold of a regex value that isn't Sync? I can think of some hacks with deref, and I can think of things that involve duplicating the API, but neither of those seems like a good idea to me.
  • One could imagine that the caller could somehow obtain ownership over the scratch space and then pass it down to each method in the public API. One could also imagine how clunky this is.
  • It is possible to reduce the overhead to near zero for find_iter and captures_iter, because we could stick the scratch space in the iterator itself. Then it could be returned to the underlying pool when the iterator is dropped. The problem here is that this doesn't help methods like is_match or find or shortest_match, which are critical to writing a grep-like tool. (find_iter doesn't really make sense here, because you need fine grained control over which areas of the text you're searching.)
  • While a 1.0 release allows breaking changes, and I do have some planned already, I would like to avoid breaking a large number of uses of the regex crate. My suspicion is that this constraint paints us into a corner...

Any guidance would be most appreciated! (I note that this may be more an API design question than anything else.)

Related issues: #188 and #192.

8 Likes

It sounds like to me that you want a "state" object that is returned from the match. This object would hold any of the stateful information, and could only be owned by one thread. So you would have something like:

impl Regex {
    pub fn new(exp:String) -> Regex;
    pub fn match(&'a self, input : &'b str) -> RegexMatch<'a, 'b>;
}

RegexMatch would have the group/count/whatever methods to process the results of the match operation.

Or I could be missing the goal?

I think that falls under the "create a duplicate API" class of solutions, right? Either that, or the entire match API is really defined on RegexMatch and every caller must get a RegexMatch before doing any search.

The latter would, I think, break all existing uses of Regex. Sure it's legal for 1.0, but I've been trying to avoid that. (This is an unstated constraint, I'll add it to my OP.)

Just throwing an idea out there (don't know that much of the specifics), but could thread_local! be useful here? You could perhaps have a thread-local cache for scratch spaces and whenever you send the regex across threads it'll just allocate a new scratch space on the first match. In theory that'd get pretty close to RefCell maybe?

I looked at thread_local! because I had that same idea but couldn't quite figure out how it would work. Could you help me figure out the right sequence of steps? (FYI, I've never used thread_local! before.) I guess the pieces I don't quite grok are:

  1. Every compiled regex needs its own scratch space that is distinct from any other compiled regex. Thread locals need to be declared statically, so I assume this means that I need to come up with a scheme that allows multiple regexes to share the same thread local space? Is that right? Each regex would then need to know how to access its particular scratch space for each thread.
  2. When a regex is dropped, all of its scratch space should be dropped too. It seems like we'd therefore need a way to drop any extra space used by this regex in the thread local?

Given 1+2, it feels like it's headed straight into "write an allocator." I'm not intrinsically opposed to it, but it doesn't seem wise unless there's a simple implementation that is escaping me at the moment.

The exact semantics of the scratch space are very simple. I think this pseudo code should capture the current implementation:

// compilation
let prog = compile(parse("the pattern"));
let pool = Pool::new(prog.len());
Regex { prog: prog, pool: pool }

// search
// `re` is a `Regex`
let mut cache = re.pool.get(); // this is "expensive" because it must synchronize
re.prog.search(cache, the_text_to_search)

// the pool
impl Pool {
    // always returns scratch space usable for searching.
    // it may reuse scratch space returned to this pool.
    // it is safe to call from multiple threads simultaneously.
    fn get(&self) -> Cache;
}

And that's pretty much it.

I had thought about thread_local, what I think you would have to do is store a map in the thread_local that way if you are using 2 regex instances at the same time they each have their own scratch space.

Although this would make Regex non Sync...well it would be Sync but if you actually tried to start an operation on 1 thread, and complete it on another, i.e. use the scratch space started on the first thread things would go bad, since that space wouldn't be available on that thread.

So that would end up being something like:

lazy_static!{static ref KEY : AtomicUsize = AtomicUsize::new(0);}
thread_local!(static CACHE: RefCell<HashMap<u64, Cache>> = RefCell::new(HashMap::new()));

impl RegEx {
    pub fn new() -> RegEx {
        let key = KEY. fetch_add(1, Ordering::Relaxed);
        CACHE.with(|f| f.insert(key, Cache::new());

        RegEx {
            cache_key : key
        }
    }
}

impl Drop {
    fn drop(&mut self) {
        CACHE.with(|f| {
            f.remove(self.cache_key);
    });
}

When the object is freed, drop() runs and clear's it's cached data from the map. Which thinking about it you'd have a leak if the object was moved to another thread then destroyed because in that thread wouldn't have map where the object was created.

Basically you'd have to ensure that whatever method causes the cache to be created, the drop (or a manual free method) is called on the same thread. Which I feel would end up being handled in documentation so it's entirely possible that someone would ignore that and start leaking.

Ah yeah with those requirements thread_local! probably won't work. At the very least you'd have to accept a leak if you match on one thread and drop on another (at least until that thread exits). You're also right in that you couldn't have a thread local per instance of Regex, so you'd basically just be reimplementing a thread-local allocator, which sounds a lot like jemalloc :slight_smile:

So ah well, I guess thread locals won't work out :frowning:

It's even worse: there can be multiple cache entries that exist at any one point in time. If one thread calls find at the same time as another thread, then one thread will get the existing cache entry and the other thread will allocate a totally new one. These are all anchored to the Regex value controlled by the caller (since it is what ultimately owns the pool) and only get dropped when Regex gets dropped. In that case, you'd end up needing some kind of regex handle that gets dropped on a per-thread basis for thread local storage, but that's just a mirror image of where we are now. :frowning:

So... I'm increasingly coming to the conclusion that the only way to make something like this work is with @pixel's idea, but if we don't want to break the world, we'll have to offer that in tandem with the existing API, which just seems dreadful to me. "Here's the regular API, but then here's this other API that's pretty much the same but it's maybe faster." Gah.

It sounds like this scratch space is not re-used between two invocations of the regex; is that correct? So it sounds like the main reason for having the scratch space is to avoid hitting the allocator when we know that we will be repeatedly allocating objects of the same size over and over again. However, due to this synchronization overhead, it seems like the scratch space approach is actually fairly costly. Have you tried comparing what it would cost to just use the allocator to get this scratch space? I thought that jemalloc was supposed to be fairly well suited to repeated allocations and deallocations of similar sized objects in quick succession from multiple threads, so rather than writing an allocator, how does using the allocator we already have compare?

Or am I incorrect, and this scratch space should be preserved between invocations of the regex?

I also don't really see too much of a problem with the "duplicate the API" approach; it has precedent in things like the Stdin.lock() approach. Having a trait that represents everything you could do on a Sync regex or the non-Sync object you can extract, and having the Sync implementation just obtain the non-Sync version paying the performance penalty, wouldn't be the end of the world; it does mean you have to be aware of this to squeeze out the most performance, but it also provides a simple and easy to use API for the vast majority of the time when squeezing the utmost performance out isn't critical.

Ah, yup, I left that bit out! Indeed, the whole point of the cache is for it to be reused between invocations of the same Regex value. e.g., let re = Regex::new("..."); re.is_match("..."); re.find("..."); will allocate cache entries lazily on the first call to is_match and reuse them in find. I recall doing this a while back lead to significant performance wins (which did somewhat surprise me since I guessed the underlying allocator should optimize this case well, as you say). In any case, that's no longer relevant because the DFA doesn't just reuse the allocation, it also reuses the data inside the cache as well. The cache contains computed states. This is probably the most critical component of how we get fast regexes! :slight_smile: The other matching engines don't reuse the cache in this way. Only the DFA does.

I guess I'd feel a little better with a trait. At least then the API would be documented in one place. One advantage to this is that bytes::Regex could implement this same trait (probably with an associated type). I had previously poo-pooed the idea as not having enough motivation, but if the trait is going to be there anyway, we might as well...

So i'm guessing the reason you want to use the same regex on multiple threads is to save the memory/cpu overhead of decoding the regex expression, correct? And I'm assuming once the regex object is created this state is effectively read-only?

If that's true then you could store this state in an Arc element and require developer clone the regex for each thread they want to run on. This should make the clone fairly inexpensive. You don't have the Sync but the clone wouldn't be as bad in this situation.

As @BurntSushi said, it's not only decoding, but also the DFA state cache.

As an example, I've been hacking on an ack-like tool which searches a directory tree concurrently in multiple threads. One thread generates filenames that are then sent to a task function executed in worker threads using a thread pool.

If Regex wasn't Sync, I'd have to construct the Regex each time in the task function (which means for each file, not each thread!) or share the Regex objects in one of the worker threads between tasks (with threadlocals, presumably), which neatly destroys the abstraction of the thread pool.

That would work by putting only the read-only state behind an Arc internally, yes. So that does alleviate my performance concerns with dropping Sync at least.

But as I mentioned, I think the more important issue with dropping Sync is about ergonomics. It wouldn't work with lazy_static! for example, which requires Sync. (And therefore would break a lot of code, but probably not all of it.)

One problem with introducing a trait is that it needs to be in scope, which also means all uses today of Regex would break. :-/

Could you make the faster state-object-level approach a new regex2 crate (and perhaps refactor regex to use it)?

This would break no existing code, and people would be able to choose between easy and fast.

It sounds to me that if you want interior mutability and Sync you either have to use a Mutex/RwLock or store the mutable data outside of the object.

If you store the mutable state outside of the object it's either a second object on the stack or in a static variable.

If you store it in a a static variable, then it's in a Map that that maps object instances to their mutable data.

If you want the the mutable data shared between threads, then that needs it's own Mutex/RwLock and is probably defeating the purpose of multiple threads.

I don't think we can escape these restrictions in Rust. In other languages we'd just say the object isn't thread safe and leave it to the caller to ensure it's not called simultaneously on multiple threads. But Rust is going to ensure the object is used correctly by allowing or denying us the Sync trait.

Loosing the Sync trait and making the clone() as efficient as possible has the advantage of keeping the API the same. Using a second "state" object on the stack changes the API but allows us to keep the Sync trait. I just don't see a third option.

I imagine you could use some kind of Cache trait that can be either Sync (easy) or not Sync (fast). The Sync-ness would be propagated to the main Regex type. You can specify a default impl in your Regex type so that you don't break existing things.

Hmm, I'm not sure I see too much difference between a separate crate and a major version bump. regex is used a lot, but it's not quite foundational in the same sense that, say, libc is since I assume types in regex aren't frequently shared across crate boundaries. (And I think most folks are off of the crate = "*" business by now, so a semver bump shouldn't break the world on its own.)

That is an interesting idea. I think using a trait might not be such a good idea since that trait would then need to be public, and it's really an implementation detail. That would work though, I think.

One strategy here would be to use a Cache pool, rather than a single cache. initially, the pool is empty, when a consumer wants access to a cache, they try popping one from the pool, and if this fails, they create a new cache. Once they're done, they push the cache back into the pool.

The data structure just needs to support non-blocking push and pop, so a Trieber stack would do the trick: there's an implementation in Rust at http://aturon.github.io/crossbeam-doc/crossbeam/sync/struct.TreiberStack.html.

The main problem with this approach is that each cache is maintaining its state independently, whenever a new cache is created it starts fresh, and there's no attempt to merge or share state between caches. There are probably solutions to this, with strategies for cache merging and reclamation, but this is probably more trouble than its worth.