Help me reduce overhead of regex matching

So I think we have these options:

  1. Keep the API the same as today with minor breaking changes and an unavoidable performance hit.
  2. Duplicate the API as it is today so that there is a Sync type and a non-Sync type, with minor breaking changes. (e.g., @pixel's idea or something isomorphic)
  3. Drop Sync on Regex and force callers to cheaply clone Regex to use across multiple threads, with somewhat major breaking changes. (lazy_static! no longer works.)
  4. Expose a Cache trait and make Regex polymorphic over it with a default type parameter, with minor breaking changes. (See the sketch below.)
  5. Revamp the API in terms of a single trait with multiple implementations, which will break all uses of regex today, but breakages should be trivial to fix.

"minor breaking changes" implies small things that I want to fix regardless of what we do here. Only niche uses of regex should be impacted (e.g., folks who implement the Replacer trait explicitly).

"major breaking changes" implies many or all uses of regex break.

I feel like there is no clearly correct choice here.
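
For concreteness, here is a rough sketch of what option (4) might look like. All names here are hypothetical and the real scratch space is far more involved; the point is just the shape of the API:

use std::cell::RefCell;
use std::sync::Mutex;

// Scratch space used by the matching engines (contents elided).
#[derive(Default)]
pub struct Scratch;

// How a Regex obtains mutable scratch space for a single search.
pub trait Cache: Default {
    fn with_scratch<R, F: FnOnce(&mut Scratch) -> R>(&self, f: F) -> R;
}

// The default: Sync, so a plain `Regex` still works in lazy_static!.
#[derive(Default)]
pub struct SyncCache(Mutex<Scratch>);

impl Cache for SyncCache {
    fn with_scratch<R, F: FnOnce(&mut Scratch) -> R>(&self, f: F) -> R {
        f(&mut *self.0.lock().unwrap())
    }
}

// Opt-in single-threaded cache: no synchronization, but not Sync.
#[derive(Default)]
pub struct LocalCache(RefCell<Scratch>);

impl Cache for LocalCache {
    fn with_scratch<R, F: FnOnce(&mut Scratch) -> R>(&self, f: F) -> R {
        f(&mut *self.0.borrow_mut())
    }
}

// Existing code that names `Regex` keeps compiling; single-threaded
// callers can write `Regex<LocalCache>` to skip synchronization.
pub struct Regex<C: Cache = SyncCache> {
    cache: C, // plus the compiled program, elided
}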

If you take a look at what he's already doing, this pretty much describes it. In fact, in the linked issue he has measurements against a Treiber stack, which still has overhead. His current approach does something similar with a simple spinlock-protected stack, but that too carries some cost.

This is what the current implementation does. A TreiberStack is slower than even a Mutex for this particular case. See:

$ git clone git://github.com/BurntSushi/mempool
$ cd mempool
$ cargo bench --features nightly
test bench::crossbeam_ms_get_put      ... bench:         108 ns/iter (+/- 3)
test bench::crossbeam_seg_get_put     ... bench:         124 ns/iter (+/- 44)
test bench::crossbeam_treiber_get_put ... bench:          94 ns/iter (+/- 2)
test bench::mempool_get_put_arc       ... bench:          43 ns/iter (+/- 0)
test bench::mempool_get_put_ref       ... bench:          26 ns/iter (+/- 0)
test bench::mpmc_get_put              ... bench:          30 ns/iter (+/- 0)
test bench::mutex_get_put             ... bench:          46 ns/iter (+/- 1)
test bench::refcell_get_put           ... bench:          15 ns/iter (+/- 1)

Currently, the regex crate uses mempool_get_put_ref.

Well, hidden and unexpected synchronization is ugly. Hidden memory allocations are ugly too.

I would make a simple API:

struct Regex(Arc<CompiledRegex>); // Sync and can be put into lazy_static
struct Scanner(Arc<CompiledRegex>, Box<ScratchSpace>);
...
let mut scanner = regex.scanner();
scanner.find(x);
scanner.replace(x, y);

This makes it clear when memory is allocated and allows the user to either maintain their own cache of regular expressions or just allocate memory every time, if it's not on the critical path.

I don't think this makes memory allocation or synchronization any more explicit. Arc is at least as much synchronization as we're already doing, and a Scanner may still need to allocate memory. For example, the lazy DFA may use up to some memory limit, but it doesn't make sense to allocate the entire limit up front every time.
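
For example (a hypothetical sketch, not the regex crate's actual types): the lazy DFA's scratch space can start empty and grow on demand up to its cap, with the caller falling back to another engine once the cap is hit:

struct DfaScratch {
    states: Vec<u8>, // memory for lazily-built DFA states
    limit: usize,    // hard cap on bytes this scratch may use
}

impl DfaScratch {
    fn new(limit: usize) -> DfaScratch {
        // Start empty; most searches never get near the limit.
        DfaScratch { states: Vec::new(), limit }
    }

    // Try to reserve `n` more bytes, refusing once the cap would be
    // exceeded so the caller can fall back to another matching engine.
    fn grow(&mut self, n: usize) -> bool {
        if self.states.len().saturating_add(n) > self.limit {
            return false;
        }
        self.states.resize(self.states.len() + n, 0);
        true
    }
}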

In terms of API impact, I think this approach is isomorphic to @pixel's.

It seems to me that what you really want is per-object thread-local storage. You said that you are optimizing for the case where there is no contention. Couldn't you just have an "owner" thread which has fast access to the cache, and a slow path for other threads?

Here is my rough draft of what this would look like. In the fast path it has very little overhead (one TLS lookup + one relaxed atomic load), but it can get quite slow otherwise. However, this isn't too bad, since it can be worked around by cloning the regex object.

use std::cell::UnsafeCell;
use std::collections::HashMap;
use std::sync::Mutex;
use std::sync::atomic::AtomicUsize;
use std::sync::atomic::Ordering::Relaxed;

// Thread id, allocated on first use using an atomic global counter. Make sure
// this doesn't return a thread id of 0 (which means no owner).
static THREAD_ID_COUNTER: AtomicUsize = AtomicUsize::new(0);
thread_local!(static THREAD_ID: usize = THREAD_ID_COUNTER.fetch_add(1, Relaxed) + 1);

// Non-Sync cache, only usable by a single thread at once
struct LocalCache;

// Sync cache which allows a fast path for the first owner
struct Cache {
    // Thread id of the thread which got here first
    owner: AtomicUsize,

    // LocalCache for the fast thread
    fast_cache: UnsafeCell<LocalCache>,

    // LocalCaches for other threads
    slow_caches: Mutex<HashMap<usize, *mut LocalCache>>,
}

unsafe impl Sync for Cache {}

impl Cache {
    fn get(&self) -> &LocalCache {
        // Get our thread id, this should only be a few instructions
        let id = THREAD_ID.with(|id| *id);

        // Get the current owner, this is a single instruction
        let owner = self.owner.load(Relaxed);

        // Fast path, check if we are the owner
        if owner == id {
            // If we are then just return our cache
            return unsafe { &*self.fast_cache.get() };
        }

        // If there is no current owner, try to become the owner
        if owner == 0 {
            if self.owner.compare_exchange(0, id, Relaxed, Relaxed).is_ok() {
                // We managed to become the owner, return the fast cache
                return unsafe { &*self.fast_cache.get() };
            }
        }

        // Otherwise fall back to the slow path: fetch (or create) our
        // thread-local cache from the hash table, or whatever mechanism
        // you are currently using for the cache slow path. (See the
        // sketch below.)
        self.get_slow(id)
    }
}
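
The slow path is elided in the draft above; here is a minimal sketch of one way to fill it in, under the same assumptions (get_slow and its body are hypothetical, not part of the draft). Each non-owner thread gets its own heap-allocated LocalCache, registered in the mutex-protected map; freeing those boxes when the Cache is dropped is also elided.

impl Cache {
    fn get_slow(&self, id: usize) -> &LocalCache {
        let mut caches = self.slow_caches.lock().unwrap();
        // Create this thread's cache on first use, storing a raw pointer
        // so the returned reference can outlive the lock guard.
        let ptr = *caches
            .entry(id)
            .or_insert_with(|| Box::into_raw(Box::new(LocalCache)));
        // Sound only if each pointer is dereferenced exclusively by its
        // owning thread and lives as long as `self`.
        unsafe { &*ptr }
    }
}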

Ah, serves me right for not reading your code! The problem is trying to implement an unordered MPMC channel efficiently then. Looking at the ArrayQueue implementation at http://carllerche.github.io/syncbox/src/syncbox/array_queue.rs.html#60-62, it's using an Arc under the hood, which is a bit wasteful.

It seems like there should be a more efficient implementation of an MPMC channel that can have a single owner (in this case the regex) and so doesn't need an Arc, which might get the performance below that of mempool_get_put_ref.

I think @Amanieu saved the day! That approach works out really nicely. The benchmark in the OP now drops from 3.1s to 2.6s. Other benchmarks improve too!

cargo-benchcmp /tmp/regex/old ./new --threshold 5
name                                old ns/iter           new ns/iter             diff ns/iter   diff %
misc::anchored_literal_long_match   56 (6,964 MB/s)       45 (8,666 MB/s)                  -11  -19.64%
misc::anchored_literal_short_match  53 (490 MB/s)         43 (604 MB/s)                    -10  -18.87%
misc::easy0_32                      202 (292 MB/s)        183 (322 MB/s)                   -19   -9.41%
misc::easy1_1K                      259 (4,030 MB/s)      246 (4,243 MB/s)                 -13   -5.02%
misc::easy1_32                      169 (307 MB/s)        156 (333 MB/s)                   -13   -7.69%
misc::hard_32                       190 (310 MB/s)        175 (337 MB/s)                   -15   -7.89%
misc::match_class                   91 (890 MB/s)         86 (941 MB/s)                     -5   -5.49%
misc::medium_32                     192 (312 MB/s)        181 (331 MB/s)                   -11   -5.73%
misc::one_pass_long_prefix          164 (158 MB/s)        145 (179 MB/s)                   -19  -11.59%
misc::one_pass_long_prefix_not      156 (166 MB/s)        145 (179 MB/s)                   -11   -7.05%
misc::one_pass_short                116 (146 MB/s)        105 (161 MB/s)                   -11   -9.48%
misc::one_pass_short_not            117 (145 MB/s)        105 (161 MB/s)                   -12  -10.26%
sherlock::ing_suffix                654,080 (909 MB/s)    618,301 (962 MB/s)           -35,779   -5.47%
sherlock::letters                   34,233,762 (17 MB/s)  29,815,385 (19 MB/s)      -4,418,377  -12.91%
sherlock::letters_lower             33,175,030 (17 MB/s)  29,091,673 (20 MB/s)      -4,083,357  -12.31%
sherlock::repeated_class_negation   78,409,515 (7 MB/s)   85,151,748 (6 MB/s)        6,742,233    8.60%

So... Maybe this is good enough? The cost of synchronization now should be very low: one TLS op, one load and one branch. The code is here: mempool/lib.rs at tls · BurntSushi/mempool · GitHub --- I'd very much appreciate a safety audit!


What exactly do you mean with "duplicate the API"? If you are really only talking about the interface, have you thought of having a trait exposing the API and two structs, one Sync and one not, where the Sync one uses the other?

Basically you would end up using a trait object in user code (e.g. RegEx), and creation works by having two functions, say new() and newParallel(), each returning a trait object.

The struct that is not Sync would then have some functions (not exposed via the trait, only on the actual struct) to set the buffer, or even to pass the buffer to the specific function directly. The Sync version would use the non-Sync one internally, keeping its buffer behind a Mutex.

Please be aware that I do not know very much about the regex crate, so if this idea doesn't sound right or wouldn't perform well, just scrap it.
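
For what it's worth, here is a loose sketch of that idea with hypothetical names (not the regex crate's actual types): one trait for the search API, a non-Sync struct owning its buffer, and a Sync wrapper that keeps the non-Sync one behind a Mutex:

use std::cell::RefCell;
use std::sync::Mutex;

// The search API exposed by both structs.
trait Search {
    fn is_match(&self, text: &str) -> bool;
}

// Not Sync: the scratch buffer sits behind a RefCell, so no locking.
struct LocalRegex {
    scratch: RefCell<Vec<u8>>,
}

impl Search for LocalRegex {
    fn is_match(&self, text: &str) -> bool {
        let mut buf = self.scratch.borrow_mut();
        buf.clear(); // real matching elided; it mutates the buffer
        !text.is_empty()
    }
}

// Sync: wraps the non-Sync struct, keeping it behind a Mutex as the
// post suggests.
struct SharedRegex {
    inner: Mutex<LocalRegex>,
}

impl Search for SharedRegex {
    fn is_match(&self, text: &str) -> bool {
        self.inner.lock().unwrap().is_match(text)
    }
}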

Using a trait is option (5) I think in my post above. (A trait object doesn't seem appropriate though.)

I did some experiments with an implementation of a fixed-size, lossy MPMC channel. Benchmarks:

test bench::crossbeam_ms_get_put      ... bench:          99 ns/iter (+/- 10)
test bench::crossbeam_seg_get_put     ... bench:          78 ns/iter (+/- 6)
test bench::crossbeam_treiber_get_put ... bench:          84 ns/iter (+/- 7)
test bench::lossy_mpmc_get_put        ... bench:          22 ns/iter (+/- 2)
test bench::mempool_get_put_arc       ... bench:          41 ns/iter (+/- 1)
test bench::mempool_get_put_ref       ... bench:          26 ns/iter (+/- 3)
test bench::mpmc_get_put              ... bench:          27 ns/iter (+/- 2)
test bench::mutex_get_put             ... bench:          39 ns/iter (+/- 2)
test bench::mutex_lock_unlock         ... bench:          15 ns/iter (+/- 1)
test bench::refcell_get_put           ... bench:          14 ns/iter (+/- 1)
test bench::spin_lock_unlock          ... bench:           8 ns/iter (+/- 1)

The code is at Added experimental lossy channel to benchmarks. · asajeffrey/mempool@af53172 · GitHub
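
The commit above has the real code; purely as an illustration of the lossy idea (this is not @asajeffrey's implementation), a fixed-size lossy pool can be a flat array of atomic pointer slots, where put simply drops the value when every slot is full, so neither side ever blocks:

use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

// Note: a real version would also require T: Send before letting the
// pool be shared across threads.
struct LossyPool<T> {
    slots: Vec<AtomicPtr<T>>, // null means "empty slot"
}

impl<T> LossyPool<T> {
    fn new(capacity: usize) -> LossyPool<T> {
        let slots = (0..capacity).map(|_| AtomicPtr::new(ptr::null_mut())).collect();
        LossyPool { slots }
    }

    // Return a value to the pool, dropping it if all slots are taken.
    fn put(&self, value: Box<T>) {
        let p = Box::into_raw(value);
        for slot in &self.slots {
            if slot.compare_exchange(ptr::null_mut(), p, Ordering::AcqRel, Ordering::Relaxed).is_ok() {
                return;
            }
        }
        // Lossy: every slot is occupied, so free the value instead of
        // blocking or growing.
        drop(unsafe { Box::from_raw(p) });
    }

    // Take any pooled value, or None if the pool looks empty.
    fn get(&self) -> Option<Box<T>> {
        for slot in &self.slots {
            let p = slot.swap(ptr::null_mut(), Ordering::AcqRel);
            if !p.is_null() {
                return Some(unsafe { Box::from_raw(p) });
            }
        }
        None
    }
}

impl<T> Drop for LossyPool<T> {
    fn drop(&mut self) {
        // Free anything still parked in the slots.
        while self.get().is_some() {}
    }
}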

If I understand the regex crate correctly, there is a mutable Exec struct containing the result, multiple programs, and the cache.
The result and the cache need to be mutable during a search operation, but the programs could be immutable.

Would it be possible to separate these and share the immutable programs using Arc?
Sharing the result across threads does not make sense to me, and sharing the cache could introduce more overhead than gain.
Thus every thread would have to initialize its own cache, but afterwards it should run fast (without any other synchronization besides the Arc<programs>, which is only touched once per search call).

Edit:
I guess sharing a Regex object across threads would require cloning it. As long as the cache hasn't been used yet, this shouldn't introduce much overhead (a new Arc handle and some zeroed fields).
As a side effect, this should make the user aware that sharing a regex across threads has a cost (losing the cache).
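
A rough sketch of that shape, with hypothetical types (the real Exec is more involved): the immutable programs sit behind an Arc, and clone() hands each thread a fresh, empty cache:

use std::sync::Arc;
use std::thread;

struct Programs; // compiled, immutable matching programs (elided)

#[derive(Default)]
struct SearchCache; // per-clone scratch, filled lazily (elided)

struct Regex {
    programs: Arc<Programs>,
    cache: SearchCache,
}

impl Clone for Regex {
    fn clone(&self) -> Regex {
        Regex {
            programs: Arc::clone(&self.programs), // cheap refcount bump
            cache: SearchCache::default(),        // fresh cache: "lost" on clone
        }
    }
}

fn main() {
    let re = Regex { programs: Arc::new(Programs), cache: SearchCache::default() };
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let re = re.clone(); // each thread gets its own cache
            thread::spawn(move || {
                // Searches would go here; each clone mutates only its
                // own cache, so no synchronization beyond the Arc.
                let _ = re;
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
}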

Instead of extracting the scratch space from the regex and passing the scratch object to the regex methods, you could instead leave the scratch space in the Regex, with an interior Rc pointer to the shared compiled regex. Then the regexes themselves aren't sync, but they still get the benefit of sharing the bits that matter.

I say this knowing almost nothing about the domain.

Regex not being Sync is sad, unfortunately. :frowning:

This is option (3) in Help me reduce overhead of regex matching - #21 by BurntSushi (i.e., Drop Sync)

That mpmc benchmark is impressive! @Amanieu's TLS cache clocks in at 2 ns/op on my machine (with mempool_get_put_ref at the same as what you reported). (Check out the tls branch in mempool.)

I'm a bit concerned about the use of Relaxed ordering here. If thread A owns the fast cache and then thread B becomes the owner, there's no memory barrier between A's writes to the cache and B's reads, so B can read stale data. You may need to use a stronger ordering.

In fact, on closer reading, the tls code uses the slow path for every thread other than the first one to own the cache, so tls is going to win big on single-threaded examples, but fare less well in multi-threaded cases where most threads use the slow path.

Exactly. I maintain that we should be aggressively optimizing for the single-threaded case, not only because it is probably the most common, but also because the caller doesn't have any say in the matter (unless we change the API). On the other hand, the caller can clone a Regex instead of sharing it between threads to opt out of contention.

Once an owner is determined, the owner never changes. When a Regex is cloned, an entirely new pool is created, which will select a different owner.