Unexpectedly good ripgrep performance vs. git grep

Continuing the discussion from [ANN] termcolor 1.0 is out and moved to its own repository:

@BurntSushi: I don't want to pollute your termcolor announcement too much :slight_smile:

Some background

We're open-sourcing our in-house data management platform, and this grep was to see if any of our real data was left over anywhere in the history, e.g. in example identifiers, test cases, etc.
It's about 4,200 commits and 250 MB in total, of which 110 MB is in .git.

After ~8 years of development as a "strictly internal" tool, in an academic setting, with quite a bit of contributor rotation, we know there are a lot of sloppy parts, data-privacy-wise, so this is a very crude hammer to see how much censoring work is ahead of us.

the query

I think it's mostly the query that is the problem; it wasn't exactly... elegant. It was a fairly brute-force approach: dump all "secret" identifiers from the DB (~16,000) into a file, and then do a search with --fixed-strings.

Fortunately, my colleague shared his results with me, so we can go into more detail if you'd like. I've included the (censored) commands we ran below.

I do wish to caution that the "comparison" in its current state probably isn't very fair.
For example, the amount of work isn't exactly equal between git-grep and ripgrep: the git-grep run filtered several subtrees after the fact (| grep -v), whereas the ripgrep run got those same subtrees as an --ignore-file, so it could prune early.
Also: the comparison where git-grep won used --word-regexp (on both ripgrep and git-grep), while the comparison where rg won used plain strings without word boundaries.

If you have any hypothesis you'd like to test, I'd be happy to re-run any queries :slight_smile:
I'd also be perfectly satisfied if you don't feel like digging in; I can imagine there are better uses of your time than this admittedly abominable "benchmark".
(Either way, we'll still be happy with ripgrep :angel:)

first query: look for 599 project identifiers, with word-boundaries

$ git grep -I --color=always --threads 4 --word-regexp --fixed-strings -f project-ids.txt |
    grep -v \
       -e assets/javascripts/dataTable \
       -e assets/javascripts/dracula \
       -e assets/javascripts/jquery-ui \
       -e assets/javascripts/jquery \
       -e assets/stylesheets/jquery-ui \
       -e assets/lib \
       -e scripts/some-local-scripts \
       -e scripts/some-local-scriptdata \
       -e tools/a-local-tool

real    0m4.801s
user    0m7.532s
sys     0m0.066s

----

$ cat to-ignore.txt
assets/javascripts/dataTable
assets/javascripts/dracula
assets/javascripts/jquery-ui
assets/javascripts/jquery
assets/stylesheets/jquery-ui
assets/lib
scripts/some-local-scripts
scripts/some-local-scriptdata
tools/a-local-tool
.git
grep*results.txt
misc/logo/logo.svg
# 12 lines

$ rg --color=always --hidden --ignore-file to-ignore.txt --fixed-strings --word-regexp -f project-ids.txt  > grepProjects-rg-results.txt

real    0m40.831s
user    2m39.757s
sys     0m0.076s

So in this case, ripgrep was roughly 10x slower than git-grep. I assume this is roughly what you'd normally expect, since git-grep can use its internal knowledge of the repo structure to optimise its search, whereas ripgrep cannot assume anything about the files.

In the bigger search, the results reversed:

second query: look for 15,764 person identifiers, without word boundaries

git grep -I --color=always \
   --threads 4 \
   --fixed-strings -f person-ids.txt |
    grep -v \
       -e assets/javascripts/dataTable \
       -e assets/javascripts/dracula \
       -e assets/javascripts/jquery-ui \
       -e assets/javascripts/jquery \
       -e assets/stylesheets/jquery-ui \
       -e assets/lib \
       -e scripts/some-local-scripts \
       -e scripts/some-local-scriptdata \
       -e tools/a-local-tool

real    161m29.397s
user    162m17.039s
sys     0m0.620s

----

$ rg --color=always --hidden \
     --ignore-file gitignore \
     --fixed-strings -f person-ids.txt

real    35m9.812s
user    135m49.630s
sys     0m6.332s 

We see that user-time remains fairly similar, but wall-time improved massively in rg; this leads me to conclude that better parallelism is to blame (or rather, to praise).
With such a large query size (remember, ~16K strings, with lots of self-similarity due to project prefixes), I assume that the matching itself starts to become costly. I am unsure how the optimisation works in git-grep versus rg, but I know you've spent a lot of time on literal optimisations (Aho-Corasick, if I'm not mistaken?).
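
This hypothesis should be easy to poke at by pinning both tools to a single thread and re-timing; a rough sketch of what I have in mind (the > /dev/null just keeps the output out of the timing):

# Pin both tools to one thread; if parallelism explains the gap,
# the wall-times should land much closer to the user-times.
$ time (git grep -I --threads 1 --fixed-strings -f person-ids.txt > /dev/null)
$ time (rg -j 1 --hidden --ignore-file to-ignore.txt --fixed-strings -f person-ids.txt > /dev/null)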

1 Like

Ah. I didn't realize the query was complex. Yes, that is likely the source of the problem. Without being able to run this on my own (which is fine), I can't say for sure, but I can guess:

  • Unicode-aware word boundaries can be a sore point (they're what --word-regexp uses). ripgrep has a few tricks for working around this, but they are superficial and possibly defeated by the fact that multiple patterns are being used. I'm not sure about this one, though; if it is a problem at all, it should be possible to fix.
  • When you use the -f flag to search multiple things, ripgrep basically just turns that into a giant alternation, e.g., id1|id2|...|idN, and hands it off to the regex engine (see the sketch after this list). The literal detection isn't yet smart enough to notice very large alternations of literals and turn that into a more optimized DFA. There are more subtle problems here than you might imagine; in particular, there are parts of literal detection that I believe are quadratic in the size of the regex, which is why it's very carefully constrained to only run on small inputs. (16,000 alternations will easily explode its budget, so literal detection will fail and the search falls back to the normal regex engine.) Literal detection is desperately crying out for a more principled approach, and it's on my list.
  • Given a sufficiently large number of pattern strings, it's possible for the regex engine's DFA to exceed its memory budget, and thereby fall back to the much slower NFA. You can test this hypothesis by passing something like --dfa-size-limit 2G to the ripgrep invocation. (The default is a paltry 10MB, which could definitely be blown by 16,000 literals.)
  • It is true that git grep "knows" which files to search much more quickly than ripgrep, since ripgrep re-litigates this issue by parsing and matching .gitignore files on every search. git grep, on the other hand, just reads from git's index, which can definitely allow it to run a bit faster than ripgrep. But I generally would not expect to see an order-of-magnitude difference here. (ripgrep could read git's index as well if it wanted to, but I've mostly resisted this temptation because it wouldn't actually let me delete any code from ripgrep and would greatly increase both operational and implementation complexity for not much gain.)
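
To make the second bullet concrete, here is a tiny sketch (with made-up identifiers) of the expansion that -f effectively performs:

# ids.txt contains three made-up identifiers: PROJ-101, PROJ-102, OTHER-7.
# Conceptually, this invocation...
$ rg --fixed-strings -f ids.txt
# ...is handed to the regex engine as one giant alternation of literals:
$ rg 'PROJ-101|PROJ-102|OTHER-7'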

There are a couple issues related to making -f work faster in some cases:

Experimenting with improving the handling of --word-regexp and/or using Aho-Corasick (or other similar algorithms) in more cases will be much easier in a world with libripgrep (in which the regex engine is completely pluggable), which is why I've mostly avoided attempting to fix this thus far.

4 Likes

Thanks for the ideas!

That seems like a very sensible temptation to resist;
Just the fact that it is impractical to make yourself depend on implementation details of another project (which the contents of .git technically are) makes it a good decision from a maintenance perspective.
And after that, there is the slippery slope: when special-casing git, why not also darcs, pijul, mercurial, etc.?
Indeed, best leave the git-interpreting to git :slight_smile:

I'm 90% sure we're hitting this; most of our IDs are of the form $PROJECT_PREFIX-$NUMBER, with dozens to hundreds of IDs sharing each prefix, so a proper DFA could optimise that a lot.
Not that this should be a priority; we could do a lot on our side by improving our search set, e.g. extracting the prefixes ourselves.
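
As a rough sketch of what I mean (assuming all our IDs really do match the PREFIX-NUMBER shape; the sed expression is guesswork, not something we've actually run on the real data):

# Collapse PROJ-101, PROJ-102, ... down to their shared prefixes,
# shrinking the needle set from ~16K IDs to a few hundred prefixes.
$ sed 's/-[0-9]*$//' person-ids.txt | sort -u > person-prefixes.txt
$ wc -l person-prefixes.txt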

I've run a new series for the project-ids (~600 needles) on my machine.
First, I've added the grep -v ignores to .gitignore so that both tools should ignore them in the same way, to equalise the workload.
Second, the repo is on a local SSD, and I ran the queries multiple times to pre-warm the filesystem cache; the pre-warming didn't have a significant effect over three runs.

# git, word
time (git grep \
   -I  --threads 4 \
   --color=always --word-regexp --fixed-strings -f $WORK_DIR/project-ids.txt \
   > $WORK_DIR/grepProjects-gitword-results.txt)
real	0m5.287s
user	0m8.062s
sys		0m0.043s
# 56 results

# git, plain
time (git grep \
   -I  --threads 4 \
   --color=never --fixed-strings -f $WORK_DIR/project-ids.txt \
   > $WORK_DIR/grepProjects-gitplain-results.txt)
real	0m5.435s
user	0m8.088s
sys	0m0.049s
# 419 results

# rg, word
time (rg  \
   --hidden --ignore-file $WORK_DIR/to-ignore.txt \
   --color=never --word-regexp --fixed-strings -f $WORK_DIR/project-ids.txt \
   > $WORK_DIR/grepProjects-rg-results.txt)
real	0m55.425s
user	3m12.984s
sys		0m0.346s
# 48 results

# rg, plain
time (rg  \
   --hidden --ignore-file $WORK_DIR/to-ignore.txt \
   --color=never --fixed-strings -f $WORK_DIR/project-ids.txt \
   > $WORK_DIR/grepProjects-rgplain-results.txt)
real	0m0.129s (!!!)
user	0m0.260s
sys	0m0.030s
# 264 results

# rg, word, dfa 2G
time (rg  \
   --hidden --ignore-file $WORK_DIR/to-ignore.txt \
   --dfa-size-limit 2G \
   --color=never --word-regexp --fixed-strings -f $WORK_DIR/project-ids.txt \
   > $WORK_DIR/grepProjects-rgdfa2g-results.txt)
real	0m56.283s
user	3m17.034s
sys		0m0.248s

It seems the DFA limit didn't change much for the 600-project-names case (which makes sense; it probably stays inside the 10 MB budget).

The --word-regexp mattered hugely for rg, though, and virtually nothing for git grep:
rg goes from almost a minute to a tenth of a second.

The other queries I'll have to run tomorrow; they take too long, and it is "Feierabend" (end of the workday) here. Time to go home :slight_smile:

@juleskers Thanks for running those! Based on that, it seems to me like the Unicode word boundary (added as part of --word-regexp) is probably preventing the DFA from running at all, and therefore falls back to the slower NFA. The DFA can't handle Unicode word boundaries sadly, so if the DFA sees a non-ASCII byte, it trips and fails over to the NFA.
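
If you want to test that specifically, one rough sketch (assuming your IDs contain no regex metacharacters, so dropping --fixed-strings is safe; the file names here are made up) is to wrap each pattern in ASCII-only word boundaries, which the DFA can handle:

# Wrap every ID in (?-u:\b), the ASCII-only word boundary, then
# search without --word-regexp and without --fixed-strings:
$ sed 's/.*/(?-u:\\b)&(?-u:\\b)/' project-ids.txt > project-ids-asciiwb.txt
$ time (rg --hidden --ignore-file to-ignore.txt -f project-ids-asciiwb.txt > /dev/null)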

But yeah, the --dfa-size-limit might prove more useful for the 16,000 pattern case.

2 Likes

More benchmark results :tada:

First of all: --dfa-size-limit 2G is our new hero: a blazing 13.4 seconds instead of well over an hour of my CPU fan running at full tilt while my computer can't even keep up with my typing :smile:
I'm glad you mentioned it before we did our 32K sample-identifiers search!
(Side note: for that 32K search, my colleague was just now investigating --regex-size-limit 100M, to get it to run at all.)
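
For the record, the invocation we're planning for that 32K run looks roughly like this (just a sketch; the 2G/100M values and the sample-ids.txt name are placeholders, not tuned or final):

$ time (rg --color=never --hidden --ignore-file $WORK_DIR/to-ignore.txt \
     --dfa-size-limit 2G --regex-size-limit 100M \
     --fixed-strings -f $WORK_DIR/sample-ids.txt \
     > $WORK_DIR/grepSamples-rg-results.txt)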

I also discovered that we could have known about --dfa-size-limit before, it is right there in the help text (emphasis mine):

--dfa-size-limit <NUM+SUFFIX?>
The upper size limit of the regex DFA. The default limit is 10M. This should
only be changed on very large regex inputs where the (slower) fallback regex
engine may otherwise be used if the limit is reached.

The argument accepts the same size suffixes as allowed in the
--max-filesize flag.
--max-filesize flag.

You've obviously documented it properly, but we still missed it (I guess out of laziness... "it's just a one-off" etc., so we never really read that deep into the docs).
We're bioinformaticians, and we regularly submit cluster jobs with walltime limits of a week without blinking.
I guess we just don't have a proper feel for how fast text search should be, so we didn't immediately raise eyebrows at "only" a couple of hours. We certainly have adjusted our expectations :smile:

Since I am now so ecstatically happy with --dfa-size-limit, how can we spread the joy to more people?
Would it make sense to raise the default limit from 10 MB?
You mention there are "more subtle issues than you might imagine", could you go into (a bit) more detail?
My current thinking is that in the case of --fixed-strings, we know we are getting a large alternation, so maybe we can avoid some of those subtle issues in that case? (I'd be interested in contributing a heuristic; my boss adds "within reason")

benchmark

# 16K rg
time (rg \
   --color=never --hidden --ignore-file $WORK_DIR/to-ignore.txt \
   --fixed-strings -f $WORK_DIR/person-ids.txt \
   > $WORK_DIR/grepPersons-rg-results.txt)
real	74m34.008s
user	240m57.074s
sys	0m31.244s
# 105 results

# 16K rg, dfa limit 2G
time (rg \
   --color=never --hidden --ignore-file $WORK_DIR/to-ignore.txt \
   --dfa-size-limit 2G \
   --fixed-strings -f $WORK_DIR/person-ids.txt \
   > $WORK_DIR/grepPersons-rg-results.txt)
real	0m13.428s (!!!)
user	0m41.275s
sys	0m0.442s
# 105 results

2 Likes

Perhaps ripgrep could print a warning to stderr when it falls back to the NFA for size reasons, pointing to the flag?

7 Likes

Indeed. I expect this limit to be increased in the next release. 10MB is too stingy. But there will always be a point at which the limit is too small. :slight_smile:

That had to do with literal detection specifically. I don't think I want to open that can of worms. It would probably be better to just read the code in the regex-syntax crate to witness the intricacies. The extraction process just needs a lot of TLC and more flexibility for common cases like "a giant alternation of literals."

But yeah, glad to hear you are getting better performance!

Dunno. This is really an implementation detail of the regex engine. ripgrep doesn't make this decision.

3 Likes