Continuing the discussion from [ANN] termcolor 1.0 is out and moved to its own repository:
@BurntSushi: I don't want to pollute your termcolor announcement too much.
Some background
We're open-sourcing our in-house data management platform, and this grep was to see if any of our real data was left over anywhere in the history, e.g. in example identifiers, test cases, etc.
It's about 4,200 commits, 250 MB in total, of which 110 MB is in .git.
After ~8 years of development as a "strictly internal" tool, in an academic setting, with quite a bit of contributor rotation, we know there are a lot of sloppy parts, data-privacy wise, so this is a very crude hammer to see how much censoring-work is ahead of us.
the query
I think it's mostly the query that is the problem; it wasn't exactly... elegant... It was a fairly brute-force approach: dump all "secret" identifiers from the DB (~16,000) into a file, and then do a search with --fixed-strings.
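To make the role of --fixed-strings concrete, here is a minimal sketch using plain grep as a stand-in (its -F and -f options play the same role). The identifiers and file names are hypothetical, not our real data:

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)

# Hypothetical identifiers; the real ones were dumped from the database.
printf 'P.12\nQ-7\n' > "$work/ids.txt"
printf 'sample P.12 here\nsample PX12 here\nsample Q-7 here\n' > "$work/data.txt"

# Patterns read as regular expressions: "." matches any character,
# so the unrelated "PX12" line is counted as well.
as_regex=$(grep -c -f "$work/ids.txt" "$work/data.txt")

# Patterns read as literal strings: only the exact identifiers match.
as_literal=$(grep -c -F -f "$work/ids.txt" "$work/data.txt")

echo "regex matches: $as_regex, literal matches: $as_literal"
```

With identifiers that contain regex metacharacters, the literal mode avoids false positives and lets the tool use literal-search optimisations.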
Fortunately, my colleague shared his results with me, so we can go into more detail if you'd like. I've included the (censored) commands we ran below.
I do wish to caution that the "comparison" in its current state probably isn't very fair.
For example, the amount of work isn't exactly equal between git-grep and ripgrep: the git-grep run filtered out several subtrees after the fact (| grep -v), whereas the ripgrep run got those same subtrees as an --ignore-file, so it could prune early.
Also: the comparison where git-grep won used --word-regexp (on both ripgrep and git-grep), while the comparison where rg won had plain strings without word boundaries.
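For the first caveat, one way to level the comparison would be to let git-grep prune the subtrees itself with exclude pathspecs instead of piping through grep -v. We did not run it this way; the sketch below uses a hypothetical throwaway repository only to show that both styles report the same lines:

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)

# Hypothetical identifier list, kept outside the repository.
printf 'PRJ-001\n' > "$work/ids.txt"

# Hypothetical throwaway repository with one vendored and one own file.
mkdir -p "$work/repo/assets/lib" "$work/repo/src"
cd "$work/repo"
git init -q .
echo 'id PRJ-001 in vendored code' > assets/lib/vendor.js
echo 'id PRJ-001 in our code' > src/app.txt
git add assets src   # git grep only searches tracked files

# Style 1: search everything, then drop unwanted paths after the fact.
post_filtered=$(git grep -I --fixed-strings -f "$work/ids.txt" | grep -v -e assets/lib)

# Style 2: exclude the subtree up front with an exclude pathspec.
pruned=$(git grep -I --fixed-strings -f "$work/ids.txt" -- . ':(exclude)assets/lib')

echo "post-filtered: $post_filtered"
echo "pruned:        $pruned"
```

The second style skips the excluded files entirely, which is closer to what ripgrep does with --ignore-file.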
If you have any hypothesis you'd like to test, I'd be happy to re-run any queries.
I'd also be perfectly satisfied if you don't feel like digging in; I can imagine there are better uses of your time than this admittedly abominable "benchmark".
(Either way, we'll still be happy with ripgrep.)
first query: look for 599 project identifiers, with word-boundaries
$ git grep -I --color=always --threads 4 --word-regexp --fixed-strings -f project-ids.txt |
grep -v \
-e assets/javascripts/dataTable \
-e assets/javascripts/dracula \
-e assets/javascripts/jquery-ui \
-e assets/javascripts/jquery \
-e assets/stylesheets/jquery-ui \
-e assets/lib \
-e scripts/some-local-scripts \
-e scripts/some-local-scriptdata \
-e tools/a-local-tool
real 0m4.801s
user 0m7.532s
sys 0m0.066s
----
$ cat to-ignore.txt
assets/javascripts/dataTable
assets/javascripts/dracula
assets/javascripts/jquery-ui
assets/javascripts/jquery
assets/stylesheets/jquery-ui
assets/lib
scripts/some-local-scripts
scripts/some-local-scriptdata
tools/a-local-tool
.git
grep*results.txt
misc/logo/logo.svg
# 12 lines
$ rg --color=always --hidden --ignore-file to-ignore.txt --fixed-strings --word-regexp -f project-ids.txt > grepProjects-rg-results.txt
real 0m40.831s
user 2m39.757s
sys 0m0.076s
So in this case, ripgrep was about 8.5x slower than git-grep (40.8 s versus 4.8 s wall time). I assume this is roughly what you'd normally expect, since git-grep can use its internal knowledge of the repo structure to optimise its search, whereas ripgrep cannot assume anything about the files.
In the bigger search, the results reversed:
second query: look for 15,764 person identifiers, without word boundaries
$ git grep -I --color=always \
--threads 4 \
--fixed-strings -f person-ids.txt |
grep -v \
-e assets/javascripts/dataTable \
-e assets/javascripts/dracula \
-e assets/javascripts/jquery-ui \
-e assets/javascripts/jquery \
-e assets/stylesheets/jquery-ui \
-e assets/lib \
-e scripts/some-local-scripts \
-e scripts/some-local-scriptdata \
-e tools/a-local-tool
real 161m29.397s
user 162m17.039s
sys 0m0.620s
----
$ rg --color=always --hidden \
--ignore-file gitignore \
--fixed-strings -f person-ids.txt
real 35m9.812s
user 135m49.630s
sys 0m6.332s
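We see that user time remains fairly similar (about 162 min for git-grep versus 136 min for rg), but wall time improved massively with rg (about 161 min versus 35 min, roughly 4.6x faster). git-grep's user time is almost equal to its wall time, which suggests it ran effectively single-threaded despite --threads 4, whereas rg's user time is about 3.9x its wall time; this leads me to conclude that better parallelism is to thank.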
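For anyone who wants to check the arithmetic, here is a small sketch that derives those ratios from the timings of the second query. The helper names to_seconds and ratio are my own, made up for this post:

```shell
#!/bin/sh
# Convert a `time`-style duration such as "161m29.397s" to seconds.
to_seconds() {
    printf '%s\n' "$1" | awk -F'[ms]' '{ printf "%.3f", $1 * 60 + $2 }'
}

# Print the quotient of two numbers with two decimals.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f", a / b }'
}

# Timings copied from the second query.
git_real=$(to_seconds 161m29.397s)
git_user=$(to_seconds 162m17.039s)
rg_real=$(to_seconds 35m9.812s)
rg_user=$(to_seconds 135m49.630s)

echo "wall-time speedup of rg over git grep: $(ratio "$git_real" "$rg_real")x"
echo "git grep user/real: $(ratio "$git_user" "$git_real")"
echo "rg user/real:       $(ratio "$rg_user" "$rg_real")"
```

A user/real ratio near 1 means roughly one core was kept busy on average, while a ratio near 4 means close to four cores were.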
With such a large query size (remember, ~16K strings, with lots of self-similarity due to project-prefixes), I assume that the matching itself starts to become costly. I am unsure how the optimisation works in git-grep versus rg, but I know you've spent a lot of time on literal optimisations (aho-corasick, if I'm not mistaken?).