Continuing the discussion from [ANN] termcolor 1.0 is out and moved to its own repository:
@BurntSushi: I don’t want to pollute your termcolor announcement too much
We’re open-sourcing our in-house data management platform, and this grep was to see if any of our real data was left over anywhere in the history, e.g. in example identifiers, test cases, etc.
It’s about 4,200 commits, 250 MB in total, of which 110 MB in
After ~8 years of development as a “strictly internal” tool, in an academic setting, with quite a bit of contributor rotation, we know there are a lot of sloppy parts, data-privacy-wise, so this is a very crude hammer to see how much censoring work is ahead of us.
I think it’s mostly the query that is the problem; it wasn’t exactly… elegant… It was a fairly brute-force approach: dump all “secret” identifiers from the DB (~16,000) into a file, and then do a search with
Fortunately, my colleague shared his results with me, so we can go into more detail if you’d like. I’ve included the (censored) commands we ran below.
I do wish to caution that the “comparison” in its current state probably isn’t very fair.
For example, the amount of work isn’t exactly equal between git-grep and ripgrep: the git-grep run filtered out several subtrees after the fact (`| grep -v`), whereas the ripgrep run got those same subtrees as an `--ignore-file`, so it could prune early.
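To make the difference concrete, here is a rough Python sketch (not the commands we ran) that contrasts the two strategies on a throwaway directory tree. The directory names `src` and `vendor`, the file counts, and both helper functions are made up for illustration; real tools do far more than walk directories.

```python
import os
import tempfile


def walk_then_filter(root, ignored):
    """Visit every file, then drop ignored subtrees afterwards (like `| grep -v`)."""
    visited = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            visited.append(os.path.relpath(os.path.join(dirpath, name), root))
    kept = [p for p in visited if p.split(os.sep)[0] not in ignored]
    return kept, len(visited)


def walk_with_pruning(root, ignored):
    """Skip ignored subtrees before descending into them (like an ignore file)."""
    visited = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Editing dirnames in place tells os.walk not to descend into them.
        dirnames[:] = [d for d in dirnames if not (dirpath == root and d in ignored)]
        for name in filenames:
            visited.append(os.path.relpath(os.path.join(dirpath, name), root))
    return visited, len(visited)


def demo():
    """Build a tiny hypothetical tree and run both strategies on it."""
    with tempfile.TemporaryDirectory() as root:
        for sub, count in (("src", 3), ("vendor", 50)):
            os.makedirs(os.path.join(root, sub))
            for i in range(count):
                with open(os.path.join(root, sub, f"f{i}.txt"), "w") as fh:
                    fh.write("x")
        return walk_then_filter(root, {"vendor"}), walk_with_pruning(root, {"vendor"})
```

Both strategies end up reporting the same three files under `src`, but the filter-afterwards variant touches all 53 files first, while the pruning variant only ever looks at 3. That is the kind of head start the ripgrep run had here.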
Also: the comparison where git-grep won used `--word-regexp` (on both ripgrep and git-grep), whereas the comparison where `rg` won had plain strings without word boundaries.
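For readers unfamiliar with the distinction: matching on word boundaries is, roughly speaking, like wrapping the pattern in `\b…\b`, which changes both the amount of work and the set of hits. A small Python sketch with invented identifiers (`P001`, `P0012`; nothing from our data) shows the effect. Python’s `re` module only stands in for the idea here; it says nothing about how either tool implements it.

```python
import re


def build_patterns(identifiers):
    """Return (plain, whole_word) regexes for a list of literal identifiers."""
    # Longest first, so the alternation prefers "P0012" over its prefix "P001".
    ordered = sorted(identifiers, key=len, reverse=True)
    alternation = "|".join(re.escape(i) for i in ordered)
    plain = re.compile(alternation)
    whole_word = re.compile(r"\b(?:" + alternation + r")\b")
    return plain, whole_word


# Hypothetical identifiers and text, for illustration only.
identifiers = ["P001", "P0012"]
text = "P001 XP0012 P00123 P0012"

plain, whole_word = build_patterns(identifiers)
plain_hits = plain.findall(text)       # substring matches anywhere
word_hits = whole_word.findall(text)   # only identifiers standing alone
```

Here the plain search reports four hits (including the ones buried inside `XP0012` and `P00123`), while the word-boundary search reports only the two standalone identifiers. So the two queries in this post are not just differently sized, they are different kinds of search.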
If you have any hypothesis you’d like to test, I’d be happy to re-run any queries.
I’d also be perfectly satisfied if you don’t feel like digging in; I can imagine there are better uses of your time than this admittedly abominable “benchmark”.
(Either way, we’ll still be happy with ripgrep.)
First query: look for 599 project identifiers, with word boundaries
So in this case, ripgrep was a factor of 10 slower than git-grep. I assume this is roughly what you’d normally expect, since git-grep can use its internal knowledge of the repo structure to optimise its search, whereas ripgrep cannot assume anything about the files.
In the bigger search, the results reversed:
Second query: look for 15,764 person identifiers, without word boundaries
We see that user time remains fairly similar, but wall time improved massively in `rg`; this leads me to conclude that better parallelism is to ~~blame~~ praise.
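The reasoning behind that conclusion is just arithmetic: CPU time (user + sys) divided by wall time gives the average number of cores kept busy. A tiny sketch, with placeholder numbers that are deliberately NOT the measurements from our runs:

```python
def effective_parallelism(user_s, sys_s, wall_s):
    """CPU seconds consumed per second of wall-clock time."""
    return (user_s + sys_s) / wall_s


# Placeholder numbers only; these are not the timings from this thread.
single_threaded = effective_parallelism(user_s=40.0, sys_s=2.0, wall_s=42.0)
multi_threaded = effective_parallelism(user_s=40.0, sys_s=2.0, wall_s=6.0)
```

With identical CPU time, the first hypothetical run averages 1 busy core and the second 7. Similar user time combined with a much shorter wall time therefore points at the work being spread over more cores, not at less work being done.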
With such a large query size (remember, ~16K strings, with lots of self-similarity due to project prefixes), I assume that the matching itself starts to become costly. I am unsure how the optimisation works in git-grep versus `rg`, but I know you’ve spent a lot of time on literal optimisations (`aho-corasick`, if I’m not mistaken?).
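For anyone following along who hasn’t met Aho-Corasick: the idea is to compile all literals into one automaton, so the text is scanned once regardless of how many patterns there are, and shared prefixes (like our project prefixes) are stored only once. Below is a minimal textbook-style sketch in Python. It is my own illustration and makes no claim about how ripgrep or git-grep actually do it; the identifiers `PRJ-1`, `PRJ-12` and `RJ-1` are invented.

```python
from collections import deque


def build_automaton(patterns):
    """Build trie transitions, failure links and output sets for the patterns."""
    goto = [{}]     # goto[state][char] -> next state
    fail = [0]      # fail[state] -> state for the longest proper suffix
    out = [set()]   # out[state] -> patterns that end at this state
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(pattern)

    # Breadth-first pass: a node's failure link is derived from its parent's.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit matches that end at the suffix state as well.
            out[nxt] |= out[fail[nxt]]
    return goto, fail, out


def find_all(text, automaton):
    """Return sorted (start_index, pattern) pairs for every match in text."""
    goto, fail, out = automaton
    state = 0
    hits = []
    for pos, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in out[state]:
            hits.append((pos - len(pattern) + 1, pattern))
    return sorted(hits)


# Hypothetical identifiers sharing a prefix, as in our data set.
automaton = build_automaton(["PRJ-1", "PRJ-12", "RJ-1"])
hits = find_all("xPRJ-12y", automaton)
```

Scanning `xPRJ-12y` finds all three overlapping identifiers in a single pass over the text. With ~16K self-similar strings the automaton gets large, which is where I would guess the serious implementations spend their optimisation effort.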