I built a tool (srgn) (quick start: cargo binstall srgn && srgn --help) which can be best thought of as a marriage of tr, rg and tree-sitter:
-
tr, because the tool does text manipulation, not necessarily just matching; for example, simple character replacement, or more advanced replacing such as-->to⟶.In its simplest form,
srgnis very similar totr. -
rg, because it can recursively work on all files matching a glob, and matches content with regex support, nottr's concept of character classes -
tree-sitter, because content can be matched minding language grammar, making it much more powerful than regular regex. Thetree-sitterCLI is great but not a turnkey solution, and doesn't do manipulation.srgncomes bundled with a bunch of premadetree-sitterqueries: this is a convenience feature, supposed to spare end users from having to learn the query language for basic use cases.
In combination, this allows for operations such as:
- "in all Python files, replace
printwithlogging.info" (without touching mentions ofprintanywhere else; see here), or - "for any Go file, fail CI if any
structserializes secret values by accident" (see here)
This project is medium-sized, with 4.5k SLOC (probably 1k+ of which are tests though). I'd appreciate any and all feedback, and am specifically interested in:
High-level questions
- does this seem useful (to you)?
- did I reinvent the wheel? Miss the forest for the trees? Do existing solutions exist? (I looked and didn't find any)
- how are the ergonomics of:
- the CLI? (the main thing)
- the Rust library API? (the tool is CLI-first, but I spent a lot of effort to have a usable library as well, on which the CLI is built)
-
one of the bigger warts here is possibly scoping (== matching which input to process) using
tree-sitter. I'm not fully happy with how that's modeled.For example, there's the concept of "premade queries" (example from Python), but those are plain
enums, and not coupled bytraits: each language has its ownPremadeLANGQuery, and alsoCustomLANGQuery. They're all fully decoupled; not the worst thing, but it's just copy-pasta. Any changes would need to be repeated manually. I feel like if I returned to this part a year from now, it'd be hard to grok.
-
Low-level questions
-
there were some fun but surprisingly edge-casey algorithms to solve for which I couldn't find useful crates. Specifically, from
tree-sitter, I get rawVec<Range<usize>>of whatever it matched; e.g., all raw byte ranges where comments were found in a source code file.However, languages like TypeScript and Python have string interpolation, and
tree-sittercannot natively return (correct me if wrong; I asked about that over here also, which gives good context), "all strings except string interpolation parts". In Python, this would be"Hello "out off"Hello {var}".tree-sittercan only return the entire string or{var}. I only want"Hello "as users should be able to manipulate (replace, uppercase, ...) that, without accidentally replacing/renaming variables such asvar, which would be an error.As a solution, I subtract one
Vec<Range<usize>>from another: the first one containing matches such as"Hello {var}", the second one{var}, leaving the desired"Hello ". It's a bit like subtraction of sets.Now, my solution isn't optimal in terms of complexity. The two range collections are sorted (by
start) and non-overlapping, and the second, to-be-subtracted one is a strict subset of the former. I don't make full use of these existing pre-conditions, but gave up after a while... Perhaps this can be improved. Pasting the code here for completeness:pub(crate) fn subtract<T>(mut left: Vec<Range<T>>, right: &Vec<Range<T>>) -> Vec<Range<T>> where T: Ord + Copy + std::fmt::Debug, { let mut res = Vec::with_capacity(left.len()); #[cfg(debug_assertions)] { let is_sorted = |ranges: &Vec<Range<T>>| ranges.windows(2).all(|w| w[0].start <= w[1].start); let is_not_overlapping = |ranges: &Vec<Range<T>>| ranges.windows(2).all(|w| w[0].end <= w[1].start); for ranges in &[&left, right] { trace!("Checking preconditions for ranges: {:?}", &ranges); debug_assert!(is_sorted(ranges)); debug_assert!(is_not_overlapping(ranges)); } } 'outer: for l in &mut left { for r in right { if r.end <= l.start { // Creeping in "from the left" continue; } if r.start >= l.end { // Gone past relevant range, go next break; } if r.start > l.start { // A small part to the left is 'free', aka uncovered by `r`; any later // `r` will be *even further* right, so we can safely push this part. res.push(l.start..r.start); } l.start = r.end; let is_fully_covered = l.start >= r.start && l.end <= r.end; if is_fully_covered { // This one is unrecoverable no matter what comes next, so skip ahead. continue 'outer; } } if !l.is_empty() { // Might have been decimated from mutation so much that it's empty now. res.push(l.clone()); } } res }