Tool for language grammar-aware text manipulation

I built a tool (srgn) (quick start: cargo binstall srgn && srgn --help) which can be best thought of as a marriage of tr, rg and tree-sitter:

  • tr, because the tool does text manipulation, not necessarily just matching; for example, simple character replacement, or more advanced replacing such as --> to .

    In its simplest form, srgn is very similar to tr.

  • rg, because it can recursively work on all files matching a glob, and matches content with regex support, not tr's concept of character classes

  • tree-sitter, because content can be matched minding language grammar, making it much more powerful than regular regex. The tree-sitter CLI is great but not a turnkey solution, and doesn't do manipulation.

    srgn comes bundled with a bunch of premade tree-sitter queries: this is a convenience feature, supposed to spare end users from having to learn the query language for basic use cases.

In combination, this allows for operations such as:

  • "in all Python files, replace print with logging.info" (without touching mentions of print anywhere else; see here), or
  • "for any Go file, fail CI if any struct serializes secret values by accident" (see here)

This project is medium-sized, with 4.5k SLOC (probably 1k+ of which are tests though). I'd appreciate any and all feedback, and am specifically interested in:

High-level questions

  • does this seem useful (to you)?
  • did I reinvent the wheel? Miss the forest for the trees? Do existing solutions exist? (I looked and didn't find any)
  • how are the ergonomics of:
    • the CLI? (the main thing)
    • the Rust library API? (the tool is CLI-first, but I spent a lot of effort to have a usable library as well, on which the CLI is built)
      • one of the bigger warts here is possibly scoping (== matching which input to process) using tree-sitter. I'm not fully happy with how that's modeled.

        For example, there's the concept of "premade queries" (example from Python), but those are plain enums, and not coupled by traits: each language has its own PremadeLANGQuery, and also CustomLANGQuery. They're all fully decoupled; not the worst thing, but it's just copy-pasta. Any changes would need to be repeated manually. I feel like if I returned to this part a year from now, it'd be hard to grok.

Low-level questions

  • there were some fun but surprisingly edge-casey algorithms to solve for which I couldn't find useful crates. Specifically, from tree-sitter, I get raw Vec<Range<usize>> of whatever it matched; e.g., all raw byte ranges where comments were found in a source code file.

    However, languages like TypeScript and Python have string interpolation, and tree-sitter cannot natively return (correct me if wrong; I asked about that over here also, which gives good context), "all strings except string interpolation parts". In Python, this would be "Hello " out of f"Hello {var}". tree-sitter can only return the entire string or {var}. I only want "Hello " as users should be able to manipulate (replace, uppercase, ...) that, without accidentally replacing/renaming variables such as var, which would be an error.

    As a solution, I subtract one Vec<Range<usize>> from another: the first one containing matches such as "Hello {var}", the second one {var}, leaving the desired "Hello ". It's a bit like subtraction of sets.

    Now, my solution isn't optimal in terms of complexity. The two range collections are sorted (by start) and non-overlapping, and the second, to-be-subtracted one is a strict subset of the former. I don't make full use of these existing pre-conditions, but gave up after a while... Perhaps this can be improved. Pasting the code here for completeness:

    pub(crate) fn subtract<T>(mut left: Vec<Range<T>>, right: &Vec<Range<T>>) -> Vec<Range<T>>
    where
        T: Ord + Copy + std::fmt::Debug,
    {
        let mut res = Vec::with_capacity(left.len());
    
        #[cfg(debug_assertions)]
        {
            let is_sorted =
                |ranges: &Vec<Range<T>>| ranges.windows(2).all(|w| w[0].start <= w[1].start);
            let is_not_overlapping =
                |ranges: &Vec<Range<T>>| ranges.windows(2).all(|w| w[0].end <= w[1].start);
    
            for ranges in &[&left, right] {
                trace!("Checking preconditions for ranges: {:?}", &ranges);
                debug_assert!(is_sorted(ranges));
                debug_assert!(is_not_overlapping(ranges));
            }
        }
    
        'outer: for l in &mut left {
            for r in right {
                if r.end <= l.start {
                    // Creeping in "from the left"
                    continue;
                }
    
                if r.start >= l.end {
                    // Gone past relevant range, go next
                    break;
                }
    
                if r.start > l.start {
                    // A small part to the left is 'free', aka uncovered by `r`; any later
                    // `r` will be *even further* right, so we can safely push this part.
                    res.push(l.start..r.start);
                }
    
                l.start = r.end;
    
                let is_fully_covered = l.start >= r.start && l.end <= r.end;
                if is_fully_covered {
                    // This one is unrecoverable no matter what comes next, so skip ahead.
                    continue 'outer;
                }
            }
    
            if !l.is_empty() {
                // Might have been decimated from mutation so much that it's empty now.
                res.push(l.clone());
            }
        }
    
        res
    }
    
2 Likes