Tool for language grammar-aware text manipulation

alexpovel · December 3, 2023, 12:59pm

I built a tool (srgn) (quick start: cargo binstall srgn && srgn --help) which can be best thought of as a marriage of tr, rg and tree-sitter:

tr, because the tool does text manipulation, not necessarily just matching; for example, simple character replacement, or more advanced replacing such as --> to ⟶.

In its simplest form, srgn is very similar to tr.
rg, because it can recursively work on all files matching a glob, and matches content with regex support, not tr's concept of character classes
tree-sitter, because content can be matched minding language grammar, making it much more powerful than regular regex. The tree-sitter CLI is great but not a turnkey solution, and doesn't do manipulation.

srgn comes bundled with a bunch of premade tree-sitter queries: this is a convenience feature, supposed to spare end users from having to learn the query language for basic use cases.

In combination, this allows for operations such as:

"in all Python files, replace print with logging.info" (without touching mentions of print anywhere else; see here), or
"for any Go file, fail CI if any struct serializes secret values by accident" (see here)

This project is medium-sized, with 4.5k SLOC (probably 1k+ of which are tests though). I'd appreciate any and all feedback, and am specifically interested in:

High-level questions

does this seem useful (to you)?
did I reinvent the wheel? Miss the forest for the trees? Do existing solutions exist? (I looked and didn't find any)
how are the ergonomics of:
- the CLI? (the main thing)
- the Rust library API? (the tool is CLI-first, but I spent a lot of effort to have a usable library as well, on which the CLI is built)
  - one of the bigger warts here is possibly scoping (== matching which input to process) using tree-sitter. I'm not fully happy with how that's modeled.
    
    For example, there's the concept of "premade queries" (example from Python), but those are plain enums, and not coupled by traits: each language has its own PremadeLANGQuery, and also CustomLANGQuery. They're all fully decoupled; not the worst thing, but it's just copy-pasta. Any changes would need to be repeated manually. I feel like if I returned to this part a year from now, it'd be hard to grok.

Low-level questions

there were some fun but surprisingly edge-casey algorithms to solve for which I couldn't find useful crates. Specifically, from tree-sitter, I get raw Vec<Range<usize>> of whatever it matched; e.g., all raw byte ranges where comments were found in a source code file.

However, languages like TypeScript and Python have string interpolation, and tree-sitter cannot natively return (correct me if wrong; I asked about that over here also, which gives good context), "all strings except string interpolation parts". In Python, this would be "Hello " out of f"Hello {var}". tree-sitter can only return the entire string or {var}. I only want "Hello " as users should be able to manipulate (replace, uppercase, ...) that, without accidentally replacing/renaming variables such as var, which would be an error.

As a solution, I subtract one Vec<Range<usize>> from another: the first one containing matches such as "Hello {var}", the second one {var}, leaving the desired "Hello ". It's a bit like subtraction of sets.

Now, my solution isn't optimal in terms of complexity. The two range collections are sorted (by start) and non-overlapping, and the second, to-be-subtracted one is a strict subset of the former. I don't make full use of these existing pre-conditions, but gave up after a while... Perhaps this can be improved. Pasting the code here for completeness:

pub(crate) fn subtract<T>(mut left: Vec<Range<T>>, right: &Vec<Range<T>>) -> Vec<Range<T>>
where
    T: Ord + Copy + std::fmt::Debug,
{
    let mut res = Vec::with_capacity(left.len());

    #[cfg(debug_assertions)]
    {
        let is_sorted =
            |ranges: &Vec<Range<T>>| ranges.windows(2).all(|w| w[0].start <= w[1].start);
        let is_not_overlapping =
            |ranges: &Vec<Range<T>>| ranges.windows(2).all(|w| w[0].end <= w[1].start);

        for ranges in &[&left, right] {
            trace!("Checking preconditions for ranges: {:?}", &ranges);
            debug_assert!(is_sorted(ranges));
            debug_assert!(is_not_overlapping(ranges));
        }
    }

    'outer: for l in &mut left {
        for r in right {
            if r.end <= l.start {
                // Creeping in "from the left"
                continue;
            }

            if r.start >= l.end {
                // Gone past relevant range, go next
                break;
            }

            if r.start > l.start {
                // A small part to the left is 'free', aka uncovered by `r`; any later
                // `r` will be *even further* right, so we can safely push this part.
                res.push(l.start..r.start);
            }

            l.start = r.end;

            let is_fully_covered = l.start >= r.start && l.end <= r.end;
            if is_fully_covered {
                // This one is unrecoverable no matter what comes next, so skip ahead.
                continue 'outer;
            }
        }

        if !l.is_empty() {
            // Might have been decimated from mutation so much that it's empty now.
            res.push(l.clone());
        }
    }

    res
}

system · March 2, 2024, 1:00pm

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.

Topic		Replies	Views
Text parsing with text replacement on match help	13	2022	November 1, 2021
What features would you like to see in a Rust IDE? editors and IDEs	38	7092	September 8, 2020
Idiomatic code fragment helper inside editor help	5	506	December 18, 2021
We need a great general refactoring tool	13	1524	March 18, 2024
Another tree-sitter based highlighter for VSCode announcements	4	2629	August 26, 2019

Tool for language grammar-aware text manipulation

High-level questions

Low-level questions

Related topics