Optimal approach to handle concurrent match and replace operations

This is kind of two questions rolled into one. I'm trying to parse a vector of strings, applying multiple replacement operations, and build a new vector of strings while minimizing repeated work.

Essentially I'm working on a superset of the snippet syntax defined by the LSP specification that would support modular snippets, i.e. snippets that can be composed of other snippets, which I believe would be easier to handle separately from the starting character for placeholder values and variables ($). Here are a few example snippets:

[if]
name = "if"
body = ["if ${1:expression}:", "\t${2:pass}"]

[else]
name = "else"
body = ["else:", "\t$1"]

["if/else"]
name = "if/else"
body = ["@if", "@{else}"]

My first question (and probably a shining example of premature optimization) is: how do I efficiently structure the replacement code for the syntax as is?

I'm not exactly trying to do the operation in place, since the resulting vector of strings is going to be longer. I figured it would be more efficient to create a new vector and either push strings or append the contents of other vectors, rather than insert the contents of a new vector into the middle of an existing one.

I'm also pretty sure this is an impossible task for a single regex, but it might be optimal to use multiple precompiled regexes depending on the current match group. I think I could use match arms on the capture group to do things like count placeholders and pass an offset to the functions fetching the referenced snippets. The one thing I'm considering, but am not sure is viable, is passing slices of the string to functions based on the indices of match starts and match ends (treating each submatch as independent). I don't know if it's possible to find the opening and closing brace of a placeholder without backreferences or lookaround: something like ${1:example_of_embedded_placeholder${2}} would capture the first } if operating on the entire string, though my initial idea to just grab the leftmost } wouldn't work either (ex: ${1:somebody_who_hates_coders}={$2:None}).

My second question: since there isn't really a hard requirement for the syntax to be a true superset, if I wanted to change the syntax to make it faster to parse while still keeping it easy to write, what would you suggest?

EDIT

Since this may not be clear, these are the operations I need to do:

  1. count placeholders and replace them with placeholders plus an offset
  2. recursively rebuild the child snippets with offsets
  3. potentially make specific placeholders of the child snippets match supplied placeholders from the top-level snippet, ex: @{else(1&)}
  4. potentially replace specific placeholders of the children, removing them as a snippet, ex: @{else(1!:break)}
  5. variable substitutions (though I'm not worried about that now)
  6. match things like {}() etc. ONLY if inside brackets after a character signifying that the snippet parser needs to do work
  7. finally, spit out a new vector of strings composed of the summation of all the previous work
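
For the first operation in the list (count placeholders and add an offset), a minimal sketch could look like the following. The function name `shift_placeholders` is made up for illustration. It assumes that escapes such as `\$` do not need handling, that `$0` (the final cursor position in LSP snippets) should stay unshifted, and that anything after `$` that is not a number (for example a variable name) is left alone.

```rust
/// Adds `offset` to every numbered tab stop (`$N` or `${N...`) in `line`.
/// Returns the rewritten line and the highest original index seen, so the
/// caller can compute the offset for the next child snippet.
/// Assumptions: `\$` escapes are not handled, and `$0` is never shifted.
fn shift_placeholders(line: &str, offset: u32) -> (String, u32) {
    let mut out = String::with_capacity(line.len());
    let mut max_seen: u32 = 0;
    let mut rest = line;
    while let Some(pos) = rest.find('$') {
        // Copy everything up to and including the `$`.
        out.push_str(&rest[..=pos]);
        rest = &rest[pos + 1..];
        // An optional `{` directly after the `$` is copied unchanged.
        if let Some(stripped) = rest.strip_prefix('{') {
            out.push('{');
            rest = stripped;
        }
        // A run of ASCII digits is the tab-stop index; anything else
        // (such as a variable name) is copied later without changes.
        let digits = rest.bytes().take_while(u8::is_ascii_digit).count();
        if let Ok(n) = rest[..digits].parse::<u32>() {
            max_seen = max_seen.max(n);
            let shifted = if n == 0 { 0 } else { n + offset };
            out.push_str(&shifted.to_string());
            rest = &rest[digits..];
        }
    }
    out.push_str(rest);
    (out, max_seen)
}
```

Calling `shift_placeholders("\t$1", 2)` on the body of the `else` snippet would produce `"\t$3"`, which no longer collides with the two placeholders of `if`. This is a single left-to-right scan with no regex, so each input line is visited once.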

The major subproblems seem to be matching each bracket with the appropriate closing bracket and keeping track of indices. I'm thinking the best solution would be to split the string(s) into non-overlapping substrings, pass the slices to different regex matchers/functions, and then recombine the results. Since the functions operate on slices, I'm thinking that should still be in place until the step where the results have to be concatenated.
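
The bracket-matching subproblem can be handled with a depth counter instead of a regex. The sketch below is only an illustration: `matching_brace` is a hypothetical helper, and it assumes escaped braces (such as `\}`) do not occur. It returns byte indices, so the caller can slice the original string without copying.

```rust
/// Returns the byte index of the `}` that closes the `{` at byte index
/// `open`, or `None` if `open` is not a `{` or the brace is never closed.
/// Assumption: escaped braces such as `\}` are not handled.
fn matching_brace(s: &str, open: usize) -> Option<usize> {
    if s.as_bytes().get(open) != Some(&b'{') {
        return None;
    }
    let mut depth = 0usize;
    // Scanning bytes is safe here: `{` and `}` are ASCII and can never
    // appear inside a multi-byte UTF-8 sequence.
    for (i, b) in s.bytes().enumerate().skip(open) {
        match b {
            b'{' => depth += 1,
            b'}' => {
                // `depth` is at least 1 here, because the scan starts on `{`
                // and returns as soon as the depth drops back to 0.
                depth -= 1;
                if depth == 0 {
                    return Some(i);
                }
            }
            _ => {}
        }
    }
    None
}
```

With `${1:example_of_embedded_placeholder${2}}`, starting from the first `{` yields the index of the last `}`, while with `${1:somebody_who_hates_coders}={$2:None}` it yields the first `}`. The slice `&s[open + 1..close]` is then the placeholder body, which can be passed on to another function without allocating.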

This is not possible to do with only regular expressions; you will need a more powerful parsing strategy for at least this part. The easiest strategy is probably to do this in phases:

  1. Build a syntax tree where each snippet name to be replaced is a single node.
  2. Transform the tree into one that contains only output text (using the snippet definitions).
  3. Flatten the resulting tree into a vector.
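
A rough sketch of these phases, under some simplifying assumptions that are not in the original post: a snippet reference must occupy a whole body line (as in the `if/else` example), snippets are stored as `HashMap<String, Vec<String>>`, there are no reference cycles, and placeholder offsets and indentation are ignored. The names `Node`, `parse_line`, and `expand` are made up for illustration, and phases 2 and 3 are collapsed into one recursive pass.

```rust
use std::collections::HashMap;

/// One node of a (very small) syntax tree for a snippet body line.
#[derive(Debug, PartialEq)]
enum Node {
    /// A line of plain output text.
    Text(String),
    /// A reference to another snippet, written `@name` or `@{name}`.
    Ref(String),
}

/// Phase 1: turn a body line into a node.
fn parse_line(line: &str) -> Node {
    match line.strip_prefix('@') {
        Some(rest) => {
            // Accept both `@name` and `@{name}`.
            let name = rest
                .strip_prefix('{')
                .and_then(|r| r.strip_suffix('}'))
                .unwrap_or(rest);
            Node::Ref(name.to_string())
        }
        None => Node::Text(line.to_string()),
    }
}

/// Phases 2 and 3: resolve references recursively and append the resulting
/// text lines to `out`. Unknown snippet names are reported as an error.
/// Assumption: the snippet definitions contain no cycles.
fn expand(
    name: &str,
    snippets: &HashMap<String, Vec<String>>,
    out: &mut Vec<String>,
) -> Result<(), String> {
    let body = snippets
        .get(name)
        .ok_or_else(|| format!("unknown snippet: {name}"))?;
    for line in body {
        match parse_line(line) {
            Node::Text(text) => out.push(text),
            Node::Ref(child) => expand(&child, snippets, out)?,
        }
    }
    Ok(())
}
```

Because references are resolved by name at expansion time, the order in which snippets were inserted into the map does not matter, only that every referenced snippet is present when `expand` is called.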

Thank you for your response. I'm reading up on syntax trees in Rust right now, and I'm trying to decide between pest and nom. Either way, both feel better suited to the task than trying to solve the problem with multiple regexes.

To make sure I'm understanding what you are saying: do you mean build an AST out of that particular snippet and then parse with a predefined parser or make a syntax tree for parsing out of all available options (to use when parsing snippets)?

If you mean the latter, I already have the snippets stored in a hashmap. I went with a lazy approach of rebuilding the snippets when they are called, mainly because there was no way to guarantee ordering when deserializing. I also want to avoid statically defining replacements (outside of variables), because I'm trying to make it so that snippets can be loaded at runtime (to support library-based snippets, for example loading numpy snippets if the user imports numpy).

The former is what I'm currently aiming for, and I'm hoping it's possible in a dynamic way (without much knowledge of what's available, other than that the ordering of snippet set loading will hopefully guarantee that the snippets are available when called).

The last two steps sound pretty similar to the code I'm using to deserialize the snippets from toml files.

#[derive(Deserialize, Clone, Debug)]
pub struct Loader {
    #[serde(flatten, with = "tuple_vec_map")]
    pub(crate) snippets: Vec<(String, Snippet)>,
}
...

pub fn load(&mut self, language: &str, snip_set_name: &str, snippet_data: &str) {
    // Deserialize the whole file; this panics on malformed TOML.
    let temp: Loader = toml::from_str(snippet_data).unwrap();
    let mut snippet_set: Vec<String> = Vec::with_capacity(temp.snippets.len());
    for (snippet_key, snippet) in temp.snippets.iter() {
        // Snippets are keyed by (language, snippet name).
        self.snippets.insert(
            (language.to_string(), snippet_key.to_string()),
            snippet.to_owned(),
        );
        snippet_set.push(snippet_key.to_string());
    }
    // Record which snippet names belong to this set.
    self.snippet_sets.insert(
        (language.to_string(), snip_set_name.to_string()),
        SnippetSet::new(snippet_set),
    );
}

I don't know if this helps, but serde_json has a "preserve_order" feature.

I've been debating whether to switch to storing the files as json rather than toml, mainly because it seems like serde_json has a few features that are missing from the toml crate. It would also make it possible to use vscode snippet files directly (as a fallback and to ease migration).
