Text parsing with text replacement on match

Hello everyone,

I'm looking for advice on how to parse and replace text on pattern matching.

Goal

I write notes in Markdown and want to build a tool that parses my notes and replaces text when certain defined patterns are found. I would also love to have the ability to extend my Markdown notes with custom functions. Perhaps if I have something like <date>, it would replace that string with today's date.

I would also like to do it all in Rust so I can compile it and not worry about dependencies. For example, I would like to avoid shelling out to something like sed or any other system command for the text replacement.

Example

# These are my notes
I wrote these notes on <date>.

## Math class
The Algebra 101 class is located in room 243.

## Science class
Prof. Jigi-wigly is teaching Climate 201 in room 322.
Class will start in <date-2021-11-03>.
  • I would like to replace the pattern <date> with today's date.
  • Let's pretend my school has a site https://www.myschool.com/class/#CLASS.html and https://www.myschool.com/class/#ROOMNUMBER.html
    • I would like it to find all patterns of classes and room numbers and replace them with links.
      • The Algebra 101 class is located in room 243. becomes The [Algebra 101](https://www.myschool.com/class/Algebra_101.html) class is located in [room 243](https://www.myschool.com/class/room_243.html).

The Question

I know this can be done with regular expressions, but I want to know if there's a more elegant way to do it. Nom would be cool, except it seems to me like it would be overkill. I found lalrpop a few minutes ago; I'm still not sure if it's a good match for what I'm doing or if, like nom, it's overkill. There are many other parsers, but I'm not sure which would be the simplest for my use case.

I'm defining overkill as over-complicating a simple problem. Maybe I'm wrong to think nom and lalrpop are overkill.

What library/method would you use?

Hello there =D

Thank you for the hint about lalrpop. Here are my two cents. If I were you, I would:

  • try to not be overly perfect
  • find a practical simple solution
  • keep in mind that software changes, and especially your demands on it will too

For the example use case you described, regular expressions are perfectly fine.

Regular expressions are way easier to handle for these small substitutions you are looking for. Read below for a real-life use case of mine in a similar matter.
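For instance, here is a minimal sketch of the room-link substitution with the regex crate (the URL scheme is the one from your example; link_rooms is just an illustrative name):

use regex::{Captures, Regex};

// replace "room NNN" with a Markdown link, following the URL
// scheme from the question
fn link_rooms(text: &str) -> String {
    let re = Regex::new(r"room (\d+)").unwrap();
    re.replace_all(text, |caps: &Captures| {
        format!("[room {0}](https://www.myschool.com/class/room_{0}.html)", &caps[1])
    })
    .into_owned()
}

fn main() {
    println!("{}", link_rooms("The Algebra 101 class is located in room 243."));
}

This prints the linkified sentence from your example.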

These steps may help you:

  1. Write a prototype. A simple and effective (NOT necessarily efficient) one for only one use case. Focus on getting the task done. Leave performance and code polishing for future iterations. You just want to get things done, and you need to develop an understanding of the problem at hand. Design concepts are great: you learn from them. Programming is great: you learn from realizing those concepts.
  2. Write another prototype for another, similar use case. Maybe write it from scratch if you are not fully satisfied with the first approach. Or maybe you would like to try another route. That is totally fine. Again: get things done for the second one too. The second prototype will be easier to write.
  3. Use your prototypes.
  4. Maybe you write more prototypes; a bash script or similar will help you run your prototypes in sequence over your files. Keep this effective and simple.
  5. Decide if you want to continue your project.
  6. Only if you want to continue: start another project which combines your gathered knowledge and prototypes, so that your prototypes work in concert.

my similar use case

I myself hacked together a program which allowed me to write ASCII math in Markdown. The ASCII math code was gathered by delimiters with regular expressions. It was written in JavaScript, because I used a JS lib which converted ASCII math to LaTeX. The "application" was contained in a simple HTML file. Input was an HTML form. To start the "app" I opened the HTML with embedded JS code in Firefox. It worked for me for several documents with large formulas. I stopped writing equations on paper and instead wrote and transformed them with ASCII math. Yet it was a prototype which wasn't even near good code quality. The prototype delivered what it was meant to do, and I coded it with very little knowledge of the topic in one or two days. I am not proud of the code I wrote. I am proud of the quality documents I made. The prototype was not my final product; the math teaching documents were. And they served me well, as did the prototype.

why I find this advice important

Currently I am writing a parser generator myself, because I am unsatisfied with the ones I used before (I tried Nom and Pest). I am doing this because I am unsatisfied with the HTML template languages for Rust. I tried handlebars, tera and ructe, of which ructe is my favourite. The goal of the parser generator is to provide a base for my own template language. As you can see, I am not always following my own advice. Looking back at how much time I have invested in the parser generator already, I should follow my own advice more carefully in the future. Anyway, I am doing this for fun at the moment and do not have much time pressure, so the situation is kind of OK, I guess. I am writing this as a bad example, to show you how easy it is to get lost in a project which started ever so small.

What you want seems to be a simple macro processor. I write all my parsers by hand (that's a lot of fun), the easiest technique being recursive descent. But for such a simple processor, you don't even need an advanced parser.

A simple solution to the problem presented is, for example, the following.

fn date(acc: &mut Vec<u8>) {
    acc.extend("0000-00-00".as_bytes());
}

// The linear search may be replaced by binary search
// or a HashMap lookup, once there is a huge or dynamic
// amount of macros.
static MACROS: &[(&[u8], fn(&mut Vec<u8>))] = &[
    (b"date", date)
];

fn expand_macro(acc: &mut Vec<u8>, name: &[u8])
-> Result<(), String>
{
    for t in MACROS {
        if t.0 == name {
            t.1(acc);
            return Ok(());
        }
    }
    Err(format!("Macro '{}' not found",
        std::str::from_utf8(name).unwrap()))
}

fn expand(text: &str) -> Result<String, String> {
    let a = text.as_bytes();
    let mut i = 0;
    let n = a.len();
    let mut acc: Vec<u8> = Vec::with_capacity(n);
    while i < n {
        if a[i] == b'<' {
            // skip '<', remember where the name starts,
            // then scan to the closing '>'
            i += 1;
            let j = i;
            while i < n && a[i] != b'>' {i += 1;}
            i += 1;
            // a[j..i-1] is the macro name between '<' and '>'
            expand_macro(&mut acc, &a[j..i-1])?;
        } else {
            acc.push(a[i]);
            i += 1;
        }
    }
    Ok(String::from_utf8(acc).unwrap())
}

fn main() {
    println!("{:?}", expand("I wrote these notes on <date>."));
}

Oh wow, I have a lot to go over with the example. :face_with_monocle:
I'm currently trying to figure out what's going on with the playground.

Thanks!

Thank you for the solid advice! I think I will create a prototype as mentioned.

For the example use case you described, regular expressions are perfectly fine.

I'm glad to hear that. I'm confident I can make it work with regular expressions, but where I'm at now is figuring out how to organize the code so that when I want to add new patterns, I can drop one in and have it work without creating spaghetti code.

I would love to drop in new patterns and have the code iterate over all of the patterns, or a subset of them, kind of like using plugins or something.

For example:

fn main() {
    let links = vec!["link-rooms", "link-classes", "link-email"];
    let extensions = vec!["ex_date"];
    todo!();
}

mod linkify {

    fn link_rooms(text: String) -> String {
        // Regular expression for finding rooms

        // Replace all room patterns with links to the URL

        // Return the transformed text
        todo!()
    }

    fn link_classes(text: String) -> String {
        // Regular expression for finding classes

        // Replace all class patterns with links to the URL

        // Return the transformed text
        todo!()
    }

    fn link_email(text: String) -> String {
        // Regular expression for finding email address

        // Replace all email patterns with mailto: links

        // Return the transformed text
        todo!()
    }
}

mod extensionfy {
    fn ex_date(text: String) -> String {
        // Regular expression for finding date pattern
        
        // Replace all date patterns with today's date
        
        // Return the transformed text
        todo!()
    }
}

Playground

Then, with the vectors, I can iterate each function over the text body and save when it's done. No idea if this works or is a good way of doing it. I guess that's what the prototypes are for, huh? :grinning_face_with_smiling_eyes:

Thank you a ton for the great reply, I appreciate it!

Update

I found an article dealing with building a minimal plugin architecture in Python; it's interesting.

Supplementary to the example given, here is a variant that dispenses with u8:

fn date(acc: &mut String) {
    acc.push_str("0000-00-00");
}

static MACROS: &[(&str, fn(&mut String))] = &[
    ("date", date)
];

fn expand_macro(acc: &mut String, name: &str)
-> Result<(), String>
{
    for t in MACROS {
        if t.0 == name {
            t.1(acc);
            return Ok(());
        }
    }
    Err(format!("Macro '{}' not found", name))
}

fn expand(text: &str) -> Result<String, String> {
    let mut acc = String::with_capacity(text.len());
    let mut it = text.chars();
    while let Some(c) = it.next() {
        if c == '<' {
            let rest = it.as_str();
            let mut len = 0;
            while let Some(c) = it.next() {
                if c == '>' {break;}
                len += 1;
            }
            expand_macro(&mut acc, &rest[..len])?;
        } else {
            acc.push(c);
        }
    }
    Ok(acc)
}

fn main() {
    println!("{:?}", expand("I wrote these notes on <date>."));
}

Actually, this one is flawed. Use a macro name containing a multi-byte Unicode character to spot the problem: len counts chars, but the slice &rest[..len] indexes bytes, so a non-ASCII name gets cut short (or the slice panics on a char boundary).

An implementation of expand that allows macro names to contain Unicode:

fn expand(text: &str) -> Result<String, String> {
    let mut acc = String::with_capacity(text.len());
    let mut it = text.char_indices();
    while let Some((i, c)) = it.next() {
        if c == '<' {
            let rest = it.as_str();
            let mut len = rest.len();
            while let Some((j, c)) = it.next() {
                // i and j are byte indices, so this length is in bytes
                // and the slice below stays on char boundaries
                if c == '>' {len = j - i - 1; break;}
            }
            expand_macro(&mut acc, &rest[..len])?;
        } else {
            acc.push(c);
        }
    }
    Ok(acc)
}
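A quick sanity check, assuming (hypothetically) a Unicode-named macro is registered next to date in the MACROS table above:

fn datum(acc: &mut String) {
    acc.push_str("0000-00-00");
}

// hypothetical table extension for the test
static MACROS: &[(&str, fn(&mut String))] = &[
    ("date", date),
    ("dätum", datum)
];

fn main() {
    // the char-counting version slices the name short ("dätu"),
    // while the char_indices version prints
    // Ok("Notes from 0000-00-00.")
    println!("{:?}", expand("Notes from <dätum>."));
}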

Here is some code to get you started on your project prototype. Let me know if you need explanations of the code.

Add the regex crate via Cargo.toml:

[dependencies]
regex = "1"

Here is the main.rs file ^^

// let's use function pointers to set up our expansion process sequence.
// adding a new expansion process is as easy as writing a new function and adding its name to the sequence.
// reference: https://doc.rust-lang.org/reference/types/function-pointer.html

// problem with the array: you must state the number of functions in the array in the declaration.
//here-----------------------------------------------v
const EXPAND_FUNCTIONS_ARRAY: [fn(String) -> String; 2] = [expand_name, expand_class];

// instead of the const array you could use a Vec:
fn expand_functions() -> Vec<fn(String) -> String> {
    vec![expand_name, expand_class]
}

use regex::{Captures, Regex};
use std::{collections::HashMap, fs, io};

fn main() -> io::Result<()> {
    let text = fs::read_to_string("./file.txt")?;
    let result = expand(text);
    // print it to use it with bash scripts and in/output redirects and pipes
    println!("{}", result);
    // or save it to a file
    fs::write("output.txt", result)?;
    Ok(())
}

fn expand(text: String) -> String {
    let mut result = text;
    for expand_function in &expand_functions() {
        result = expand_function(result);
    }
    result
}

fn expand_name(text: String) -> String {
    text.replace("<name>", "nuiun")
}

fn expand_class(text: String) -> String {
    let mut links = HashMap::new();
    links.insert("Algebra 101", "https://algebra101.link");
    links.insert("Software Development", "https://software-dev.link");

    // replace reference: https://docs.rs/regex/1.5.4/regex/struct.Regex.html#method.replace
    // we use replace_all, which offers the same API while replacing every occurrence
    let regex = Regex::new(r"\[(.*) class\]").unwrap();
    regex
        .replace_all(text.as_str(), |captures: &Captures| {
            // beware: &captures[0] is the match of the whole regex \[(.*) class\].
            // &captures[1] is the match of the subgroup (.*)
            let the_class = &captures[1];
            let link = links.get(the_class).unwrap_or(&"cannot find link");
            format!("[{} class]({})", the_class, link)
        })
        .into()
}

For the example code above, try this file.txt as input:

Good to meet you <name>! Welcome to [Algebra 101 class]. I am glad to see you <(^_^)>

Hallo <name>, welcome to our new [Software Development class].

Oh no! This is an [Unknown class].

It will give you this output in the terminal and in output.txt:

Good to meet you nuiun! Welcome to [Algebra 101 class](https://algebra101.link). I am glad to see you <(^_^)>

Hallo nuiun, welcome to our new [Software Development class](https://software-dev.link).

Oh no! This is an [Unknown class](cannot find link).

Fantastic!

The problem I've been trying to solve for a few weekends now is how to ignore tags that already have links, so that I can run the application on a file multiple times without adding links where links are already present.

For example, the following should be ignored instead of linkified:

[Algebra 101 class](https://link.com)

I went ahead and modified your code so that it would run on the Rust Playground in case anyone out there searching wants to see the code in action.

My attempt at skipping expanded links with regex: look-around

So I solved the problem of skipping classes, e.g. [Algebra 101 class](https://www.some.link), by using only regular expressions, in an attempt to keep the code lean, but I ran into issues with more complex regex patterns.

Look-ahead seemed to be the solution, but it doesn't work here, since look-around groups are atomic: the regex engine doesn't re-evaluate the entire pattern once it hits a quantifier.

Using negative look-ahead

First things first: look-around cannot be used with the regex library. The author hasn't implemented look-around and won't be; the reasoning is explained here: https://github.com/rust-lang/regex/issues/127#issuecomment-310152484

So I went ahead and swapped regex out for fancy_regex::Regex, which is based on regex but has additional features such as look-around.

For the example of expanding the classes, it works fine: the \[(.*) class\](?!\() regex takes care of matching every \[(.*) class\] except those immediately followed by a (.
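A minimal sketch of that, assuming fancy_regex's replace_all (which mirrors the regex crate's replace API); the URL scheme is illustrative:

use fancy_regex::{Captures, Regex};

fn main() {
    // the negative look-ahead (?!\() skips classes that are already
    // followed by a "(...)" link; the lazy .*? keeps two classes on
    // one line from being swallowed by a single match
    let re = Regex::new(r"\[(.*?) class\](?!\()").unwrap();
    let text = "See [Algebra 101 class] and [Calculus class](https://already.linked).";
    let out = re.replace_all(text, |caps: &Captures| {
        let name = &caps[1];
        format!("[{} class](https://www.myschool.com/class/{}.html)",
            name, name.replace(' ', "_"))
    });
    println!("{}", out);
}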

The problem when using look-around

I hit a wall when I use a negative look-ahead with more complex pattern searching. Say I'm writing research that contains Bible scriptures and would like to expand scripture references into links to an online Bible; there, look-around fails.

Example:

As you can see, the expression (?<!\[)(([1-2]\s?)?(\w+ \d+:\d)([,-]\s?\d+)?)(?!\]) run on [Exodus 10:5-10](www.link.com) doesn't select the E or the last 0, but everything in between is selected.

If I put the scriptures between tags, then everything is simplified, but it would be nice if tags were not needed. (RegExr: Learn, Build, & Test RegEx)

I'm reaching a point where I'm accepting the fact I will need to put tags around the scripture pattern...


Context sensitivity is often the point where regular expressions break. You have two options:

  1. use a more powerful technique: parsing with grammars, which leads again to nom or lalrpop. This will be a difficult problem to solve with those tools, though.
  2. use a regex and match one or more differentiators. Then you can decide, based on the differentiators, what to do with each match (a Rust sketch follows the demonstration below).
// regex:
(\[)?((?:\d+ )?\p{L}+ \d+:\d+(?:-\d+)?(?:, \d+)?)

// input:
Romans 2:3
[Romans 2:3](www.link.com)

Exodus 10:2, 3
[Exodus 10:2, 3](www.link.com)

Exodus 10:5-10
[Exodus 10:5-10](www.link.com)

1 Timothy 5:2
[1 Timothy 5:2](www.link.com)

// matches with match groups in round brackets "(...)"
()(Romans 2:3)
([)(Romans 2:3)
()(Exodus 10:2, 3)
([)(Exodus 10:2, 3)
()(Exodus 10:5-10)
([)(Exodus 10:5-10)
()(1 Timothy 5:2)
([)(1 Timothy 5:2)

// the first match group is our differentiator
// we only expand those for which the first match group is empty
()(Romans 2:3)
()(Exodus 10:2, 3)
()(Exodus 10:5-10)
()(1 Timothy 5:2)

// regexr.com/631qv
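Here is a minimal Rust sketch of the differentiator approach, using only the plain regex crate (no look-around needed); linkify_scripture and www.link.com are illustrative names:

use regex::{Captures, Regex};

fn linkify_scripture(text: &str) -> String {
    // group 1, the optional "[", is the differentiator: if it
    // matched, the reference is already linked
    let re = Regex::new(r"(\[)?((?:\d+ )?\p{L}+ \d+:\d+(?:-\d+)?(?:, \d+)?)").unwrap();
    re.replace_all(text, |caps: &Captures| {
        if caps.get(1).is_some() {
            // already linked: reproduce the match unchanged
            caps[0].to_string()
        } else {
            format!("[{}](www.link.com)", &caps[2])
        }
    })
    .into_owned()
}

fn main() {
    println!("{}", linkify_scripture("Exodus 10:5-10"));
    println!("{}", linkify_scripture("[Exodus 10:5-10](www.link.com)"));
}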

Good tool you linked there =D I used https://regex101.com/ before.


That's a clever solution, I'm going to give that a shot later this week.

Using nom is tempting because I have to start using it for another project of mine, "Deserializing a .dat binary file created in CPP", where I need to parse a binary file. Getting it working with regex seems like the way to go, though.


Using nom without prior knowledge of parsing is a challenge. Even with knowledge of parsing techniques, it was a challenge for me. I felt most of the challenge came from trying to understand the overwhelming use of generics in nom. Error handling with nom was the next challenge, one I have not mastered yet. nom has a lot of indirections, like: call a function, which returns a function, which you then immediately call. That caused me a headache and big question marks rising up in my head. I got used to it after I experimented and wrote a lot of prototype code. Then I found a way of using it where I understood what it was doing. The result isn't pretty, yet it works.

Your use case is kind of unusual for parsing. It goes in the direction of parsing natural language, which is a really heavy challenge. See it this way: you have some kind of plain text which should remain as it is. Then you have parts of the text the program must recognize as special sequences, and those must be processed. The problem here is: what exactly is "plain text"? In a typical programming language you don't have such a thing as "plain text". Everything is syntactically well defined (most of the grammar, at least).

Markup languages like XML and HTML meet the challenge by strictly differentiating which sequences are "plain text" and which sequences are only there to structure the text or give meta information. That makes them way easier to parse. Take <p>Hello there!</p> for example. <p> is a tag, because it is enclosed in < and >. Hello there! is plain text, because it is not enclosed.

Markdown takes the approach of using sequences like ``` as special markers, like tags. Another approach Markdown takes is to give some symbols a double role: they define syntactic structure, and they would have been written in plain text anyway, for example to build lists. Like this:

- item a
- item b
- item c

which builds

  • item a
  • item b
  • item c

The '-' is a syntactic element, but we would have written it anyway in plain text to build the list. That is what Markdown does really well: taking symbols we would write in plain text anyway and using them as syntactic elements. It is a smart trick to create a language that is easily read and written by humans and, at the same time, more or less easy to parse by computers.

As I showed above, it is possible to create something that just works in your case; it is just not a usual programming-language parsing use case. To solve the problem above, I asked myself: why do I, as a human, understand what should be expanded and what should not? Then I realized that I differentiate the patterns by noticing that one has brackets and the other does not. That is something the computer can do too.

There are problems where we have a sense of what we would probably do, but it is hard to define. It is more or less a gut decision. That is something computers cannot do without AI, neural networks and/or fuzzy logic and the like. An example where this has been solved, by strictly defining what the computer should do in an ambiguous case, is the dangling else problem:

if v1 then if v2 then do_a else do_b

To which if does the else belong? In C and Java, the else belongs to the last if. We could have defined it the other way, though.

There is no single "this is the way you must go". Real-world compilers have quirks inside to make them work and do what they should. The most important thing I learned: get the job done. There is always a new problem we can and want to solve. Time is running. The time we spend polishing one solution could have been invested in investigating and solving other problems.

So many words... I hope you enjoyed reading them xD I wanted to show you that this is not a trivial topic and there is much more to read and learn than you might have thought.

For the future you might be interested in these libraries:
