Fallible replace_all for regular expressions

A closure that I pass to regex::Regex::replace_all may fail. I don't want it to panic in that case but retrieve the error. I tried this:

use regex::{Captures, Regex};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let re = Regex::new("[0-9]+").unwrap();
    let s = "123, 12345678901234567890";
    let mut error: Option<Box<dyn std::error::Error>> = None;
    let _s2 = re.replace_all(s, |caps: &Captures| {
        let n_result: Result<i32, _> = caps[0].parse();
        match n_result {
            Ok(n) => (n * 2).to_string(),
            Err(err) => {
                error = Some(format!("invalid number encountered: {err}").into());
                String::new() // dummy
            }
        }
    });
    if let Some(err) = error {
        Err(err)?;
    }
    Ok(())
}

(Playground)

Errors:

   Compiling playground v0.0.1 (/playground)
    Finished dev [unoptimized + debuginfo] target(s) in 0.94s
     Running `target/debug/playground`
Error: "invalid number encountered: number too large to fit in target type"

That works but it's somewhat ugly. It also will not short-circuit on error.

Implementing Replacer myself, which is a possible solution in this case, doesn't seem to work either, because Replacer::replace_append isn't fallible.

Does that mean I would have to implement the code of Regex::replacen on my own or live with the solution above (that involves a temporary variable and inserting empty dummy strings)?

P.S.: My real-world use case uses floats, so I guess it's okay for me to just parse them as f64::NAN (example in Playground), but I'm still interested in how this problem would be solved in general. Maybe it's a corner case that doesn't occur often.


P.P.S.: Just encountered another case where I can't use f64::NAN as a workaround. :face_with_diagonal_mouth:

The second real-world use case looks like this:

fn render<'a>(&self, machine: &LuaMachine<'a>) -> anyhow::Result<String> {
    let output = RE_CODE.replace_all(&self.template, |caps: &Captures| {
        if let Some(assign) = caps.name("assign") {
            let (line, col) = line_column(&self.template, assign.start());
            let name = format!("assignment in line #{line}, column #{col}");
            let func = machine
                .compile(Some(name), assign.as_str())
                .context("could not compile Lua assignment")?;
            func.call([])
                .context("runtime error when executing Lua assignment")?;
            String::new()
        } else if let Some(expr) = caps.name("expr") {
            let (line, mut col) = line_column(&self.template, expr.start());
            col -= 1;
            let name = format!("expression in line #{line}, column #{col}");
            let func = machine
                .compile(Some(name), format!("return tostring({})", expr.as_str()))
                .context("could not compile Lua expression")?;
            func.call_1ret([])
                .context("runtime error when executing Lua expression")?
                .try_into_string()
                .context("Lua did not return a string")?
        } else {
            unreachable!()
        }
    });
    let output = RE_COMMENT.replace_all(&output, "");
    Ok(output.to_string())
}

And I obviously can't use the ? operator within the closure passed to replace_all.

A (sort of) short-circuiting variant would be this:

use regex::{Captures, Regex};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let re = Regex::new("[0-9]+").unwrap();
    let s = "123, 12345678901234567890";
    let mut error: Option<Box<dyn std::error::Error>> = None;
    let _s2 = re.replace_all(s, |caps: &Captures| {
        if error.is_some() {
            return String::new(); // dummy
        }
        match (|| {
            let n: i32 = caps[0].parse()?;
            Ok::<_, Box<dyn std::error::Error>>((n * 2).to_string())
        })() {
            Ok(x) => x,
            Err(err) => {
                error = Some(err);
                String::new() // dummy
            }
        }
    });
    if let Some(err) = error {
        Err(err)?;
    }
    Ok(())
}

(Playground)

But… it feels even more ugly than my first attempt. :slightly_frowning_face:


Or maybe do two runs?

use regex::{Captures, Regex};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let re = Regex::new("[0-9]+").unwrap();
    let s = "123, 12345678901234567890";
    let mut replacements: Vec<String> = Vec::new();
    for caps in re.captures_iter(s) {
        let n: i32 = caps[0].parse()?;
        replacements.push((n * 2).to_string());
    }
    let mut replacements_iter = replacements.into_iter();
    let _s2 = re.replace_all(s, |_: &Captures| {
        replacements_iter.next().unwrap()
    });
    Ok(())
}

(Playground)

According to the docs you linked to:

If you need to write a replacement routine where any individual replacement might “fail,” doing so with this API isn’t really feasible because there’s no way to stop the search process if a replacement fails. Instead, if you need this functionality, you should consider implementing your own replacement routine.

If you explain what your exact use case is, maybe someone could suggest another method. It's not really clear to me from the code you've posted.

Oh, I should have read that documentation first. :sweat_smile: Though it still doesn't really suggest a solution for the problem.

Basically I replace occurences of [some Lua code] in a template with the result of the execution of the code (which, of course, is fallible). E.g.:

Hello [math.sin(1)]

would result in:

Hello 0.8414709848079

And if the input string is:

Hello [syntax!error]

I would like to report some sort of error (because syntax!error isn't a valid Lua expression).

Where does the error get sent? Is there an interactive user interface or do you log the error to some text file?

Also, are you extracting anything between a [ bracket and the nearest ], passing it to the Lua interpreter and replacing the expression with the result?

My current plan is to let main return an anyhow::Result, so the error (including the Context stack) gets printed to stderr. This is already nicer as simply panicking, though I could imagine writing a customer error reporting routine at a later stage.

Yes, exactly that.


I decided that maybe this way is best:

fn render<'a>(&self, machine: &LuaMachine<'a>) -> anyhow::Result<String> {
    let mut evaluated: Vec<String> = Vec::new();
    for caps in RE_CODE.captures_iter(&self.template) {
        evaluated.push(
            /* some code here that may fail and return early */
        );
    }
    let mut evaluated = evaluated.into_iter();
    let output =
        RE_CODE.replace_all(&self.template, |caps: &Captures| evaluated.next().unwrap());
    let output = RE_COMMENT.replace_all(&output, "");
    Ok(output.to_string())
}

So basically pre-calculating all strings to be used as a replacement in a for loop that uses the Regex::captures_iter method. As this doesn't require a closure, I can use the ? operator there.

Not nice but… works, I guess.

Could you not do something like this?

// extract anything between a square [ bracket
// and the closest ]
// e.g.
// Hello [math.sin(1)] and [syntax!error]
// =>
// "[math.sin(1)]"
// "[syntax!error]"
let re = Regex::new(r"\[[^\]]+?\]").unwrap();

let hay = "Hello [math.sin(1)] and [syntax!error]";

let snippets = re.find_iter(hay).map(|m| m.as_str());

for snippet in snippets {
    println!("{snippet:?}");
    // do whatever with the code snippet here
}

You would have to do the replace operation yourself. Assuming each code snippet is independent from the others, you could just do a simple find and replace with a simple string replace().

The idea is to insert the output of each code snippet into the template.

E.g. if the template is:

One plus on is [1+1].

I want the [1+1] to be replaced by what happens when I run "return 1+1" as Lua code. This means I need to replace and each replace operation is fallible because the code could exhibit syntax or runtime errors.

Sure, I get that, but you don't need to do everything in one step.

My suggestion is to use find_iter(), which will yield every template one by one. On each iteration of the loop, you can call the Lua interpreter and get the result. If it's an error, you can do your customized error handling, and if it's all right, you can do a simple string find-and-replace (no need to use a Regex).

How do I do that? I mean: I need to match r"\[[^\]]+?\]", which is a regular expression and not a simple string.

let s = "my code with [1 + 1]";
s.replace("[1 + 1]", "2");

Regex can help you find [1 + 1], but you don't need to use it for the replace operation.

It does. It suggests implementing your own replacement routine. It's unclear to me why you're hesitant to do that. See: What's the idiomatic approach for Replacers that may fail? · Issue #648 · rust-lang/regex · GitHub

2 Likes

If I replace each snippet one at a time, I allocate a new String each time. Doesn't seem to be a good way.

Hmm, okay, that's basically what I wrote here:

Which is more or less the same code as suggested by the docs. edit: not really, see below.

I guess I feel like it's an implementation detail that I would like to be handled by a library. But looks like doing "manual" string operation is the idiomatic/suggested way here.


Looking at the source code of Regex::replacen, I see several optimizations, for example, which I would lose if I write some short code myself.

Then add those optimizations if you need them. They generally only make sense in a generic context. The replacement routine is trying to be different things to different people with different use cases. So for example, if your replacement string doesn't have any capture group names in them, then it doesn't need to use captures_iter and instead can do find_iter. But if you're implementing your own replacement routine for a specific case, you presumably already know whether you need capture groups or not.

Otherwise that's pretty much it other than returning a borrowed Cow in the case where no matches are found.

"I lose optimizations" devoid of any context doesn't mean anything to me. Write the code that works. Is it too slow? Okay, then circle back around and investigate making it faster. But if the simple approach is fast enough, then just leave it alone.

3 Likes

Well, it boils down to:

  • regex doesn't support fallible replaces.

If I need them, I got to implement them myself. That's okay, and I can work with it.

I still don't think it's optimal, but maybe fallible replaces are too exotic to really warrant providing a lot of duplicate methods (one for the infallible and one for the fallible operation). So I understand if you don't want to extend the API in that matter. (And even if you did, what about async replaces, etc. etc.).

Side note: this seems to be an issue not specific to regex but an issue of Rust. It often happens when an API uses closures instead of iterators:

  • The ? operator won't work.
  • async won't work.

(Unless you provide a fallibe, or async, API, respectively, which is unergonomic in non-async and infallible cases.)

Yes, there are broader challenges here indeed. API design is hard, and it's often much harder to know what to leave out than to what to leave in. I don't have any hard evidence that fallible replaces are not exotic for example. I just have a guess that the vast majority of folks needing replacements do not need a fallible closure.

Exacerbating this issue is the fact that there are three different replacement routines: replace, replacen and replace_all. So does adding a try variant meant we need three more replacement routines? That starts to clutter up the API. Maybe we could just get away with try_replacen, but in my experience folks will continue to ask for the other two methods as well.

How much of a problem this is IMO depends on what you're saving by providing more APIs or more complicated APIs. If it were saving something that was impossible or extremely difficult to implement on your own, then the case for adding fallible replacement routines strengthens. But that's just not the case here. The thing being saved is a few lines of code that are just not that hard to write. So instead of forcing the ~99% use case to pay the API tax of more complexity, I chose to force the ~1% of the population to pay the implementation tax to write their own code.

I am generally willing to revisit these decisions as new evidence or arguments appear. Otherwise, it just really boils down to a judgment call.

I sometimes wished Rust's type(?) system would be more advanced in some ways, so these duplications wouldn't be necessary. Afterall, fallibility and async are (from a functional programming point of view) the same: Monads. :grin:

I don't. Abstraction solves all problems except for the problem of abstraction. But that's a broader philosophical question that's out of scope here.

2 Likes

Well, there's a continuum. std also provides Result and Option instead of proposing every API to implement their own type constructor for that.

And I feel like with HRTBs, async iterators (streams), etc., there is a trend to put more and more abstractions into Rust (or its ecosystem).

I do understand that too much abstraction can be harmful too, but from a practical p.o.v, I feel like Rust developers still face having to write boilerplate code in several occasions (as in the the case discussed here), which is, in my opinion, not a good sign for a modern programming language.

Of course it's a continuum. You expressed one point on that continuum and I expressed a different point on that same continuum. I'm not against all new expressive features, but I'd rather be much more circumspect about it. And in particular, I am perfectly okay with some boiler plate here and there because IMO, that optimizes better for reading code than writing code. That is, I see overly generic code to be much harder to read, and usually this comes with the benefit of serving niche use cases at the expense of making things more difficult to understand for the common case.

I'm muting notifications from this thread. I've had this exact conversation with dozens of people over the last years. It always ends the same way.

2 Likes