Series of string replacements in Rust

Hey everybody!

As a project to learn Rust, I'm trying to port this library from Python. It is basically a bunch of functions that take Yiddish strings as input, modify those strings, then output them.

This means that there are a lot of repeated string replacements. I have struggled a lot with the borrow checker, but I've found a way of doing these replacements that compiles. Nonetheless, I have a feeling I am being inefficient/unidiomatic or just doing something wrong in general... For example, the number of lets in replace_punctuation() seems bad, but I can't figure out another way to do it.

I would be very grateful for any critique or advice on working with strings in Rust! Below is my code:


use regex::Regex;
use std::borrow::Borrow;
//////////
//encoding
//////////

const PAIRS: [(&str, &str); 14] = [    
    ("וּ", "וּ"),
    ("יִ", "יִ"),
    ("ײַ", "ײַ"),
    ("וו", "װ"),
    ("וי", "ױ"),
    ("יי", "ײ"),
    ("אַ", "אַ"),
    ("אָ", "אָ"),
    ("בֿ", "בֿ"),
    ("כּ", "כּ"),
    ("פּ", "פּ"),
    ("פֿ", "פֿ"),
    ("שׂ", "שׂ"),
    ("תּ", "תּ"),
];

fn replace_with_precombined(input: &str) -> String{
    let mut result = input.to_string();
    for pair in PAIRS{
        result = result.replace(pair.0, pair.1);
    }
    result = result.replace("בּ", "ב"); //diacritic not used in YIVO
    result = result.replace("בּ", "ב");
    return result;
}

// When vov_yud == true, these will be preserved as precombined chars:
//      װ, ײ, ױ
fn replace_with_decomposed(input: &str, vov_yud: bool) -> String{
    let mut result = input.to_string();
    for pair in PAIRS{
        if vov_yud{
            match pair.1 {
                "װ" | "ױ" | "ײ" => (),
                _ => result = result.replace(pair.1, pair.0),
            }
        }  else {
            result = result.replace(pair.1, pair.0);
        }      
    }
    result = result.replace("ייַ", "ײַ");
    result = result.replace("בּ", "ב");
    result = result.replace("בּ", "ב");

    return result;
}

fn replace_punctuation(input: &str) -> String{
    let result = input;
    let re = Regex::new(r"[-]").unwrap();
    let result = &re.replace_all(result, "־"); //YIVO-style hyphen

    let re = Regex::new(r"[′׳]").unwrap();
    let result = &re.replace_all(result, "'");

    let re = Regex::new(r"[″״]").unwrap();
    let result = &re.replace_all(result, "\"");

    return result.to_string();
}


fn strip_diacritics(input: &str) -> String{
    let result = replace_with_decomposed(input, false);

    let re = Regex::new(r"[ִַַָּּּּֿֿׂ]").unwrap();
    let result = &re.replace_all(result.as_str(), "");

    return result.to_string();
}
///////////////////////////////////////////
// transliteration/romanization and reverse
///////////////////////////////////////////

///////////////////////////////////////////
// import loshn-koydesh pronunciation list
///////////////////////////////////////////

fn main() {
    let input = "′׳-″״";
    let stringed = strip_diacritics("װאָס הערט זיך מײַן חבֿר?");
    println!("{}", stringed);
}

Since you are doing a lot of single-char replacements, I would consider iterating over input.chars() and building up your output string char-by-char, using a match expression to decide what to do with each input char. This will reduce the number of times you need to iterate through the input, and the number of intermediate strings you need to create.

For example, something like this:

fn replace_punctuation(input: &str) -> String {
    let mut result = String::with_capacity(input.len());
    for c in input.chars() {
        let next_char = match c {
            '-' => '־',
            '′' | '׳' => '\'',
            '″' | '״' => '"',
            _ => c,
        };
        result.push(next_char);
    }
    result
}

For simple cases like that one, you can even simplify the code to:

fn replace_punctuation(input: &str) -> String {
    input.chars().map(|c|
        match c {
            '-' => '־',
            '′' | '׳' => '\'',
            '″' | '״' => '"',
            _ => c,
        }
    ).collect()
}

For cases where one char maps to multiple chars, you could use push_str instead of push, or use multiple push calls inside the match.
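
As a rough sketch of the push_str variant, here is a version that expands the three Yiddish ligatures into two letters each. The function name decompose_ligatures is made up for illustration, and the characters are written as Unicode escapes only to make the code points unambiguous:

```rust
// Hypothetical helper: expands the three Yiddish ligatures into their
// two-letter spellings and copies every other char through unchanged.
fn decompose_ligatures(input: &str) -> String {
    // The capacity is only a starting hint; the String grows if needed.
    let mut result = String::with_capacity(input.len());
    for c in input.chars() {
        match c {
            '\u{05F0}' => result.push_str("\u{05D5}\u{05D5}"), // double vov => vov + vov
            '\u{05F1}' => result.push_str("\u{05D5}\u{05D9}"), // vov yud => vov + yud
            '\u{05F2}' => result.push_str("\u{05D9}\u{05D9}"), // double yud => yud + yud
            _ => result.push(c),
        }
    }
    result
}
```

Because the match arms now push directly instead of returning a char, arms that produce one char and arms that produce several can sit side by side.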

Note, you’ll still need to do something more complex for multi-char patterns like the ones in PAIRS.
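
One possible way to handle the two-chars-into-one direction without a regex is to peek at the next char before deciding. This is only a sketch: combine_ligatures is a hypothetical name, it covers just the three letter pairs from PAIRS that consist of plain letters, and it works strictly left to right:

```rust
// Hypothetical helper: merges the letter pairs vov+vov, vov+yud and yud+yud
// into the corresponding single-codepoint ligatures.
fn combine_ligatures(input: &str) -> String {
    let mut result = String::with_capacity(input.len());
    let mut chars = input.chars().peekable();
    while let Some(c) = chars.next() {
        // Look at the following char without consuming it yet.
        let combined = match (c, chars.peek().copied()) {
            ('\u{05D5}', Some('\u{05D5}')) => Some('\u{05F0}'), // vov + vov
            ('\u{05D5}', Some('\u{05D9}')) => Some('\u{05F1}'), // vov + yud
            ('\u{05D9}', Some('\u{05D9}')) => Some('\u{05F2}'), // yud + yud
            _ => None,
        };
        match combined {
            Some(ligature) => {
                chars.next(); // consume the second half of the pair
                result.push(ligature);
            }
            None => result.push(c),
        }
    }
    result
}
```

The same shape should extend to letter-plus-diacritic pairs by adding more arms, though longer patterns would need more lookahead than a single peek.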


Great idea -- thank you! Will definitely implement this for punctuation. I tried this method with the PAIRS, but gave up because characters like פֿ can actually be compounds, made up of multiple characters...

Do you have any recommendations for replacing multiple characters? Turning single characters into multiple characters (e.g. ױ => וי ) seems simple enough, but the other way (turning multiple characters into one character) seems much trickier... Maybe match for two characters at a time then replace? (e.g. if "ו" is followed by "י", then replace them with the single character "ױ")

Write a single Regex that matches all of your strings, then use a match on what it matched to determine what to replace it with.

fn replace_multi(input: &str) -> String {
    let re = Regex::new(r"(ab|cd)").unwrap();
    
    re.replace_all(input, |captures: &regex::Captures| {
        match &captures[0] {
            "ab" => "AB",
            "cd" => "CD",
            _ => unreachable!(),
        }
    }).into()
}

The best option for efficient substring replacement is to do all replacements in a single pass over the string, as shown above. Also, there's nothing wrong with having a lot of lets in a function; it is no worse than reusing a single variable.

Still, to better understand the language: the first thing you would do for such a rewrite is to stop relying on temporary lifetime extension, by moving the &s to where the variables are used, not where they're declared. (Temporary lifetime extension is useful, but it complicates the picture by essentially introducing extra hidden variables whose scope you can't set and which you can't explicitly borrow or move.)

fn replace_punctuation(input: &str) -> String {
    let re = Regex::new(r"[-]").unwrap();
    let result = re.replace_all(input, "־");
    
    let re = Regex::new(r"[′׳]").unwrap();
    let result = re.replace_all(&result, "'");

    let re = Regex::new(r"[″״]").unwrap();
    let result = re.replace_all(&result, "\"");

    result.into()
}

Now, in most cases this would be sufficient to enable replacing the lets with assignments, but there's something trickier here: replace_all returns a Cow<'h, str> value that might borrow its input (in the case where no replacements were needed). And each one of the replace_alls might make no changes and therefore return the borrow of the previous stage, which then has to continue existing until it can be returned. So, actually, the sequence of lets is the straightforward efficient way to write this code, with the least amount of unnecessary copying (given the premise that we're calling replace_all repeatedly and not doing something more efficient than that) unless you start matching the Cow to handle the Borrowed and Owned cases separately.
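
To make the Borrowed/Owned distinction concrete, here is a minimal std-only sketch. replace_hyphen is a hypothetical stand-in for a single replacement stage, not how the regex crate implements replace_all; it just mimics the "borrow when nothing changed" behaviour described above:

```rust
use std::borrow::Cow;

// Hypothetical stand-in for one replacement stage: it allocates a new
// String only when the input actually contains an ASCII hyphen.
fn replace_hyphen(input: &str) -> Cow<'_, str> {
    if input.contains('-') {
        // Something changed, so a freshly allocated String is returned.
        Cow::Owned(input.replace('-', "\u{05BE}")) // U+05BE is the maqaf
    } else {
        // Nothing changed, so the result is just a borrow of the input,
        // and the input must stay alive for as long as the result does.
        Cow::Borrowed(input)
    }
}
```

When the Borrowed case is hit, the returned value still points into the previous stage's string, which is why that previous value cannot simply be overwritten.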

Here's the inefficient version with extra string copies:

fn replace_punctuation(input: &str) -> String {
    let mut result: String = input.to_owned();
    
    let re = Regex::new(r"[-]").unwrap();
    result = re.replace_all(&result, "־").into();
    
    let re = Regex::new(r"[′׳]").unwrap();
    result = re.replace_all(&result, "'").into();

    let re = Regex::new(r"[″״]").unwrap();
    result = re.replace_all(&result, "\"").into();

    result
}

And we can then turn that into a loop:

fn replace_punctuation(input: &str) -> String {
    let mut result: String = input.to_owned();

    for (regex, replacement) in [
        (Regex::new(r"[-]").unwrap(), "־"),
        (Regex::new(r"[′׳]").unwrap(), "'"),
        (Regex::new(r"[″״]").unwrap(), "\""),
    ] {
        result = regex.replace_all(&result, replacement).into();
    }

    result
}

Or if you want to avoid the extra string copies incurred by the into() conversion from Cow<str> to String, it's also possible, but you have to dive a little deeper into the nuances of borrowing and Cow:

use regex::Regex;
use std::borrow::Cow;

fn replace_punctuation(input: &str) -> String {
    let mut result: Cow<'_, str> = Cow::Borrowed(input);

    for (regex, replacement) in [
        (Regex::new(r"[-]").unwrap(), "־"),
        (Regex::new(r"[′׳]").unwrap(), "'"),
        (Regex::new(r"[″״]").unwrap(), "\""),
    ] {
        let output: Option<String> = match regex.replace_all(&result, replacement) {
            Cow::Borrowed(_) => {
                // `replace_all` guarantees that this case means the string is
                // unchanged. No further action is needed.
                None
            }
            Cow::Owned(s) => {
                // A new string is needed.
                Some(s)
            }
        };

        // `output` is `Some` if and only if there is a new string to store.
        // By separating the above and below code and passing only the
        // `Option<String>`, we avoid borrow conflicts; the `&result` passed
        // to the `replace_all()` call has been dropped, so we are free to
        // reassign `result`.

        if let Some(s) = output {
            result = Cow::Owned(s);
        }
    }

    result.into()
}

Again, I don't recommend actually using this code, because it is not the most efficient way to do multiple replacements. I'm posting it only to illustrate general things about let, borrowing, and Cow.


For multi-replacements it is much more efficient to use the aho-corasick crate.
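
I'm not showing the crate's API here, but the single-pass idea behind it can be sketched with std alone. To be clear, this is a naive stand-in and not the Aho-Corasick algorithm: it tries every pattern at every position, which is exactly the repeated work the crate is designed to avoid. The name replace_all_pairs is made up for illustration:

```rust
// Hypothetical helper: one left-to-right pass over the input. At each
// position the first pattern in `pairs` that matches is replaced;
// otherwise a single char is copied through.
fn replace_all_pairs(input: &str, pairs: &[(&str, &str)]) -> String {
    let mut result = String::with_capacity(input.len());
    let mut rest = input;
    'outer: while let Some(c) = rest.chars().next() {
        for &(from, to) in pairs {
            // Empty patterns are skipped so the loop always makes progress.
            if !from.is_empty() && rest.starts_with(from) {
                result.push_str(to);
                rest = &rest[from.len()..];
                continue 'outer;
            }
        }
        // No pattern matched here: copy one char and move on.
        result.push(c);
        rest = &rest[c.len_utf8()..];
    }
    result
}
```

A table like PAIRS could be passed in as `&PAIRS`, or with each tuple swapped for the reverse direction. Unlike chained replace calls, already-replaced text is never scanned again.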

