Convert `&#x2020;` to Unicode character

I have a text in which Unicode characters are written as hexadecimal character references such as `&#x1234;`, and I am trying to convert them into actual Unicode characters that Rust recognizes.

How can I accomplish this?

I tried the following:

fn fix_unicode(s: &str) -> String {
    lazy_static! {
        static ref UNI: Regex = Regex::new(r"&#x(?P<n>[A-F0-9]+);").unwrap();
    }
    UNI.replace_all(s, "\\u{$n}").into_owned()
}

But instead of the Unicode character I get something like this:
Syst. Verz. S\\u{00E4}ug. II. 1845

You can use `u32::from_str_radix` to parse the hex digits, then convert the resulting code point to a `char`.

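use std::convert::TryFrom; // already in the prelude from the 2021 edition on

fn to_char(s: &str) -> char {
    // Take the hex digits between the leading "&#x" and the trailing ";".
    let hex = &s[3..s.len() - 1];
    // Parse them as a base-16 code point.
    let val = u32::from_str_radix(hex, 16).unwrap();
    // Panics if the value is not a valid Unicode scalar value.
    char::try_from(val).unwrap()
}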
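
As for why the original attempt printed the escape literally: `\u{...}` is only interpreted by the compiler inside a string literal. A replacement string built at run time is just text, and `"\\u{$n}"` additionally escapes the backslash itself. A small sketch of the difference (the two function names are made up for illustration):

```rust
/// Returns the eight literal characters `\u{00E4}`: the doubled backslash
/// escapes the backslash, so no character escape is processed.
fn literal_escape() -> &'static str {
    "\\u{00E4}"
}

/// Returns the single character U+00E4, resolved by the compiler
/// when it reads the string literal.
fn real_char() -> &'static str {
    "\u{00E4}"
}
```

So the conversion from hex digits to a `char` has to happen in your own code at run time.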

And you can use a replacer function to apply this to your regex:

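use lazy_static::lazy_static;
use regex::{Captures, Regex};
use std::convert::TryFrom; // already in the prelude from the 2021 edition on

fn fix_unicode(s: &str) -> String {
    lazy_static! {
        // Matches references such as `&#x00E4;`; `n` captures the hex digits.
        static ref UNI: Regex = Regex::new(r"&#x(?P<n>[0-9A-Fa-f]+);").unwrap();
    }
    UNI.replace_all(s, |caps: &Captures| {
        // Parse the captured digits as a base-16 code point.
        let i = u32::from_str_radix(&caps["n"], 16).unwrap();
        // Panics on values that are not valid scalar values (e.g. surrogates).
        char::try_from(i).unwrap().to_string()
    }).into_owned()
}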
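
Note that the `unwrap` calls will panic if the input contains a reference that is too large or not a valid scalar value. If that can happen in your data, here is a rough sketch of a non-panicking variant that uses only the standard library and leaves anything it cannot decode untouched. The name `decode_hex_entities` is hypothetical, and it only handles the lowercase `&#x` prefix, like the regex above:

```rust
/// Replaces every `&#xHHHH;` reference in `input` with the character it names.
/// Anything that is not a valid reference is copied through unchanged.
/// (`decode_hex_entities` is an illustrative name, not part of any crate.)
fn decode_hex_entities(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut rest = input;
    while let Some(start) = rest.find("&#x") {
        // Copy the text before the candidate reference.
        out.push_str(&rest[..start]);
        let after = &rest[start + 3..];
        // A reference needs a terminating ';', hex digits only,
        // and a value that is a valid Unicode scalar value.
        let decoded = after.find(';').and_then(|end| {
            let hex = &after[..end];
            if hex.is_empty() || !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
                return None;
            }
            let code = u32::from_str_radix(hex, 16).ok()?;
            char::from_u32(code).map(|c| (c, end))
        });
        match decoded {
            Some((c, end)) => {
                out.push(c);
                // Continue after the ';'.
                rest = &after[end + 1..];
            }
            None => {
                // Not a valid reference: keep the "&#x" literally and move on.
                out.push_str("&#x");
                rest = after;
            }
        }
    }
    // Copy whatever follows the last reference.
    out.push_str(rest);
    out
}
```

With this, `decode_hex_entities("Syst. Verz. S&#x00E4;ug. II. 1845")` gives `Syst. Verz. Säug. II. 1845`, while something like `&#xD800;` (a surrogate) is passed through as-is instead of panicking.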
