Unicode escape characters from a file

Hello! I am a complete newbie, so this might be trivial, but I just can't figure it out and wasn't lucky with my searches either.

I have an input file containing a number of random texts. These texts also include hex escapes (see the example below).
My goal would be to convert these with regex to the correct unicode escape format: \u{0000}
E.g:
File content is the following.
"\x27"

The idea is that if I have this:

let mystr = String::from("\u{0027}");
println!("{}", mystr);

It will print out a single apostrophe (').
My code looks like this:

use regex::{Regex, Captures};
fn text_parser(i: &String) -> usize {
    let mut l = i.lines().next().unwrap();
    let re = regex::Regex::new(r"x([a-zA-Z0-9]{2})").unwrap();    
    let s = re.replace_all(l, |caps: &Captures| format!("{}{}{}", "u{00", &caps[1], "}"));

This almost works, but the result is the literal sequence of characters "\u{0027}" instead of the actual Unicode character.

How can I make sure that the resulting string is a valid unicode escape?

You can create a char from its Unicode value via char::from_u32, and you can parse hex-encoded integers to u32 via u32::from_str_radix.
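A minimal sketch of how those two functions could fit together. It uses only the standard library instead of the regex crate so that it stays self-contained; the function name decode_hex_escapes is made up for this example, and the same parse-then-convert step could just as well sit inside the replace_all closure from the question.

```rust
/// Replace every `\xNN` escape in `input` with the character whose
/// Unicode scalar value is 0xNN. Everything else is copied unchanged.
/// (Hypothetical helper, not part of any library.)
fn decode_hex_escapes(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut rest = input;
    while let Some(pos) = rest.find("\\x") {
        // Copy the text in front of the escape.
        out.push_str(&rest[..pos]);
        let after = &rest[pos + 2..];
        // Accept the escape only if it is followed by exactly two hex digits.
        let decoded = after
            .get(..2)
            .filter(|h| h.bytes().all(|b| b.is_ascii_hexdigit()))
            .and_then(|h| u32::from_str_radix(h, 16).ok())
            .and_then(char::from_u32);
        match decoded {
            Some(c) => {
                out.push(c);
                rest = &after[2..];
            }
            None => {
                // Not a valid escape: keep the `\x` as literal text.
                out.push_str("\\x");
                rest = after;
            }
        }
    }
    out.push_str(rest);
    out
}

fn main() {
    // Same content as the example file line, including the quotes.
    let line = r#""\x27""#;
    println!("{}", decode_hex_escapes(line));
}
```

The point is that the char is pushed into the output string directly, rather than writing out the text of a \u{...} escape, which only has meaning inside Rust source code.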

Note that with your general approach, you will not be able to express many Unicode characters as escapes: each hex-escaped byte is interpreted on its own as a character from the extended ASCII range (the first 256 Unicode scalar values). If the goal is that all characters should be representable via escapes in the input file, you first need to specify which encoding is expected, and possibly also how invalid encodings should be handled.
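To illustrate the encoding question, here is a sketch that assumes the escapes are meant as UTF-8 bytes. That is only one possible choice, and the thread does not say which encoding the file actually uses. It collects the escaped bytes first and decodes them together, reporting invalid sequences as an error; decode_hex_escapes_utf8 is a hypothetical name.

```rust
use std::string::FromUtf8Error;

/// Turn each `\xNN` escape into a raw byte, then interpret the whole
/// result as UTF-8. Assumes UTF-8 is the intended encoding.
/// Returns an error if the resulting bytes are not valid UTF-8.
fn decode_hex_escapes_utf8(input: &str) -> Result<String, FromUtf8Error> {
    let src = input.as_bytes();
    let mut bytes = Vec::with_capacity(src.len());
    let mut i = 0;
    while i < src.len() {
        if i + 4 <= src.len() && src[i] == b'\\' && src[i + 1] == b'x' {
            let hi = (src[i + 2] as char).to_digit(16);
            let lo = (src[i + 3] as char).to_digit(16);
            if let (Some(hi), Some(lo)) = (hi, lo) {
                // Two hex digits always fit in a single byte.
                bytes.push((hi * 16 + lo) as u8);
                i += 4;
                continue;
            }
        }
        // Not an escape: copy the byte through unchanged.
        bytes.push(src[i]);
        i += 1;
    }
    String::from_utf8(bytes)
}

fn main() {
    // 0xC3 0xA9 is the UTF-8 encoding of 'é'.
    match decode_hex_escapes_utf8(r"caf\xc3\xa9") {
        Ok(s) => println!("{}", s),
        Err(e) => println!("invalid UTF-8: {}", e),
    }
}
```

With the per-byte approach, the same input would instead come out as "cafÃ©", because 0xC3 and 0xA9 would each be mapped to their own character.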


Thanks a lot, and for the additional info as well! u32::from_str_radix is what I was looking for.
