Help parsing a string of Unicode escapes

Hello forum,

I'm trying to parse strings from my Facebook data dump for fun. Everything that isn't a Latin character is written as \u escapes (emojis too, but I'm not interested in those):

\u00d1\u0087\u00d0\u00b0\u00d1\u0081 \u00d0\u00b8 \u00d0\u00b4\u00d0\u00b5\u00d0\u00b2\u00d0\u00b5\u00d1\u0082 \u00d0\u00bc\u00d0\u00b8\u00d0\u00bd\u00d1\u0083\u00d1\u0082\u00d0\u00b8

This tool gives me the desired output, which is this text in Cyrillic:

час и девет минути

I'm sorry, but I'm unfamiliar with text encodings and standards. As far as I understand, this isn't valid UTF-8, so Rust's built-in String::from_utf8 results in gibberish like this: ÐеÑÑÑ. But I guess you can build chars from "\u{00d0}"?

Is there any way I can turn the above escapes into valid Cyrillic UTF-8? That's my only constraint.

Here's a quick and dirty function to decode your data:

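use std::error::Error;

// Decodes `\uXXXX` escapes where each escape stands for one raw byte (00..=ff),
// then interprets the resulting bytes as UTF-8.
fn unescape(s: &[u8]) -> Result<String, Box<dyn Error>> {
    let mut output = Vec::new();
    let mut i = 0;

    while i < s.len() {
        match s[i] {
            b'\\' => {
                i += 1;
                // `get` avoids a panic when the input ends with a lone backslash.
                match *s.get(i).ok_or("dangling backslash at end of input")? {
                    b'u' => {
                        // The four hex digits after `u`; error out if the input is truncated.
                        let hex = s.get(i + 1..i + 5).ok_or("truncated \\u escape")?;
                        // Parsing as u8 rejects anything above \u00ff.
                        let num = u8::from_str_radix(std::str::from_utf8(hex)?, 16)?;
                        output.push(num);
                        i += 4;
                    }
                    byte => output.push(byte),
                }
            },
            byte => output.push(byte),
        }
        i += 1;
    }
    Ok(String::from_utf8(output)?)
}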

Complete program on Playground.

This assumes that '\' followed by anything other than 'u' is just that literal character. For example, \\ will decode to a single \. You can easily add support for other escape codes.
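
In case it helps, here is a rough sketch of what adding a few more escape codes might look like. The name `unescape_ext` and the choice of \n, \t and \r are my own assumptions for illustration; I don't know which escapes the Facebook export actually uses, so adjust as needed. It keeps the same idea as above: every \u00XX becomes one raw byte, and the bytes are decoded as UTF-8 at the end.

```rust
use std::error::Error;

// Hypothetical variant of `unescape` that also understands \n, \t and \r.
// Assumption: every \uXXXX escape encodes a single byte (00..=ff).
fn unescape_ext(s: &[u8]) -> Result<String, Box<dyn Error>> {
    let mut output = Vec::new();
    let mut i = 0;

    while i < s.len() {
        if s[i] != b'\\' {
            // Ordinary byte: copy it through unchanged.
            output.push(s[i]);
            i += 1;
            continue;
        }
        // Byte after the backslash; a lone trailing backslash is an error.
        let escape = *s.get(i + 1).ok_or("dangling backslash at end of input")?;
        match escape {
            b'u' => {
                // Four hex digits follow the `u`.
                let hex = s.get(i + 2..i + 6).ok_or("truncated \\u escape")?;
                // Parsing as u8 fails for values above ff.
                output.push(u8::from_str_radix(std::str::from_utf8(hex)?, 16)?);
                i += 6;
            }
            b'n' => { output.push(b'\n'); i += 2; }
            b't' => { output.push(b'\t'); i += 2; }
            b'r' => { output.push(b'\r'); i += 2; }
            // Anything else stands for itself, so \\ becomes a single backslash.
            other => { output.push(other); i += 2; }
        }
    }
    // The collected bytes must form valid UTF-8.
    Ok(String::from_utf8(output)?)
}
```

Calling `unescape_ext(br"\u00d1\u0087\u00d0\u00b0\u00d1\u0081")` should give `Ok("час")`, while an escape above \u00ff, such as \u0100, comes back as an error rather than being silently truncated.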

You could probably write a much shorter version of this by using the regex crate.

This works on your input, using the unescape crate:

fn main() {
    let input = r"\u00d1\u0087\u00d0\u00b0\u00d1\u0081 \u00d0\u00b8 \u00d0\u00b4\u00d0\u00b5\u00d0\u00b2\u00d0\u00b5\u00d1\u0082 \u00d0\u00bc\u00d0\u00b8\u00d0\u00bd\u00d1\u0083\u00d1\u0082\u00d0\u00b8";
    let output = String::from_utf8(unescape::unescape(input).unwrap().chars().map(|c| c as u8).collect()).unwrap();
    println!("{}", output); // -> час и девет минути
}
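
For what it's worth, the `c as u8` step works because each \u00XX escape decodes to a char below U+0100, and that char's code point equals the original byte value. Below is a minimal std-only sketch of just that step, using a checked conversion instead of the truncating cast. The name `latin1_to_utf8` is made up for this example and is not part of the unescape crate.

```rust
use std::convert::TryFrom;

// Hypothetical helper: treat every char as a single byte (its code point),
// then decode the resulting bytes as UTF-8.
// Returns None if a char is above U+00FF or the bytes are not valid UTF-8.
fn latin1_to_utf8(s: &str) -> Option<String> {
    let bytes: Option<Vec<u8>> = s
        .chars()
        // u32::from(c) is the code point; try_from fails if it exceeds 255.
        .map(|c| u8::try_from(u32::from(c)).ok())
        .collect();
    String::from_utf8(bytes?).ok()
}
```

With this, `latin1_to_utf8("\u{d1}\u{87}\u{d0}\u{b0}\u{d1}\u{81}")` should return `Some("час")`, and a string that already contains real Cyrillic characters returns `None` instead of producing garbage.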