I'm trying to parse strings from my Facebook data dump for fun. It encodes everything that isn't a Latin character as Unicode escapes (emojis too, but I'm not interested in those):

\u00d1\u0087\u00d0\u00b0\u00d1\u0081 \u00d0\u00b8 \u00d0\u00b4\u00d0\u00b5\u00d0\u00b2\u00d0\u00b5\u00d1\u0082 \u00d0\u00bc\u00d0\u00b8\u00d0\u00bd\u00d1\u0083\u00d1\u0082\u00d0\u00b8

This tool gives me the desired output, which is this text in Cyrillic:

час и девет минути

I'm sorry, but I'm unfamiliar with text encodings and standards. As far as I understand, this isn't valid UTF-8, so Rust's built-in String::from_utf8 results in gibberish like this: ÐеÑÑÑ. But I guess you can build chars from "\u{00d0}"?

Is there any way I can turn the above escapes into valid Cyrillic UTF-8? Those are my only constraints.

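Each \u00XX escape in your data holds one byte of the UTF-8 encoding, not a whole code point. Here's a quick and dirty function to decode your data by collecting those bytes and decoding them as UTF-8: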

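use std::error::Error;

fn unescape(s: &[u8]) -> Result<String, Box<dyn Error>> {
    let mut output = Vec::new();
    let mut i = 0;
    while i < s.len() {
        match s[i] {
            b'\\' => {
                i += 1;
                match s[i] {
                    b'u' => {
                        // The four hex digits after \u encode a single byte.
                        let num = u8::from_str_radix(std::str::from_utf8(&s[i+1..][..4])?, 16)?;
                        output.push(num);
                        i += 4;
                    }
                    byte => output.push(byte),
                }
            }
            byte => output.push(byte),
        }
        i += 1;
    }
    // The collected bytes are the real UTF-8 encoding of the text.
    Ok(String::from_utf8(output)?)
}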

Complete program on Playground.

This assumes that '\' followed by anything other than 'u' is just that literal character. For example, \\ will decode to a single \. You can easily add support for other escape codes.
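As a rough sketch of what adding other escape codes might look like, the variant below also maps \n, \t and \r to their control characters and returns an error instead of panicking on a dangling backslash or a truncated \u escape. The name unescape_ext and the set of supported escapes are assumptions for illustration, not part of the original function. Like the original, it only accepts \u escapes whose value fits in one byte, so a genuine code-point escape such as \u0447 is rejected.

```rust
use std::error::Error;

/// Hypothetical extension of `unescape`: handles \uXXXX byte escapes plus
/// the simple escapes \n, \t and \r (an assumed set). Any other escaped
/// byte, such as \\ or \", decodes to itself.
fn unescape_ext(s: &[u8]) -> Result<String, Box<dyn Error>> {
    let mut output = Vec::new();
    let mut i = 0;
    while i < s.len() {
        if s[i] != b'\\' {
            // Ordinary byte: copy it through unchanged.
            output.push(s[i]);
            i += 1;
            continue;
        }
        // Byte after the backslash; a lone trailing backslash is an error.
        let escape = *s.get(i + 1).ok_or("dangling backslash")?;
        match escape {
            b'u' => {
                // Expect exactly four hex digits after "\u".
                let digits = s.get(i + 2..i + 6).ok_or("truncated \\u escape")?;
                // Parsing into u8 fails for values above 0xFF.
                let byte = u8::from_str_radix(std::str::from_utf8(digits)?, 16)?;
                output.push(byte);
                i += 6;
            }
            other => {
                output.push(match other {
                    b'n' => b'\n',
                    b't' => b'\t',
                    b'r' => b'\r',
                    // Anything else stands for itself.
                    literal => literal,
                });
                i += 2;
            }
        }
    }
    // The collected bytes should now form valid UTF-8.
    Ok(String::from_utf8(output)?)
}
```

Calling unescape_ext(br"\u00d1\u0087") yields Ok("ч"), while malformed input surfaces as an Err that the caller can report.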

You could probably write a much shorter version of this by using the regex crate.


This works on your input, using the unescape crate:

fn main() {
    let input = r"\u00d1\u0087\u00d0\u00b0\u00d1\u0081 \u00d0\u00b8 \u00d0\u00b4\u00d0\u00b5\u00d0\u00b2\u00d0\u00b5\u00d1\u0082 \u00d0\u00bc\u00d0\u00b8\u00d0\u00bd\u00d1\u0083\u00d1\u0082\u00d0\u00b8";
    // Each escape decodes to a char in U+0000..=U+00FF; casting to u8 recovers the raw UTF-8 bytes.
    let bytes: Vec<u8> = unescape::unescape(input).unwrap().chars().map(|c| c as u8).collect();
    let output = String::from_utf8(bytes).unwrap();
    println!("{}", output); // -> час и девет минути
}
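One caveat with the cast above: c as u8 silently truncates any char above U+00FF, so input that mixes real Cyrillic with the byte-style escapes would produce wrong bytes rather than an error. A minimal std-only sketch of a checked version of the same char-to-byte step follows. The helper name latin1_chars_to_utf8 is hypothetical, and it assumes the escapes have already been turned into chars, for example by a JSON parser or the unescape crate.

```rust
use std::convert::TryFrom;

/// Hypothetical helper: treat every char of `s` as one raw byte, then decode
/// those bytes as UTF-8. Returns None if any char is above U+00FF or if the
/// resulting bytes are not valid UTF-8.
fn latin1_chars_to_utf8(s: &str) -> Option<String> {
    let bytes = s
        .chars()
        // Checked conversion: chars that do not fit in a byte yield None.
        .map(|c| u8::try_from(u32::from(c)).ok())
        .collect::<Option<Vec<u8>>>()?;
    String::from_utf8(bytes).ok()
}
```

With this, latin1_chars_to_utf8("\u{d1}\u{87}") returns Some("ч"), and a string that already contains proper Cyrillic returns None instead of garbage.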

