If I parse everything except the {...} block and use serde to parse that, given that it's JSON, then body will not maintain the correct byte representation, because Strings in Rust are UTF-8: it will be [8, 1, 18, 195, 133] instead of [8, 1, 18, 197], because of the different representation of Å in ASCII and UTF-8. However, what I need is the same byte representation!
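A minimal sketch of the mismatch (assuming the body ends with the character Å):

```rust
fn main() {
    // A Rust String is always UTF-8, so Å is stored as two bytes.
    let s = "Å";
    assert_eq!(s.as_bytes(), &[195, 133]); // UTF-8: 0xC3 0x85

    // The single byte 197 (0xC5) is Å's value in Latin-1,
    // which Unicode reuses as the code point U+00C5.
    assert_eq!('Å' as u32, 197);
}
```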
I'd need to interpret the string as ASCII and then unescape it, but AsciiStr does not implement Deserialize, and if I use Vec<u8> as the field type, serde complains.
This is a [u8] in disguise, so it's arguable whether serde is right or wrong here, but is there a way to get the underlying bytes without rewriting my own deserializer and unescaper? Is it even possible to write a deserializer that reads the bytes of a JSON string using serde?
Ah, good to know. However, Å is not considered ASCII because its code point is above 127, so it's still not solved. Python quite happily knows the correct byte representation of it, though:
I think the title is very misleading because of this. When I read "escaped JSON ascii string", I assume the JSON only contains (potentially escaped) ASCII. In fact I'm not sure what you mean by your title. Why not just say "escaped JSON string"?
Because the problem lies in an (almost-)ASCII vs. UTF-8 mismatch, not in the escaping: specifically, in the binary representation of Å in that string. It might not be the best title, and I am open to other suggestions, but if you remove "ASCII" then there is no problem.
It’s "correct" if you assume a particular encoding. It’s not valid ASCII or valid UTF-8, but it happens to be correct in ISO-8859-1 and related encodings. It is also the Unicode code point for Å (Unicode intentionally copies the first 256 ISO-8859-1 codes), and that’s what Python gives, but that one’s not a byte value but a 32-bit one (Python, of course, doesn’t distinguish between different integer sizes). If you want to decode or encode ISO-8859-1 (which is mostly useful when interfacing with >20-year-old legacy systems), then you should use the appropriate libraries for that.
Good call on the ISO-8859-1, my locale is all en_US.UTF-8, so I don't know why Firefox and Python are using that.
Anyway, I can't make this work. If I read the JSON as bytes (from a file that I pasted the contents into) and decode it before deserializing, I get body.as_bytes() as [8, 1, 18, 195, 131, 226, 128, 166].
This is becoming less and less about Rust, but the fact is that in Python I can just paste the string, and when iterating over the chars and calling ord() on them I get the correct bytes.
Python's ord() isn't giving you bytes; it's giving you Unicode codepoints. (This can be seen in part by evaluating ord('☃') in Python, which gives you 9731, which is too large for a byte.) It just so happens that the codepoints in range 0x80..0x100 are the same characters in both Unicode and ISO-8859-1 (a.k.a. Latin-1).
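The same distinction exists in Rust: casting a char to u32 yields its Unicode code point, not a byte. A small sketch:

```rust
fn main() {
    // A Rust char is a Unicode scalar value, like a Python codepoint.
    assert_eq!('☃' as u32, 9731); // too large for a byte

    // For code points below 256, the Unicode value coincides
    // with the ISO-8859-1 (Latin-1) byte value.
    assert_eq!('Å' as u32, 0xC5);
}
```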
If you can assume that the "body" string you're trying to deserialize only contains codepoints less than 256, you can convert it to your desired bytes after String deserialization with body.chars().map(|ch| u8::try_from(ch as u32)).collect::<Result<Vec<u8>, _>>(), which will give you an Err if any character is outside of 0..256.
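As a runnable sketch of that conversion (the \u{...} escapes here stand in for the actual body contents from the question, which is an assumption):

```rust
fn main() {
    // Pretend this String came out of serde's normal String deserialization.
    let body = String::from("\u{8}\u{1}\u{12}Å");

    // Map each char's code point down to a byte; fails for code points > 255.
    let bytes = body
        .chars()
        .map(|ch| u8::try_from(ch as u32))
        .collect::<Result<Vec<u8>, _>>();

    // Every char here is below 256, so we get the Latin-1-style bytes back.
    assert_eq!(bytes.unwrap(), vec![8, 1, 18, 197]);

    // A char outside Latin-1 makes the conversion fail instead of mangling data.
    let bad: Result<Vec<u8>, _> = "☃".chars().map(|ch| u8::try_from(ch as u32)).collect();
    assert!(bad.is_err());
}
```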