You're getting an error because latin1 is not compatible with UTF-8. It is a single-byte encoding: each non-ASCII character is stored as one byte at or above 0x80, whereas UTF-8 encodes those characters as multi-byte sequences, so non-ASCII bytes in latin1-encoded text cannot be correctly decoded as UTF-8. The easiest option available to you is to read your data as raw bytes and then transcode it from latin1 to UTF-8. Any byte less than 0x80 is ASCII and therefore translates as is. Other bytes need a translation table, for example: https://www.unicode.org/charts/PDF/U0080.pdf
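For a sense of what rolling it by hand looks like, here is a minimal sketch using only the standard library. It assumes the input really is ISO-8859-1, where every byte value equals the Unicode code point of the character it represents; Windows-1252 assigns different characters to 0x80-0x9F, so this would not be correct for it. The function name latin1_to_string is made up for illustration.

```rust
/// Hypothetical helper: decode ISO-8859-1 bytes into a UTF-8 `String`.
/// Assumes strict ISO-8859-1, where byte N maps to code point U+00NN.
fn latin1_to_string(bytes: &[u8]) -> String {
    // `char::from(u8)` yields the code point with the same numeric value,
    // and collecting into a `String` encodes each `char` as UTF-8.
    bytes.iter().map(|&b| char::from(b)).collect()
}

fn main() {
    // "foo©bar\n" as latin1: the copyright sign is the single byte 0xA9.
    let raw = b"foo\xA9bar\n";
    let text = latin1_to_string(raw);
    println!("string: {:?}", text);
    println!(" bytes: {:X?}", text.as_bytes());
}
```

This reads the whole input into memory first, which is fine for small files but is not a streaming solution.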
If you don't want to roll it by hand (and you probably shouldn't), then you'll want to use an encoding library to do it for you. There are two prominent crates you can choose from: encoding and encoding_rs. If you use the latter, then you can use encoding_rs_io to get simple streaming decoding. For example, consider this program:
use std::fs::File;
use std::io::{self, Read};

use encoding_rs::WINDOWS_1252;
use encoding_rs_io::DecodeReaderBytesBuilder;

fn main() -> Result<(), io::Error> {
    let mut rdr = DecodeReaderBytesBuilder::new()
        .encoding(Some(WINDOWS_1252))
        .build(File::open("data.txt")?);
    let mut string = String::new();
    // This is guaranteed to never return a UTF-8 decoding error since the
    // transcoding guarantees that its output is valid UTF-8.
    rdr.read_to_string(&mut string)?;
    println!("string: {:?}", string);
    println!(" bytes: {:X?}", string.as_bytes());
    Ok(())
}
With these dependencies:
[dependencies]
encoding_rs = "0.8"
encoding_rs_io = "0.1.4"
and this data file:
$ cat data.txt
foo©bar
$ xxd data.txt
00000000: 666f 6fa9 6261 720a foo.bar.
its output is:
$ cargo run
string: "foo©bar\n"
 bytes: [66, 6F, 6F, C2, A9, 62, 61, 72, A]
What ends up happening is that the © symbol is encoded as the single byte \xA9 in latin1, but the transcoding process converts it to \xC2\xA9, which is the UTF-8 representation of the same character.
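To see why the untranscoded bytes are rejected, here is a small sketch using only the standard library. It checks the same bytes as data.txt, before and after transcoding, with std::str::from_utf8. The helper name first_invalid_offset is made up for illustration.

```rust
use std::str;

/// Hypothetical helper: byte offset of the first invalid UTF-8 sequence,
/// or `None` if the whole slice is valid UTF-8.
fn first_invalid_offset(bytes: &[u8]) -> Option<usize> {
    // `Utf8Error::valid_up_to` reports how many leading bytes were valid.
    str::from_utf8(bytes).err().map(|e| e.valid_up_to())
}

fn main() {
    // The file contents as latin1: a lone 0xA9 is a UTF-8 continuation
    // byte with no lead byte in front of it, so validation fails there.
    let latin1 = b"foo\xA9bar\n";
    // The same text as UTF-8, where the copyright sign is 0xC2 0xA9.
    let utf8 = "foo©bar\n".as_bytes();

    println!("latin1: {:?}", first_invalid_offset(latin1));
    println!("utf8:   {:?}", first_invalid_offset(utf8));
}
```

The latin1 input is rejected at offset 3, the position of the \xA9 byte, while the transcoded bytes validate cleanly.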
Caveat emptor: I've assumed that you mean Windows-1252 by mentioning Latin1, but this isn't necessarily true. If you really did mean ISO-8859-1, then you'll need to use the encoding crate since encoding_rs does not support ISO-8859-1. (encoding_rs is scoped to dealing with the Encoding Standard, which is focused on the web. But it's still useful for things outside of the web when the use cases align.)
EDIT: As @hsivonen points out below, the encoding_rs crate does have functions for converting latin1 to UTF-8 in its mem submodule, e.g., convert_latin1_to_str, although you cannot use those with encoding_rs_io.