Reading Latin1 ASCII chars from a binary file

#1

Hi all,

I have to read a binary file containing lots of null-terminated strings whose characters are encoded in Latin-1 (accented French characters).

String::from_utf8 returns an error when it meets those characters, and the lossy version replaces them with the � (U+FFFD) replacement character.

How to deal with this issue?

Thanks a lot for your help.

0 Likes

#2

You’re getting an error because latin1 is not compatible with UTF-8. It is a single-byte encoding, which means any non-ASCII byte in latin1-encoded text is not valid UTF-8. The easiest option available to you is to read your data as raw bytes and then transcode it from latin1 to UTF-8. Any byte less than 0x80 is ASCII and translates as is; other bytes need a translation table, for example: https://www.unicode.org/charts/PDF/U0080.pdf
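If the data really is ISO-8859-1, the “translation table” is trivial, because every ISO-8859-1 byte value equals its Unicode code point. Here is a minimal std-only sketch (the helper names `latin1_to_string` and `split_nul_terminated` are my own, not from any crate) that also handles the null terminators you mentioned:

```rust
/// Decode ISO-8859-1 bytes into a String.
/// In ISO-8859-1, every byte value is exactly its Unicode code point,
/// so `u8 as char` performs the decoding directly.
fn latin1_to_string(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

/// Split a buffer of null-terminated Latin-1 strings into owned Strings.
fn split_nul_terminated(data: &[u8]) -> Vec<String> {
    data.split(|&b| b == 0)
        .filter(|s| !s.is_empty()) // drop the empty slice after a trailing NUL
        .map(latin1_to_string)
        .collect()
}

fn main() {
    // "été\0noël\0" in Latin-1: é = 0xE9, ë = 0xEB
    let raw = [0xE9, 0x74, 0xE9, 0x00, 0x6E, 0x6F, 0xEB, 0x6C, 0x00];
    println!("{:?}", split_nul_terminated(&raw));
}
```

Note this shortcut is only correct for actual ISO-8859-1; Windows-1252 reuses the 0x80–0x9F range for printable characters, so for that encoding you need a real table or a library, as below.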

If you don’t want to roll it by hand—you probably shouldn’t—then you’ll want to use an encoding library to do it for you. There are two prominent crates you can choose from to do this, encoding and encoding_rs. If you use the latter, then you can use encoding_rs_io to get simple streaming decoding. For example, consider this program:

use std::io::{self, Read};
use std::fs::File;

use encoding_rs::WINDOWS_1252;
use encoding_rs_io::DecodeReaderBytesBuilder;

fn main() -> Result<(), io::Error> {
    let mut rdr = DecodeReaderBytesBuilder::new()
        .encoding(Some(WINDOWS_1252))
        .build(File::open("data.txt")?);
    let mut string = String::new();
    // This is guaranteed to never return a UTF-8 decoding error since the
    // transcoding guarantees that its output is valid UTF-8.
    rdr.read_to_string(&mut string)?;

    println!("string: {:?}", string);
    println!(" bytes: {:X?}", string.as_bytes());
    Ok(())
}

With these dependencies

[dependencies]
encoding_rs = "0.8"
encoding_rs_io = "0.1.4"

and this data file:

$ cat data.txt
foo©bar
$ xxd data.txt
00000000: 666f 6fa9 6261 720a                      foo.bar.

its output is:

$ cargo run
string: "foo©bar\n"
 bytes: [66, 6F, 6F, C2, A9, 62, 61, 72, A]

What ends up happening is that the © symbol is encoded as the single byte \xA9 in latin1, but the transcoding process converts it to \xC2\xA9, which is the UTF-8 encoding of the same character.
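You can check that byte-level transformation with plain std, no crates required; a quick sketch:

```rust
fn main() {
    // In Latin-1, © is the single byte 0xA9. Decoding it with
    // `u8 as char` yields U+00A9…
    let decoded = 0xA9u8 as char;
    assert_eq!(decoded, '©');

    // …and re-encoding that char as UTF-8 produces the two bytes
    // \xC2\xA9, matching the transcoded output above.
    let mut buf = [0u8; 4];
    let utf8 = decoded.encode_utf8(&mut buf);
    assert_eq!(utf8.as_bytes(), &[0xC2, 0xA9]);
}
```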

Caveat emptor: I’ve assumed that you mean Windows-1252 by mentioning Latin1, but this isn’t necessarily true. If you really did mean ISO-8859-1, then you’ll need to use the encoding crate since encoding_rs does not support ISO-8859-1. (encoding_rs is scoped to dealing with the Encoding Standard, which is focused on the web. But it’s still useful for things outside of the web when the use cases align.) EDIT: As @hsivonen points out below, the encoding_rs crate does have functions for converting latin1 to UTF-8 in its mem submodule, e.g., convert_latin1_to_str, although you cannot use those with encoding_rs_io.

2 Likes

#3

@BurntSushi

Thanks a lot for the hint! I’ll dig into the details today if I’ve got time.

I’ll keep you posted.

0 Likes

#4

Seems like a good guess given the original use case described.

Note that encoding_rs::mem has slice-oriented conversion functions for non-windows-1252 actual ISO-8859-1. See functions whose name starts with convert_latin1_ in the mem module.

1 Like