Get a Vec<u8> of UTF-8 bytes from a non-UTF-8 file

The file text.txt contains data that is not encoded in UTF-8 (it is in windows-1251).

text.txt

my text Зд

I want to get a vector u holding this data as UTF-8 bytes,
and I can do it like this:

use std::fs;
use encoding_rs::WINDOWS_1251;

fn main() {
   // read file as a Vector (windows-1251)
   let v = fs::read("/.../text.txt").expect("error on handling file");

   println!("{:?}", v.len());   // 10
   println!("{:?}", v);         // [109, 121, 32, 116, 101, 120, 116, 32, 199, 228]

   let (cow, _encoding_used, _had_errors) = WINDOWS_1251.decode(&v);
   println!("Decoded string: {}", cow);

   let mut u: Vec<u8> = vec![];
   for b in cow.bytes() { u.push(b) };

   println!("{:?}", u.len());   //  12
   println!("{:?}", u);         //  [109, 121, 32, 116, 101, 120, 116, 32, 208, 151, 208, 180]
}

This looks clumsy.
How do I do it properly?

This will give you the UTF-8 bytes of your string "my text Зд", so your code is behaving just as expected:

fn main() {
   println!("{:?}", "my text Зд".as_bytes()); // [109, 121, 32, 116, 101, 120, 116, 32, 208, 151, 208, 180]
}

If you want to go back from UTF-8 to windows-1251, encoding_rs can also encode Unicode text back into the legacy encodings it supports, e.g. via the method called “encode”.


I understand that.
It looks like I'm doing unnecessary work.
Can't the conversion be done directly while reading the file?

from
[109, 121, 32, 116, 101, 120, 116, 32, 199, 228]
to
[109, 121, 32, 116, 101, 120, 116, 32, 208, 151, 208, 180]

at the stage of reading from a file.
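
To make that byte-level change concrete, here is a hand-rolled sketch using only the standard library. It is not how encoding_rs works internally and it is not a replacement for it: the helper name decode_1251_subset is made up for this illustration, and it only handles ASCII plus the letters А..я, which windows-1251 stores at bytes 0xC0..=0xFF. Every other byte is turned into U+FFFD.

```rust
// Hypothetical helper, for illustration only: decodes ASCII and the basic
// Cyrillic letters А..я of windows-1251. All other bytes become U+FFFD.
fn decode_1251_subset(bytes: &[u8]) -> String {
    bytes
        .iter()
        .map(|&b| match b {
            // ASCII bytes are identical in windows-1251 and Unicode.
            0x00..=0x7F => b as char,
            // 0xC0..=0xFF map linearly onto U+0410 (А) ..= U+044F (я).
            0xC0..=0xFF => {
                std::char::from_u32(0x0410 + (b as u32 - 0xC0)).unwrap_or('\u{FFFD}')
            }
            // Not covered by this sketch (e.g. ё, punctuation in 0x80..=0xBF).
            _ => '\u{FFFD}',
        })
        .collect()
}

fn main() {
    // The windows-1251 bytes from the question.
    let v: [u8; 10] = [109, 121, 32, 116, 101, 120, 116, 32, 199, 228];
    // Decode to a String, then take its UTF-8 bytes.
    let u: Vec<u8> = decode_1251_subset(&v).into_bytes();
    println!("{:?}", u);
}
```

For real files, a complete decoder such as the one in encoding_rs is the safer choice, since it covers the whole code page.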

It is not at all clear what you’re trying to do, but on the question of “doing it right” in general, I’d suggest not working with the windows-1251 encoding at all if you can avoid it, and making sure your file (and text editor) use UTF-8 to begin with. How to set this up depends on which text editor you’re using.


In case you’re confused about the number of bytes, that’s because UTF-8 is a variable-length encoding in which every character outside the 128 ASCII characters requires more than one byte. Cyrillic characters usually need 2 bytes: two bytes are enough to represent the first 2048 Unicode code points, and the basic Cyrillic block lies between U+0400 and U+04FF, which in decimal is code points 1024 to 1279.

So in your output, the bytes 208, 151 are the character З and the bytes 208, 180 are д.
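
If you want to check this yourself, a small sketch like the following prints the code point, the UTF-8 length and the UTF-8 bytes of each character. It uses only the standard library; the helper name utf8_bytes is made up for this example.

```rust
// Hypothetical helper: returns the UTF-8 bytes of a single character.
fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4]; // a char needs at most 4 bytes in UTF-8
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

fn main() {
    for c in "mЗд".chars() {
        println!(
            "{:?}: U+{:04X}, {} byte(s), {:?}",
            c,
            c as u32,
            c.len_utf8(),
            utf8_bytes(c)
        );
    }
    // 'm': U+006D, 1 byte(s), [109]
    // 'З': U+0417, 2 byte(s), [208, 151]
    // 'д': U+0434, 2 byte(s), [208, 180]
}
```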


I’m not entirely sure what you’re trying to do.

Do you just want to obtain the Vec<u8> containing [109, 121, 32, 116, 101, 120, 116, 32, 208, 151, 208, 180] without the additional intermediate steps? That can be done.

The for b in cow.bytes() { u.push(b) }; loop is unnecessary anyway. You can just do cow.into_owned().into_bytes().
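
A minimal sketch of that conversion, with a plain Cow<str> standing in for the value returned by decode so that it runs without any external crate. The function name cow_to_bytes is only for illustration.

```rust
use std::borrow::Cow;

// Illustrative helper: turns a Cow<str> into its UTF-8 bytes without a loop.
// into_owned() yields a String (allocating only if the Cow was borrowed),
// and into_bytes() hands over that String's buffer as a Vec<u8>.
fn cow_to_bytes(cow: Cow<str>) -> Vec<u8> {
    cow.into_owned().into_bytes()
}

fn main() {
    // Stand-in for the decoded text from the question.
    let cow: Cow<str> = Cow::Borrowed("my text Зд");
    let u = cow_to_bytes(cow);
    println!("{:?}", u.len()); // 12
    println!("{:?}", u); // [109, 121, 32, 116, 101, 120, 116, 32, 208, 151, 208, 180]
}
```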

The intermediate Vec<u8> from the first file read can probably somehow be skipped, too. I haven’t looked into encoding_rs deeply enough to see what the best way to do this would be.


Get [109, 121, 32, 116, 101, 120, 116, 32, 208, 151, 208, 180] without the additional intermediate steps?

Yes.

encoding_rs says to use encoding_rs_io.

Looks like that would be something like

use std::io::Read; // brings read_to_string / read_to_end into scope

let file = fs::File::open("/.../text.txt").expect("error on handling file");
// `mut` is needed because reading takes `&mut self`
let mut decoder = encoding_rs_io::DecodeReaderBytesBuilder::new()
    .encoding(Some(WINDOWS_1251))
    .build(file);
let mut string = String::new();
decoder.read_to_string(&mut string).unwrap();

Or you can use read_to_end to get a Vec<u8> directly. I'm not sure whether this buffers internally (probably?), but if not, you should wrap the file in a BufReader.


Looks like it does buffer internally:

    pub fn build<R: io::Read>(&self, rdr: R) -> DecodeReaderBytes<R, Vec<u8>> {
        self.build_with_buffer(rdr, vec![0; 8 * (1 << 10)]).unwrap()
    }

Thank you.

Thanks, cool.
