How to write! a file with text in UTF-16 LE BOM encoding?

I'm trying to generate a file that should be UTF-16 encoded.

According to Notepad++, these files have this encoding:

UTF-16 LE BOM

At the moment, I have this:

        let path = &self.filename;
        let mut output = File::create(path)?;
        writeln!(output,"blabla m³")?;

When I open my generated files in the viewer, they get some weird characters like so:

blabla m³

So basically, I want to specify the encoding with writeln! but so far haven't found a way to do that.

Rust expects all strings to be UTF-8. If you need another encoding, convert the string to the appropriate raw byte representation. You can use, e.g., the encoding_rs crate for that.

The standard library also has functions for converting into UTF-16, but they're somewhat annoying to use since you have to convert from u16 into bytes yourself.
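
As a rough sketch of that standard-library route: str::encode_utf16 yields u16 code units, and u16::to_le_bytes turns each one into two little-endian bytes. The helper name utf16le_bytes is made up for illustration, and the sketch does not add a BOM.

```rust
/// Encodes `text` as UTF-16 little-endian bytes, without a BOM.
/// (`utf16le_bytes` is a hypothetical helper name, not a std function.)
fn utf16le_bytes(text: &str) -> Vec<u8> {
    text.encode_utf16()
        // Each u16 code unit becomes two bytes, low byte first.
        .flat_map(|unit| unit.to_le_bytes())
        .collect()
}
```

Characters outside the Basic Multilingual Plane come out of encode_utf16 as surrogate pairs, so they occupy four bytes instead of two.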

I'm not sure about this approach: write expects a [u8], but I only have individual u8 values after encoding.

Any suggestions?

        let bytes_utf16: Vec<u8> = UTF_16LE.encode(oh_my, EncoderTrap::Strict).unwrap();
        for byte in &bytes_utf16{            
            output.write(byte); // This doesn't work
        }

mismatched types

expected slice [u8], found u8

You have the &[u8] right at your disposal. You can just borrow the vector itself (to apply an implicit deref coercion) or explicitly call bytes_utf16.as_slice(). Even if you didn't know that, you could trivially rewrite the loop body as output.write(&[*byte]), although that would likely be terribly inefficient.

By the way, you likely don't want write() but write_all() since write() doesn't guarantee that it will write the entire buffer in one go. You also shouldn't ignore the Result return value of any I/O methods you are calling.
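
A minimal sketch of that advice, assuming the encoded bytes are already in a Vec<u8>. The function name write_encoded is hypothetical, and the writer is generic so the same code works for a File or an in-memory buffer.

```rust
use std::io::{self, Write};

/// Writes the whole encoded buffer at once instead of looping over single bytes.
/// (`write_encoded` is a hypothetical name used only for this sketch.)
fn write_encoded<W: Write>(output: &mut W, bytes_utf16: Vec<u8>) -> io::Result<()> {
    // `&bytes_utf16` is a `&Vec<u8>`; deref coercion turns it into the
    // `&[u8]` that `write_all` expects. `bytes_utf16.as_slice()` is equivalent.
    // The `?` propagates any I/O error instead of silently dropping the Result.
    output.write_all(&bytes_utf16)?;
    Ok(())
}
```

Unlike write, write_all keeps writing until the whole slice has been consumed or an error occurs.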

I managed to generate a text file that is "UTF-16 little endian" encoded (according to Notepad++).

The code below first encodes the source to UTF-16 (a sequence of u16 code units), which is then written as bytes into a file (using bincode, because I can't get it working with &[*byte]).

use bincode;
use std::fs::File;
use std::io::{Write};
use std::ffi::OsString;
use anyhow::{ Result};

fn main() -> Result<()> {
    let source = String::from("³°✨");
    let path = OsString::from("out.txt");
    let mut output = File::create(path)?;
    output.write(&[255,254])?;    // the BOM part
    for utf16 in source.encode_utf16() {        
        output.write(&(bincode::serialize(&utf16).unwrap()))?;
    }    
    Ok(())
}

You literally just have to say output.write_all(&the_byte_buffer).

The BOM also has to be written with write_all.

It looks like encoding to UTF-16LE is not (yet) implemented in encoding_rs.

    let (encoded, enc, _) = UTF_16LE.encode(source);
    println!("{:?}",enc);

prints out that it used UTF-8.

This is also what the documentation mentions and what you see if you open the txt file in Notepad++.

I got rid of the bincode stuff and am now using to_le_bytes:

    // this is slow, but has the right encoding
    output.write_all(&[0xFF, 0xFE])?; // the BOM part    
    for utf16 in my_string.encode_utf16() {              
      output.write_all(&utf16.to_le_bytes())?;
    }    

The only problem is that this method is pretty slow (for very large files).

You can wrap output in a BufWriter, which will group together the writes to minimize syscall overhead. Once you do that, that's about as fast as it can get without a custom implementation.
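
A minimal sketch of the BufWriter suggestion, reusing the BOM and to_le_bytes loop from the post above. The function name write_utf16le_with_bom is made up for illustration, and BufWriter's default buffer size is assumed to be good enough.

```rust
use std::io::{self, BufWriter, Write};

/// Writes `text` as UTF-16 LE with a leading BOM, buffering the small writes.
/// (`write_utf16le_with_bom` is a hypothetical name used only for this sketch.)
fn write_utf16le_with_bom<W: Write>(output: W, text: &str) -> io::Result<()> {
    // BufWriter collects the many two-byte writes in memory and hands them
    // to the underlying writer in larger chunks.
    let mut writer = BufWriter::new(output);
    writer.write_all(&[0xFF, 0xFE])?; // the BOM part
    for unit in text.encode_utf16() {
        writer.write_all(&unit.to_le_bytes())?;
    }
    // Flush explicitly so that a late I/O error is reported rather than
    // being swallowed when the BufWriter is dropped.
    writer.flush()
}
```

With a file this could be called as write_utf16le_with_bom(File::create("out.txt")?, "blabla m³")?.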

Thank you!

  • Without converting to UTF-16 and without BufWriter: < 1 sec
  • Converting to UTF-16 (to_le_bytes) and without BufWriter: 5-10 sec
  • Converting to UTF-16 (to_le_bytes) and using BufWriter: < 1 sec