Reading UTF-16 Text from a u8 Slice

I am trying to read UTF-16 encoded text from a &[u8], preferably without unsafe type casting or extra allocations, but I am not sure how to accomplish this. The following code does it completely manually (without handling surrogate pairs):

fn read_string(slice: &[u8], size: usize) -> Option<String> {
    let mut ret = String::with_capacity(size);

    for i in 0..size {
        let c = read_u16(slice, i * 2);
        ret.push(char::try_from(c as u32).ok()?);
    }

    Some(ret)
}

Do you have a suggestion how to do it better?

You should handle surrogate pairs for non-BMP characters. Ideally you should also parse the BOM to determine whether the text is UTF-16BE or UTF-16LE. In practice, though, Windows hasn't supported big-endian machines, so virtually no documents are written in UTF-16BE, and just making sure the input is UTF-16LE should be enough.
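
As a rough sketch of that advice using only the standard library: check for a BOM, fall back to UTF-16LE when there is none (an assumption that follows the reasoning above), and let char::decode_utf16 combine the surrogate pairs. The name decode_utf16_bytes is made up for illustration.

```rust
use std::char::decode_utf16;

/// Decodes UTF-16 bytes, honoring a leading BOM if one is present.
/// Assumption: input without a BOM is treated as UTF-16LE.
fn decode_utf16_bytes(bytes: &[u8]) -> Option<String> {
    // Pick the byte order from the BOM (U+FEFF) and skip past it.
    let (big_endian, payload) = match bytes {
        [0xFE, 0xFF, rest @ ..] => (true, rest),
        [0xFF, 0xFE, rest @ ..] => (false, rest),
        _ => (false, bytes),
    };

    // An odd number of bytes cannot hold a whole number of code units.
    if payload.len() % 2 != 0 {
        return None;
    }

    let units = payload.chunks_exact(2).map(|pair| {
        let pair = [pair[0], pair[1]];
        if big_endian {
            u16::from_be_bytes(pair)
        } else {
            u16::from_le_bytes(pair)
        }
    });

    // decode_utf16 pairs up surrogates; an unpaired one yields an Err.
    decode_utf16(units).collect::<Result<String, _>>().ok()
}
```

Since this reads the bytes pairwise, the alignment of the input slice does not matter.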

I'd just use <[u8]>::align_to::<u16>() then String::from_utf16():

fn read_string(bytes: &[u8]) -> Option<String> {
    let (front, slice, back) = unsafe {
        bytes.align_to::<u16>()
    };
    if front.is_empty() && back.is_empty() {
        String::from_utf16(slice).ok()
    } else {
        None
    }
}

I would probably go for the decode_utf16 function.

fn read_string(slice: &[u8], size: usize) -> Option<String> {
    assert!(2*size <= slice.len());
    let iter = (0..size)
        .map(|i| u16::from_be_bytes([slice[2*i], slice[2*i+1]]));

    decode_utf16(iter).collect::<String>().ok()
}

This is also what String::from_utf16 uses internally, but avoids issues with unaligned byte arrays.

The align_to approach will randomly fail, though, as a &[u8] is usually not 2-byte aligned.
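
To make the alignment concern concrete, here is a small sketch. The Aligned wrapper, is_u16_aligned, and demo are hypothetical helpers for this illustration only; read_string is the align_to version from earlier in the thread. A view that starts one byte into an aligned buffer has an even length but an odd address, so the align_to check rejects it.

```rust
/// The align_to-based approach from earlier in the thread, same logic.
fn read_string(bytes: &[u8]) -> Option<String> {
    // SAFETY: every bit pattern is a valid u16.
    let (front, slice, back) = unsafe { bytes.align_to::<u16>() };
    if front.is_empty() && back.is_empty() {
        String::from_utf16(slice).ok()
    } else {
        None
    }
}

/// Hypothetical helper type: forces its byte array onto a 2-byte boundary.
#[repr(align(2))]
struct Aligned([u8; 8]);

/// Returns true if the slice starts on a 2-byte boundary.
fn is_u16_aligned(bytes: &[u8]) -> bool {
    bytes.as_ptr() as usize % 2 == 0
}

/// Reports the alignment of the full buffer, the alignment of a view that
/// starts one byte in, and what read_string makes of that shifted view.
fn demo() -> (bool, bool, Option<String>) {
    let buf = Aligned(*b"h\0i\0!\0?\0");
    let shifted = &buf.0[1..7]; // six bytes, but starting at an odd address
    (is_u16_aligned(&buf.0), is_u16_aligned(shifted), read_string(shifted))
}
```

Whether a given &[u8] from I/O ends up like buf.0 or like shifted depends on where the caller sliced it, which is why the failure looks random.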

Thanks, that did it. I had to change the type of the collect function to get it compiling:

pub fn read_string(slice: &[u8], size: usize) -> Option<String> {
    assert!(2*size <= slice.len());
    let iter = (0..size)
        .map(|i| u16::from_be_bytes([slice[2*i], slice[2*i+1]]));

    std::char::decode_utf16(iter).collect::<Result<String, _>>().ok()
}
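
A quick way to sanity-check the function above, with sample inputs made up for illustration. Note that size counts 16-bit code units rather than bytes, and that from_be_bytes means the input is assumed to be big-endian.

```rust
use std::char::decode_utf16;

// Same logic as the function above; `size` is in code units, not bytes.
pub fn read_string(slice: &[u8], size: usize) -> Option<String> {
    assert!(2 * size <= slice.len());
    let iter = (0..size).map(|i| u16::from_be_bytes([slice[2 * i], slice[2 * i + 1]]));
    decode_utf16(iter).collect::<Result<String, _>>().ok()
}

// Hypothetical sample inputs.
const HI_BE: [u8; 4] = [0x00, b'h', 0x00, b'i']; // "hi" as UTF-16BE
const HI_LE: [u8; 4] = [b'h', 0x00, b'i', 0x00]; // "hi" as UTF-16LE
const GRIN_BE: [u8; 4] = [0xD8, 0x3D, 0xDE, 0x00]; // U+1F600 as a surrogate pair
```

With these, read_string(&HI_BE, 2) gives Some("hi") and read_string(&GRIN_BE, 2) gives the single character U+1F600, while read_string(&GRIN_BE, 1) is None because the pair is cut in half. Feeding it HI_LE does not fail: it silently decodes to two unrelated characters, so for little-endian data (the common case on Windows) from_le_bytes would be the call to use instead.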

If it's really UTF-16, it should be aligned.

Can you help me understand why it should be aligned just because it's UTF-16?

If, say, we read it from a file, what would happen if we passed a misaligned buffer to read()? Or if we read it from a socket, say, an HTTP body, what would happen if the UTF-8-encoded header were an odd number of bytes?

"Should" in the sense of "I'd certainly expect it to be" and "it's good practice". Not in the sense that "it probably is" or "it can't possibly be misaligned".

Of course you can't force any arbitrary byte buffer to be aligned to 2-byte boundaries. However, I was arguing that when producing a buffer intended to hold UTF-16 data, one should ensure that it is indeed aligned, because its semantics and the probable uses of its contents most likely require that, or at least work best with it.

It's not hard to do, either: in the worst case, by allocating one more byte than necessary, it's always possible to slice the resulting allocation so that its starting address is 2-aligned. However, most allocators already return 8 or even 16-byte-aligned buffers anyway.
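
A minimal sketch of that over-allocation trick, assuming the caller has allocated at least len + 1 bytes. The function name aligned_window is hypothetical.

```rust
/// Returns a 2-byte-aligned window of `len` bytes inside `storage`.
/// Assumption: `storage` holds at least `len + 1` bytes.
fn aligned_window(storage: &mut [u8], len: usize) -> &mut [u8] {
    // Skip one byte if the storage happens to start at an odd address.
    let skip = storage.as_ptr() as usize % 2;
    &mut storage[skip..skip + len]
}
```

For example, with let mut storage = vec![0u8; 9]; the call aligned_window(&mut storage, 8) yields eight bytes that start at an even address, whatever address the allocator returned.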

I don't follow your argument about the UTF-8 encoded header with an odd length. A UTF-8 byte sequence is not a valid UTF-16 sequence of 16-bit integers. When one reinterprets b"xy" as a (single-element) sequence of 16-bit integers, one does not obtain the UTF-16 encoded representation of the string "xy".
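
The b"xy" point can be checked directly. In this sketch (both function names are made up for illustration, and little-endian order is assumed), the two bytes are read as one code unit and decoded, and separately the string "xy" is encoded to the bytes a UTF-16 reader would actually expect.

```rust
/// Reinterprets two bytes as one little-endian UTF-16 code unit and decodes it.
fn reinterpret_as_utf16(bytes: [u8; 2]) -> Option<String> {
    let unit = u16::from_le_bytes(bytes);
    String::from_utf16(&[unit]).ok()
}

/// Encodes text as UTF-16LE bytes.
fn to_utf16le_bytes(text: &str) -> Vec<u8> {
    text.encode_utf16()
        .flat_map(|unit| unit.to_le_bytes())
        .collect()
}
```

reinterpret_as_utf16(*b"xy") produces the single character U+7978, not "xy", while to_utf16le_bytes("xy") produces the four bytes 0x78, 0x00, 0x79, 0x00.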

I would expect that the primary reason to have a buffer of UTF-16 in a &[u8] would be I/O, and I wouldn't expect a buffer from a socket library, a memory-mapped file, or the like to be properly aligned.

(If I knew it was going to be UTF-16, I'd probably have a buffer of &[u16].)

At the risk of stating the obvious (a habit), have you considered using the encoding_rs crate for this? It might be more efficient (one less copy?) and has a clear notion of UTF-16BE vs UTF-16LE, surrogate handling, and optional byte order mark (BOM) sniffing.
