Converting between char, [u8;2] and u32

Hi, I'm working on Exercism, trying to reverse a string.

For the sake of the exercise, I split the string into a u8 array and implement the Iterator trait, trying to detect UTF-8 code points. However, "disassembling" and "reassembling" a UTF-8 char does not work:

fn main() {
    let a = "Ü";

    let byte_array = a.as_bytes();
    println!("byte_array: {:?}", byte_array);
    // [0xc3, 0x9c]

    let integer: u32 = (0xc3 << 8) + 0x9c;
    println!("integer: 0x{:x}", integer);

    let codepoint = std::char::from_u32(integer).unwrap();
    println!("codepoint: '{}'", codepoint);

    // byte order problem?
    let integer_swapped = (integer >> 8) | ((integer & 0xff) << 8);
    println!("ingteger swapped: 0x{:x}", integer_swapped);

    let codepoint2 = std::char::from_u32(integer_swapped).unwrap();
    println!("codepoint2: '{}'", codepoint2);
}

It yields:

byte_array: [195, 156]
integer: 0xc39c
codepoint: '쎜'
integer swapped: 0x9cc3
codepoint2: '鳃'

What am I doing wrong? Why can't I construct a UTF-8 char from its binary value?

Some bits in UTF-8 are not part of the code point but are part of the encoding itself. Please see the table at the beginning of the Description section of the Wikipedia article: https://en.m.wikipedia.org/wiki/UTF-8#Description

You need to take only the "x" bits into account when decoding manually. {:08b} formatting will be helpful when debugging :slight_smile:
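
For instance, printing the two bytes of "Ü" in binary makes those marker bits visible (a quick sketch):

fn main() {
    for byte in "Ü".as_bytes() {
        // prints 11000011 and 10011100: the 110 prefix marks the lead
        // byte of a two-byte sequence, the 10 prefix a continuation byte
        println!("{:08b}", byte);
    }
}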

Btw, please note that for a codepoint-by-codepoint reversal you don't need to decode at all; just detect where a code point starts and ends.
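
A minimal sketch of that idea (the function name is mine): a byte of the form 0b10xxxxxx continues the current code point, anything else starts a new one, so you can move whole byte chunks around without ever computing their values:

fn reverse_by_codepoints(s: &str) -> String {
    let bytes = s.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut end = bytes.len();
    // walk backwards; a continuation byte (0b10xxxxxx) belongs to
    // the current chunk, anything else starts a code point
    for i in (0..bytes.len()).rev() {
        if bytes[i] & 0b1100_0000 != 0b1000_0000 {
            out.extend_from_slice(&bytes[i..end]);
            end = i;
        }
    }
    // each chunk is still a valid UTF-8 sequence, just reordered
    String::from_utf8(out).unwrap()
}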

The documentation (and implementation!) of the unicode-reverse crate may be illuminating. :slight_smile:
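
If I recall its API correctly, that crate boils the whole task down to a single call (assuming unicode-reverse is added as a dependency):

use unicode_reverse::reverse_graphemes;

fn main() {
    assert_eq!(reverse_graphemes("héllo"), "olléh");
}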

Check out crates.io for a crate to help you. Learning how to find and use crates when you need extra functionality is an integral part of working with the language.

I got it:

When using str::as_bytes(), I get the UTF-8 encoded code point:

let byte_array = "Ü".as_bytes(); 
// = [0xc3, 0x9c] 
// = [0b11000011, 0b10011100]

In 0b110xxxxx and 0b10xxxxxx the 110 and 10 come from the encoding, and the "x" bits assemble into the value of the code point.

My mistake was to assume that 0xc39c is the "code point". It is not; it is the UTF-8 encoding of the code point. I have to extract the u32 code point first, before passing it to std::char::from_u32():

((0xc3 & 0b00111111) << 6) | (0x9c & 0b00111111) 
// = 220
// std::char::from_u32(220) = Some('Ü')
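
Wrapped up as a function, handling only the two-byte case (a sketch; the name is mine):

// decode a two-byte UTF-8 sequence 0b110xxxxx 0b10xxxxxx into a char;
// the lead byte carries 5 value bits, the continuation byte 6
fn decode_two_byte(lead: u8, cont: u8) -> Option<char> {
    let value = (((lead & 0b0001_1111) as u32) << 6)
        | ((cont & 0b0011_1111) as u32);
    std::char::from_u32(value)
}

fn main() {
    assert_eq!(decode_two_byte(0xc3, 0x9c), Some('Ü'));
}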

Thanks!

Yes, because strings in Rust are defined to be UTF-8. Converting a &str to a slice of bytes doesn't do anything except change the type (== weakening the guarantees): the .as_bytes() method returns exactly the same pointer as the str itself, i.e. a view of the underlying UTF-8 encoded buffer.
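
You can observe this yourself (a quick sketch):

fn main() {
    let s = "Ü";
    let bytes = s.as_bytes();
    // both point at the very same buffer; only the type changed
    assert_eq!(s.as_ptr(), bytes.as_ptr());
}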

If you want to reverse a string codepoint-by-codepoint, you don't have to (and consequently, shouldn't) perform the decoding and re-encoding yourself. You can just use the .chars() method that returns an iterator over code points (that Rust calls char).
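
For example:

fn main() {
    let reversed: String = "Übung".chars().rev().collect();
    assert_eq!(reversed, "gnubÜ");
}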

Furthermore, note that UTF-8 is a variable-length encoding on two levels. First, there's the bytes-to-code-points mapping. But even reversing a string by code points isn't correct, because it can, for example, change the order of an accented base character and its combining accent. If you truly want to preserve the visible "characters" and merely change their order, you should reverse what are called grapheme clusters. You can get hold of an iterator over grapheme clusters, for example, by using the unicode-segmentation crate.
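
A sketch using unicode-segmentation; the input contains 'e' followed by a combining acute accent (U+0301), which .chars().rev() would move onto the wrong letter:

use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let s = "he\u{301}llo"; // "héllo" with a combining accent
    let by_graphemes: String = s.graphemes(true).rev().collect();
    assert_eq!(by_graphemes, "olle\u{301}h"); // accent stays on the 'e'
}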
