What does u8 mean?

Hello,
Does u8 mean UTF-8?
Please take a look at the following code:

fn main() {
    for n in 32..127 {
        println!("{}: [{}]", n, n as u8 as char);
    }
    for n in 160..256 {
        println!("{}: [{}]", n, n as u8 as char);
    }
}

Thank you.

u8 is an unsigned 8-bit integer. Basically a single byte.
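To make "a single byte" concrete, here is a minimal sketch. The helper names u8_limits and truncate_to_u8 are made up for illustration; only u8::MIN, u8::MAX, std::mem::size_of and the as cast come from the language and standard library.

```rust
/// Returns (smallest value, largest value, size in bytes) for u8.
/// Hypothetical helper, used only for illustration.
fn u8_limits() -> (u8, u8, usize) {
    (u8::MIN, u8::MAX, std::mem::size_of::<u8>())
}

/// Casts an i32 to u8 with `as`. Values outside 0..=255 are
/// truncated to their low 8 bits rather than rejected.
fn truncate_to_u8(n: i32) -> u8 {
    n as u8
}
```

Here u8_limits() gives (0, 255, 1), and truncate_to_u8(300) gives 44 because only the low 8 bits survive the cast. That is why the loops in the question stop below 256.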


Hello,
Thank you so much for your reply.
Why did I get an error when I changed it to u16?

Because you can't cast a u16 to char with as, only a u8. Coercion and casting in Rust were originally specified in RFC 401, which lists the u8-char-cast as the only integer-to-char cast:

  • e has type u8 and U is char; u8-char-cast

No one asked, but I'll just leave this here.

fn main() {
    (32u8..127)
        .chain(160..=255)
        .for_each(|i| println!("{i}: [{}]", i as char))
}

Using char::from_u32 might be the better option in real code, depending on what you are doing.
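As a rough sketch of that suggestion, a u16 can be widened to u32 and passed to char::from_u32, which returns an Option<char>. The function name u16_to_char is hypothetical; char::from_u32 and u32::from are standard library calls.

```rust
/// Converts a u16 to a char if it is a valid Unicode scalar value.
/// Hypothetical helper: widens to u32, then lets char::from_u32
/// reject values (such as surrogates) that are not valid chars.
fn u16_to_char(n: u16) -> Option<char> {
    char::from_u32(u32::from(n))
}
```

For example, u16_to_char(0x41) gives Some('A') and u16_to_char(0x20AC) gives Some('€'), while u16_to_char(0xD800) gives None because 0xD800 is a surrogate code point.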

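Also, the as conversion is definitely not a UTF-8 conversion: it maps the byte value directly to the code point with the same number. (Other functions do such checks or conversions on a sequence of u8s, which may or may not be valid UTF-8.)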

    dbg!((255u8 as char as u32).to_le_bytes());
    let mut b = [0,0];
    char::encode_utf8(255u8 as char, &mut b);
    dbg!(b);

This prints:

[src/main.rs:15] (255u8 as char as u32).to_le_bytes() = [
    255,
    0,
    0,
    0,
]
[src/main.rs:18] b = [
    195,
    191,
]

Why can you cast a u8 to a char but can't cast u16 and why does from_u32 return an Option? Well, what is a char?

The char type represents a single character. More specifically, since ‘character’ isn’t a well-defined concept in Unicode, char is a ‘Unicode scalar value’.

Clicks link

Unicode Scalar Value. Any Unicode code point except high-surrogate and low-surrogate code points. In other words, the ranges of integers 0 to D7FF₁₆ and E000₁₆ to 10FFFF₁₆ inclusive. (See definition D76 in Section 3.9, Unicode Encoding Forms.)

Aha, so 0xD800 is a valid u16 but not a valid Unicode scalar value, for example, and u32 has even more values which aren't valid Unicode scalar values.

Every u8 value, in contrast, is a valid Unicode scalar value.
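These claims can be checked by brute force. The sketch below uses char::from_u32 as the validity test; the helper names is_scalar_value, all_u8_valid and invalid_u16_count are made up for this example.

```rust
/// True if `n` is a Unicode scalar value, i.e. convertible to char.
fn is_scalar_value(n: u32) -> bool {
    char::from_u32(n).is_some()
}

/// Checks every u8 value from 0 to 255.
fn all_u8_valid() -> bool {
    (0..=u8::MAX).all(|b| is_scalar_value(u32::from(b)))
}

/// Counts the u16 values that are NOT Unicode scalar values
/// (the surrogate range 0xD800..=0xDFFF).
fn invalid_u16_count() -> usize {
    (0..=u16::MAX)
        .filter(|&n| !is_scalar_value(u32::from(n)))
        .count()
}
```

all_u8_valid() is true, whereas invalid_u16_count() is 2048, the size of the surrogate range. For u32, everything above 0x10FFFF is invalid as well.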

Note that none of the values mentioned so far refer to UTF-8. UTF-8 is a variable-length encoding in which each scalar value becomes a sequence of one to four bytes; not all sequences (or even all byte values) are valid UTF-8. The ASCII values (0x00 to 0x7F) encode as a single byte, whereas the remaining Latin-1 values (0x80 to 0xFF) encode as two bytes, for example.

    let ascii = 0x61_u8 as char;
    let latin = 0xc0_u8 as char;
    
    assert_eq!(ascii.len_utf8(), 1);
    assert_eq!(latin.len_utf8(), 2);

String and str, on the other hand, store valid UTF-8 byte sequences.
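To illustrate the difference between a raw byte and its UTF-8 encoding, here is a small sketch built on std::str::from_utf8, which validates a byte slice. The helper names is_valid_utf8 and byte_as_char_utf8 are hypothetical.

```rust
/// True if `bytes` is a valid UTF-8 sequence and could back a &str.
fn is_valid_utf8(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}

/// Casts a byte to char with `as`, then returns the UTF-8 bytes
/// a String would actually store for that char.
fn byte_as_char_utf8(b: u8) -> Vec<u8> {
    (b as char).to_string().into_bytes()
}
```

The lone byte 0xFF is not valid UTF-8, so is_valid_utf8(&[0xFF]) is false. The char 255u8 as char (ÿ) is stored as the two bytes [195, 191], and is_valid_utf8(&[0xC3, 0xBF]) is true.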
