Hello,
Does u8 mean UTF-8?
Please take a look at the following code:
fn main() {
    for n in 32..127 {
        println!("{}: [{}]", n, n as u8 as char);
    }
    for n in 160..256 {
        println!("{}: [{}]", n, n as u8 as char);
    }
}
Thank you.
Hello,
Thank you so much for your reply.
Why, when I changed it to u16, did it show me an error?
Because you can't cast u16 to char, only u8. Casting in Rust was originally defined in RFC 401, which includes this rule:

e has type u8 and U is char; u8-char-cast
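For illustration, a minimal sketch of the difference (the commented-out line is what the compiler rejects):

```rust
fn main() {
    // Allowed: u8 -> char is an infallible cast,
    // since every u8 value is a valid Unicode scalar value
    let c = 65u8 as char;
    assert_eq!(c, 'A');

    // Not allowed: `65u16 as char` fails with error[E0604]:
    // "only `u8` can be cast as `char`, not `u16`"

    // For wider integer types, use the fallible char::from_u32 instead
    let c2 = char::from_u32(65).unwrap();
    assert_eq!(c2, 'A');

    println!("{c} {c2}");
}
```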
No one asked, but I'll just leave it here.
fn main() {
    (32u8..127)
        .chain(160..=255)
        .for_each(|i| println!("{i}: [{}]", i as char))
}
Using char::from_u32 might be the better option in code (depending on what you are doing). Also, as conversion is definitely not UTF-8. (Other functions do such checks/conversions from u8 slices, which may hold a UTF-8 byte sequence.)
dbg!((255u8 as char as u32).to_le_bytes());
let mut b = [0, 0];
char::encode_utf8(255u8 as char, &mut b);
dbg!(b);
[src/main.rs:15] (255u8 as char as u32).to_le_bytes() = [
255,
0,
0,
0,
]
[src/main.rs:18] b = [
195,
191,
]
Why can you cast a u8 to a char but can't cast a u16, and why does from_u32 return an Option? Well, what is a char?
The char type represents a single character. More specifically, since 'character' isn't a well-defined concept in Unicode, char is a 'Unicode scalar value'.
Clicks link
Unicode Scalar Value. Any Unicode code point except high-surrogate and low-surrogate code points. In other words, the ranges of integers 0 to D7FF₁₆ and E000₁₆ to 10FFFF₁₆, inclusive. (See definition D76 in Section 3.9, Unicode Encoding Forms.)
Aha, so 0xD800 is a valid u16 but not a valid Unicode scalar value, for example, and u32 has even more values which aren't valid Unicode scalar values.
Every u8 value, in contrast, is a valid Unicode scalar value.
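Those ranges can be checked directly with char::from_u32, which is exactly why it returns an Option:

```rust
fn main() {
    // Every u8 value (0..=255) is a valid scalar value, so this always succeeds
    assert_eq!(char::from_u32(0xFF), Some('ÿ'));

    // 0xD800 is a high surrogate: a valid u16/u32, but not a scalar value
    assert_eq!(char::from_u32(0xD800), None);

    // Values beyond 0x10FFFF are not Unicode code points at all
    assert_eq!(char::from_u32(0x110000), None);

    println!("ok");
}
```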
Note that none of the values mentioned so far refer to UTF-8. UTF-8 is a variable-length encoding consisting of a sequence of bytes; not all sequences (or even all byte values) are valid UTF-8. The ASCII values encode as a single byte, whereas the Latin-1 values above 0x7F encode as two bytes, for example.
let ascii = 0x61_u8 as char; // 'a'
let latin = 0xc0_u8 as char; // 'À'
assert_eq!(ascii.len_utf8(), 1);
assert_eq!(latin.len_utf8(), 2);
String and str, on the other hand, store valid UTF-8 byte sequences.
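A small sketch of that distinction, using U+00FF ('ÿ') from the examples above: as a char it is one scalar value, but in a String it occupies two bytes, and the raw byte 0xFF on its own is not valid UTF-8.

```rust
fn main() {
    // One char, but two bytes of UTF-8 storage
    let s = String::from("ÿ");
    assert_eq!(s.chars().count(), 1);
    assert_eq!(s.len(), 2); // len() counts bytes, not chars

    // A lone 0xFF byte is not a valid UTF-8 sequence
    assert!(std::str::from_utf8(&[0xFF]).is_err());

    // The actual UTF-8 encoding of U+00FF is [0xC3, 0xBF] (195, 191)
    assert_eq!(std::str::from_utf8(&[0xC3, 0xBF]).unwrap(), "ÿ");

    println!("ok");
}
```

The [0xC3, 0xBF] pair is the same [195, 191] that encode_utf8 printed earlier in the thread.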