Hello,
Does u8 mean UTF-8?
Please take a look at the following code:
fn main() {
    for n in 32..127 {
        println!("{}: [{}]", n, n as u8 as char);
    }
    for n in 160..256 {
        println!("{}: [{}]", n, n as u8 as char);
    }
}
Thank you.
Hello,
Thank you so much for your reply.
Why, when I changed it to u16, did it show me an error?
Because you can't cast u16 to char, only u8. Casting in Rust was originally defined in RFC 401, which includes this rule:

e has type u8 and U is char; u8-char-cast
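For illustration, a minimal sketch of the difference (the commented-out line is what the compiler rejects):

```rust
fn main() {
    // Allowed: u8 -> char is an infallible cast,
    // since every u8 value is a valid Unicode scalar value
    let c = 65u8 as char;
    assert_eq!(c, 'A');

    // Not allowed: `65u16 as char` fails with error[E0604]:
    // "only `u8` can be cast as `char`, not `u16`"

    // For wider integer types, use the fallible char::from_u32 instead
    let c2 = char::from_u32(65).unwrap();
    assert_eq!(c2, 'A');

    println!("{c} {c2}");
}
```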
No one asked, but I'll just leave it here.
fn main() {
    (32u8..127)
        .chain(160..=255)
        .for_each(|i| println!("{i}: [{}]", i as char))
}
Using char::from_u32 might be the better option in code (depending on what you are doing). Also, as conversion is definitely not UTF-8. (Other functions do such checks/conversions from u8 slices, which may hold a UTF-8 byte sequence.)
dbg!((255u8 as char as u32).to_le_bytes());
let mut b = [0, 0];
char::encode_utf8(255u8 as char, &mut b);
dbg!(b);
[src/main.rs:15] (255u8 as char as u32).to_le_bytes() = [
255,
0,
0,
0,
]
[src/main.rs:18] b = [
195,
191,
]
Why can you cast a u8 to a char but can't cast a u16, and why does from_u32 return an Option? Well, what is a char?
The char type represents a single character. More specifically, since 'character' isn't a well-defined concept in Unicode, char is a 'Unicode scalar value'.
Clicks link
Unicode Scalar Value. Any Unicode code point except high-surrogate and low-surrogate code points. In other words, the ranges of integers 0 to D7FF₁₆ and E000₁₆ to 10FFFF₁₆, inclusive. (See definition D76 in Section 3.9, Unicode Encoding Forms.)
Aha, so 0xD800 is a valid u16 but not a valid Unicode scalar value, for example, and u32 has even more values which aren't valid Unicode scalar values.
Every u8 value, in contrast, is a valid Unicode scalar value.
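Those ranges can be checked directly with char::from_u32, which is exactly why it returns an Option:

```rust
fn main() {
    // Every u8 value (0..=255) is a valid scalar value, so this always succeeds
    assert_eq!(char::from_u32(0xFF), Some('ÿ'));

    // 0xD800 is a high surrogate: a valid u16/u32, but not a scalar value
    assert_eq!(char::from_u32(0xD800), None);

    // Values beyond 0x10FFFF are not Unicode code points at all
    assert_eq!(char::from_u32(0x110000), None);

    println!("ok");
}
```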
Note that none of the values mentioned so far refer to UTF-8. UTF-8 is a variable-length encoding consisting of a sequence of bytes; not all sequences (or even all byte values) are valid UTF-8. The ASCII values encode as a single byte, whereas the Latin-1 values above 0x7F encode as two bytes, for example.
let ascii = 0x61_u8 as char; // 'a'
let latin = 0xc0_u8 as char; // 'À'
assert_eq!(ascii.len_utf8(), 1);
assert_eq!(latin.len_utf8(), 2);
String and str, on the other hand, store valid UTF-8 byte sequences.
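A small sketch of that distinction, using U+00FF ('ÿ') from the examples above: as a char it is one scalar value, but in a String it occupies two bytes, and the raw byte 0xFF on its own is not valid UTF-8.

```rust
fn main() {
    // One char, but two bytes of UTF-8 storage
    let s = String::from("ÿ");
    assert_eq!(s.chars().count(), 1);
    assert_eq!(s.len(), 2); // len() counts bytes, not chars

    // A lone 0xFF byte is not a valid UTF-8 sequence
    assert!(std::str::from_utf8(&[0xFF]).is_err());

    // The actual UTF-8 encoding of U+00FF is [0xC3, 0xBF] (195, 191)
    assert_eq!(std::str::from_utf8(&[0xC3, 0xBF]).unwrap(), "ÿ");

    println!("ok");
}
```

The [0xC3, 0xBF] pair is the same [195, 191] that encode_utf8 printed earlier in the thread.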