I am trying to print the characters of the name string below without using any built-in str methods.
let name = "stack";
string_to_char(&name);
fn string_to_char (t: &str)
{
    let r: &char = t as &char; // Rust doesn't allow casting &str to &char
    /* do something to print each character in name */
}
Easy, just use the methods and inline them by hand!
We start with this.
fn string_to_char(t: &str) {
    for c in t.chars() {
        print!("{c}");
    }
}
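Then we inline chars() to get this (conceptually only: the iter field of Chars is private, so this step compiles only inside the standard library):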
fn string_to_char(t: &str) {
    let mut chars = Chars { iter: t.as_bytes().iter() };
    while let Some(c) = chars.next() {
        print!("{c}");
    }
}
Then we inline Chars::next to get this:
#![feature(str_internals)]
use core::str::next_code_point;

fn string_to_char(t: &str) {
    let mut iter = t.as_bytes().iter();
    unsafe {
        while let Some(ch) = next_code_point(&mut iter) {
            let c = char::from_u32_unchecked(ch);
            print!("{c}");
        }
    }
}
You are done if you are fine with a nightly feature and with using next_code_point.
If not, we copy and paste next_code_point into our own code, and finally get this:
fn string_to_char(t: &str) {
    let mut iter = t.as_bytes().iter();
    unsafe {
        while let Some(ch) = next_code_point(&mut iter) {
            let c = char::from_u32_unchecked(ch);
            print!("{c}");
        }
    }
}

#[inline]
pub unsafe fn next_code_point<'a, I: Iterator<Item = &'a u8>>(bytes: &mut I) -> Option<u32> {
    // Decode UTF-8
    let x = *bytes.next()?;
    if x < 128 {
        return Some(x as u32);
    }

    // Multibyte case follows
    // Decode from a byte combination out of: [[[x y] z] w]
    // NOTE: Performance is sensitive to the exact formulation here
    let init = utf8_first_byte(x, 2);
    // SAFETY: `bytes` produces an UTF-8-like string,
    // so the iterator must produce a value here.
    let y = unsafe { *bytes.next().unwrap_unchecked() };
    let mut ch = utf8_acc_cont_byte(init, y);
    if x >= 0xE0 {
        // [[x y z] w] case
        // 5th bit in 0xE0 .. 0xEF is always clear, so `init` is still valid
        // SAFETY: `bytes` produces an UTF-8-like string,
        // so the iterator must produce a value here.
        let z = unsafe { *bytes.next().unwrap_unchecked() };
        let y_z = utf8_acc_cont_byte((y & CONT_MASK) as u32, z);
        ch = init << 12 | y_z;
        if x >= 0xF0 {
            // [x y z w] case
            // use only the lower 3 bits of `init`
            // SAFETY: `bytes` produces an UTF-8-like string,
            // so the iterator must produce a value here.
            let w = unsafe { *bytes.next().unwrap_unchecked() };
            ch = (init & 7) << 18 | utf8_acc_cont_byte(y_z, w);
        }
    }

    Some(ch)
}

/// Returns the initial codepoint accumulator for the first byte.
/// The first byte is special, only want bottom 5 bits for width 2, 4 bits
/// for width 3, and 3 bits for width 4.
#[inline]
const fn utf8_first_byte(byte: u8, width: u32) -> u32 {
    (byte & (0x7F >> width)) as u32
}

/// Returns the value of `ch` updated with continuation byte `byte`.
#[inline]
const fn utf8_acc_cont_byte(ch: u32, byte: u8) -> u32 {
    (ch << 6) | (byte & CONT_MASK) as u32
}

const CONT_MASK: u8 = 0b0011_1111;

fn main() {
    string_to_char("我是一个好人");
}
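To make the bit arithmetic in the copied decoder concrete, here is a small sketch that decodes only the first character of the test string by hand. The helper name decode_three_bytes is made up for this illustration and is not part of std. It does no validation, so it only makes sense for a well-formed three-byte sequence.

```rust
const CONT_MASK: u8 = 0b0011_1111;

/// Hypothetical helper (not part of std): combines one well-formed
/// three-byte UTF-8 sequence into a code point. No validation is done.
fn decode_three_bytes(x: u8, y: u8, z: u8) -> u32 {
    // The lead byte 1110xxxx carries the top 4 bits of the code point.
    let init = (x & 0b0000_1111) as u32;
    // Each continuation byte 10xxxxxx carries 6 more bits.
    let y_bits = (y & CONT_MASK) as u32;
    let z_bits = (z & CONT_MASK) as u32;
    (init << 12) | (y_bits << 6) | z_bits
}

fn main() {
    // '我' is encoded in UTF-8 as the bytes E6 88 91.
    let code_point = decode_three_bytes(0xE6, 0x88, 0x91);
    // char::from_u32 is the checked counterpart of from_u32_unchecked.
    let c = char::from_u32(code_point);
    println!("U+{code_point:04X} {c:?}");
}
```

The three bytes contribute 4 + 6 + 6 bits, which combine to U+6211. The two-byte and four-byte cases in next_code_point follow the same pattern with a different number of bits taken from the lead byte.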
Oh right. I forgot to inline as_bytes into std::mem::transmute. But my reply was mostly focused on "you need non-trivial code to handle Unicode code points", soooo I guess I will let it slip.
Transmuting is a bad idea because the layout of fat pointers is not stable, so it could become UB with future compiler versions. The standard library is versioned together with the compiler, so it gets to cheat and is allowed to ignore stability concerns. But casting a pointer (*const str to *const u8) should probably work fine.
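A minimal sketch of the pointer-cast route, as a variation on the cast mentioned above: going to *const [u8] rather than *const u8 keeps the length, so no str method is needed to recover it. The function name str_bytes is made up for this example. I am assuming the cast preserves the slice length, which is what the standard library itself appears to rely on in as_bytes_mut.

```rust
/// Hypothetical replacement for as_bytes that avoids transmute.
fn str_bytes(t: &str) -> &[u8] {
    // Fat pointer to fat pointer: the address and the length are carried over.
    let ptr = t as *const str as *const [u8];
    // SAFETY: ptr comes from a valid &str, and the returned slice borrows
    // for the same lifetime as t, so it stays valid for reads that long.
    unsafe { &*ptr }
}

fn main() {
    // Print the raw UTF-8 bytes of the string in hex.
    for b in str_bytes("stack") {
        print!("{b:02x} ");
    }
    println!();
}
```

Unlike transmute, this does not depend on how the compiler lays out the fat pointer in memory. It only relies on the cast semantics between the two slice-like pointer types.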
It always depends on the notion of "undefined behavior" you apply. I wanted to avoid the discussion about different kinds of UB by changing my comment not to clearly say "it is UB", but apparently I failed in that it now too strongly implies that it isn't UB.
You are of course right that with a "library UB"-style notion of undefined behavior, it clearly is UB already (even though this is not really about the standard library, as str is a built-in type), since it's an operation that is allowed to become true "language UB" undefined behavior in the future. On the other hand, with a "the stuff that Miri would report"-style notion of undefined behavior, it isn't UB; otherwise it wouldn't be allowed to happen internally within the standard library either.
Edit: thinking about better wording that avoids the question of what is or isn't which kind of UB, maybe I should have simply said that "it is unsound".