How to extract characters from a str in Rust without using built-in methods?

I am trying to print the characters of the str name below without using any built-in str methods.

let name = "stack";
string_to_char(&name);
    
fn string_to_char (t: &str)
{
    let r: &char = t as &char; // Rust doesn't allow casting &str to &char
    /* do something to print each character in name */

}

How can I achieve the above?

But why do you not want to use the methods?


It's not like the chars fall out of the str by themselves, there needs to be some code that actually implements the UTF-8 decoding.
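To illustrate what that decoding involves, here is a minimal sketch (the helper name utf8_len is hypothetical, not something from the standard library). In UTF-8 the leading byte of each character determines how many bytes that character occupies, which is why a &str cannot simply be reinterpreted as a sequence of char:

```rust
/// Hypothetical helper: number of bytes in the UTF-8 sequence that starts
/// with `first`, or `None` if `first` cannot start a sequence.
fn utf8_len(first: u8) -> Option<usize> {
    match first {
        0x00..=0x7F => Some(1), // ASCII, one byte
        0xC2..=0xDF => Some(2), // 110xxxxx
        0xE0..=0xEF => Some(3), // 1110xxxx
        0xF0..=0xF4 => Some(4), // 11110xxx
        _ => None,              // continuation byte or invalid lead byte
    }
}
```

For example, utf8_len(b's') gives Some(1), while the first byte of "我" (0xE6) gives Some(3).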


Easy, just use the methods and inline them by hand!

We start with this.

fn string_to_char(t: &str) {
    for c in t.chars() {
        print!("{c}");
    }
}

Then to this:

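fn string_to_char(t: &str) {
    // Conceptual step only: the `iter` field of `Chars` is private to the
    // standard library, so this does not compile in user code.
    let mut chars = Chars { iter: t.as_bytes().iter() };
    while let Some(c) = chars.next() {
        print!("{c}");
    }
}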

To this:

#![feature(str_internals)]
use core::str::next_code_point;

fn string_to_char(t: &str) {
    let mut iter = t.as_bytes().iter();
    unsafe {
        while let Some(ch) = next_code_point(&mut iter) {
            let c = char::from_u32_unchecked(ch);
            print!("{c}");
        }
    }
}

You are done if you are fine with a nightly feature and with using next_code_point.

If not, we copy next_code_point into our own code, and finally get this:

fn string_to_char(t: &str) {
    let mut iter = t.as_bytes().iter();
    unsafe {
        while let Some(ch) = next_code_point(&mut iter) {
            let c = char::from_u32_unchecked(ch);
            print!("{c}");
        }
    }
}

#[inline]
pub unsafe fn next_code_point<'a, I: Iterator<Item = &'a u8>>(bytes: &mut I) -> Option<u32> {
    // Decode UTF-8
    let x = *bytes.next()?;
    if x < 128 {
        return Some(x as u32);
    }

    // Multibyte case follows
    // Decode from a byte combination out of: [[[x y] z] w]
    // NOTE: Performance is sensitive to the exact formulation here
    let init = utf8_first_byte(x, 2);
    // SAFETY: `bytes` produces an UTF-8-like string,
    // so the iterator must produce a value here.
    let y = unsafe { *bytes.next().unwrap_unchecked() };
    let mut ch = utf8_acc_cont_byte(init, y);
    if x >= 0xE0 {
        // [[x y z] w] case
        // 5th bit in 0xE0 .. 0xEF is always clear, so `init` is still valid
        // SAFETY: `bytes` produces an UTF-8-like string,
        // so the iterator must produce a value here.
        let z = unsafe { *bytes.next().unwrap_unchecked() };
        let y_z = utf8_acc_cont_byte((y & CONT_MASK) as u32, z);
        ch = init << 12 | y_z;
        if x >= 0xF0 {
            // [x y z w] case
            // use only the lower 3 bits of `init`
            // SAFETY: `bytes` produces an UTF-8-like string,
            // so the iterator must produce a value here.
            let w = unsafe { *bytes.next().unwrap_unchecked() };
            ch = (init & 7) << 18 | utf8_acc_cont_byte(y_z, w);
        }
    }

    Some(ch)
}

/// Returns the initial codepoint accumulator for the first byte.
/// The first byte is special, only want bottom 5 bits for width 2, 4 bits
/// for width 3, and 3 bits for width 4.
#[inline]
const fn utf8_first_byte(byte: u8, width: u32) -> u32 {
    (byte & (0x7F >> width)) as u32
}

/// Returns the value of `ch` updated with continuation byte `byte`.
#[inline]
const fn utf8_acc_cont_byte(ch: u32, byte: u8) -> u32 {
    (ch << 6) | (byte & CONT_MASK) as u32
}

const CONT_MASK: u8 = 0b0011_1111;

fn main() {
    string_to_char("我是一个好人");
}

And we are done! 🎉
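As a worked check of the three-byte branch, here is a small sketch (decode_three is a hypothetical name) that applies the same masking and shifting to the bytes of "我", which are 0xE6 0x88 0x91 in UTF-8:

```rust
const CONT_MASK: u8 = 0b0011_1111;

/// Hypothetical helper: decodes one three-byte UTF-8 sequence using the
/// same arithmetic as the [[x y z] w] case above. Assumes valid input.
fn decode_three(x: u8, y: u8, z: u8) -> u32 {
    // Same mask that utf8_first_byte(x, 2) applies: 0xE6 & 0x1F = 0x06.
    // Bit 4 of a three-byte lead byte is always clear, so only 4 bits survive.
    let init = (x & (0x7F >> 2)) as u32;
    // Low 6 bits of each continuation byte: (0x08 << 6) | 0x11 = 0x211.
    let y_z = (((y & CONT_MASK) as u32) << 6) | (z & CONT_MASK) as u32;
    // 0x06 << 12 = 0x6000, so the result is 0x6211.
    (init << 12) | y_z
}
```

Calling decode_three(0xE6, 0x88, 0x91) yields 0x6211, and char::from_u32(0x6211) is Some('我').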


Well, as_bytes is still a method of str. (Though of course, the requirement not to use these methods is nonsensical in the first place.)


Oh right. I forgot to inline as_bytes into std::mem::transmute. But my reply was mostly focused on the point that you need non-trivial code to handle Unicode code points, so I guess I will let it slip.


Transmuting is a bad idea because the layout of fat pointers is not stable, so it could become UB with future compiler versions. The standard library is versioned together with the compiler, so it gets to cheat and is allowed to ignore stability concerns. But casting a pointer (*const str to *const u8) should probably work fine.
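A minimal sketch of the pointer-cast idea, under the assumption that an `as` cast does not count as a str method (the name str_bytes is hypothetical). Note that it casts to *const [u8] rather than *const u8, a variant chosen here so that the length travels with the pointer and no call to len is needed:

```rust
/// Hypothetical helper: view the bytes of a `&str` without calling
/// `as_bytes` or `transmute`.
fn str_bytes(t: &str) -> &[u8] {
    // A raw-pointer cast between two slice-like types keeps the length metadata.
    let ptr = t as *const str as *const [u8];
    // SAFETY: `ptr` comes from a valid `&str`, and `str` has the same layout
    // as `[u8]`, so it points to `t`'s bytes for as long as `t` is borrowed.
    unsafe { &*ptr }
}
```

With this, str_bytes("stack") is the five ASCII bytes of "stack", and the result could be fed to a decoder such as next_code_point through .iter().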


It is undefined behavior now. It's not only that it might break in future versions.

It always depends on the notion of "undefined behavior" you apply. I wanted to avoid the discussion about different kinds of UB by changing my comment not to clearly say "it is UB", but apparently I failed in that it now too strongly implies that it isn't UB.

You are of course right that with a notion of "library UB"-style undefined behavior, it clearly is UB already (even though this is not really about the standard library, as str is a built-in type), since it's an operation that is allowed to become true "language UB" undefined behavior in the future. On the other hand, with a "the stuff that miri would report"-style notion of undefined behavior, it isn't UB; otherwise it wouldn't be allowed to happen internally within the standard library either.


Edit: Thinking about better wording that avoids the question of what is or isn't which kind of UB, maybe I should have simply said that "it is unsound".


Not using methods on str is like getting rid of the std and core libraries, which does not make any sense to me.

