How to construct a raw `str`, not `&str`

str is a primitive type in Rust. String literals in Rust are &str by default.
Is there any way to construct a raw str? Or is a raw str even meaningful in Rust?

str is an unsized type, which means you can't directly own it. You can only ever have it behind some kind of pointer (such as &str or Box<str>). The backing data, however, could live anywhere, be unsized or sized; the only requirement is that it's a valid sequence of UTF-8 bytes.
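
To make that concrete, here is a minimal sketch of the same text held behind three different pointer types. The function name `holders` is made up for illustration. A bare `let s: str = ...;` binding is rejected by the compiler, because the size of `str` is not known at compile time.

```rust
use std::rc::Rc;

/// Returns the same text behind a borrowed, a boxed, and a
/// reference-counted pointer (hypothetical helper for illustration).
pub fn holders() -> (&'static str, Box<str>, Rc<str>) {
    let borrowed: &'static str = "hello"; // points into the binary
    let boxed: Box<str> = Box::from("hello"); // owns a heap copy
    let shared: Rc<str> = Rc::from("hello"); // shared heap copy
    (borrowed, boxed, shared)
}
```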

A metaphor I read somewhere is that str and [T] are a bit like the water in a glass. They can be kept in different containers, but need at least something to hold them. str is typically stored in a String or in static memory, and [T] is typically the content of an array or Vec<T>. They are the sequences of data inside those containers, not containers on their own.
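
As a rough illustration of the metaphor, the sketch below passes "water" from three different "glasses" (static memory, a heap-allocated String, and a stack array) to one function that only sees a &str. The names `byte_len` and `from_three_containers` are assumptions for this example.

```rust
/// Works on any `str`, regardless of which container holds it.
pub fn byte_len(s: &str) -> usize {
    s.len()
}

/// Borrows a `str` out of three different containers.
pub fn from_three_containers() -> [usize; 3] {
    let literal: &'static str = "water"; // stored in the binary
    let heap: String = String::from("water"); // stored on the heap
    let stack: [u8; 5] = *b"water"; // stored on the stack
    let stack_str: &str = std::str::from_utf8(&stack).expect("valid UTF-8");
    [byte_len(literal), byte_len(&heap), byte_len(stack_str)]
}
```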

Is there a "sized array of UTF-8-encoded bytes" type, in the analogy "??? is to str as [u8; N] is to [u8]"?

Not natively in std, because it turns out that such a type isn't very useful. Since UTF-8 is a variable-length encoding, even properly generalized uppercasing may need to add or remove characters. You'd have to determine the size of the container in bytes, not characters, which is a little counterintuitive.
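
A small sketch of both effects, using only std: byte length and character count differ, and uppercasing can change the number of characters. The helper names are hypothetical.

```rust
/// Returns (length in bytes, number of `char`s) for a string.
pub fn byte_and_char_len(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

/// Uppercases "ß" and reports the lengths before and after.
pub fn sharp_s_lengths() -> ((usize, usize), (usize, usize)) {
    let lower = "ß";
    let upper = lower.to_uppercase(); // "SS": one char became two
    (byte_and_char_len(lower), byte_and_char_len(&upper))
}
```

Here the byte length happens to stay at 2 while the character count goes from 1 to 2, so a buffer sized "in characters" would already be wrong.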

That said, the arrayvec crate provides an ArrayString type. Others have also done the same, though they aren't as popular.

I think such a type would be useful for string literals.

Just adding on to what @Kyllingene said:

Not only is UTF-8 a variable-width encoding, Unicode itself is variable width for what humans would consider characters (due to combining code points, emoji joiners, and so on).
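
For instance, the same visible "é" can be one code point or two. A minimal sketch (the function name is made up for illustration):

```rust
/// Compares precomposed "é" (U+00E9) with "e" followed by a
/// combining acute accent (U+0301). Each entry is (bytes, chars).
pub fn e_acute_forms() -> [(usize, usize); 2] {
    let precomposed = "\u{e9}";
    let combining = "e\u{301}";
    [
        (precomposed.len(), precomposed.chars().count()),
        (combining.len(), combining.chars().count()),
    ]
}
```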

But std will probably get an ASCII analog to char someday, and one could use [ascii::Char; N] where applicable.


What extra utility over &str are you thinking of?

Also note that it can't be [Utf8Byte; N], because then I could overwrite one Utf8Byte with another that results in invalid UTF-8 overall. So it'd have to be some opaque FixedLengthStr<N> that only offers the same sort of access that a str offers (&mut str or &[u8], etc., but no &mut [u8]).
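
A quick sketch of that failure mode with a plain byte array (the function name is hypothetical): both bytes written here can occur in valid UTF-8, but the resulting sequence is not valid.

```rust
/// Overwrites one byte of a valid UTF-8 buffer and reports whether
/// the buffer was valid before and after the write.
pub fn overwrite_breaks_utf8() -> (bool, bool) {
    let mut bytes: [u8; 2] = [0xC3, 0xA9]; // the encoding of "é"
    let before = std::str::from_utf8(&bytes).is_ok();
    // b'a' is a valid byte by itself, but 0xC3 must be followed by a
    // continuation byte in the range 0x80..=0xBF.
    bytes[1] = b'a';
    let after = std::str::from_utf8(&bytes).is_ok();
    (before, after)
}
```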

True string literals ("xyz") are stored in the binary, so you don't need to declare a fixed length str for them.

Do you mean you want to store various strs that have a maximum length in fixed sized buffers? If not, please describe what you want to do.

One use is for when you have Vec<&'static str> with a bunch of short, equal length strings. Each &'static str is two whole pointers in size, but Vec<FixedSizeStr<N>> could be much smaller. They're also contiguous in memory and less indirect.
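
To put numbers on it, here is a sketch comparing the per-element size of the two layouts, assuming 3-byte entries stored as [u8; 3]. The function name is made up, and the &str size depends on the target's pointer width.

```rust
use std::mem::size_of;

/// Bytes per element: a `&'static str` (pointer plus length)
/// versus a 3-byte fixed-size entry.
pub fn element_sizes() -> (usize, usize) {
    (size_of::<&'static str>(), size_of::<[u8; 3]>())
}
```

On a 64-bit target this is 16 bytes versus 3 bytes per element, before counting the string data that each &str points to.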

Yes, and the obvious solution is to use Vec<u8> and do the conversions between &[u8] and &str. But I don't know if that's what the OP needs.
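
A minimal sketch of that approach, assuming every entry has the same byte length. `FlatStrings` and its methods are hypothetical names, not an existing API.

```rust
/// Stores equal-length strings back to back in one byte buffer.
/// `WIDTH` is the length of every entry in bytes.
pub struct FlatStrings<const WIDTH: usize> {
    bytes: Vec<u8>,
}

impl<const WIDTH: usize> FlatStrings<WIDTH> {
    pub fn new() -> Self {
        Self { bytes: Vec::new() }
    }

    /// Appends `s` if it is exactly `WIDTH` bytes long.
    pub fn push(&mut self, s: &str) -> bool {
        if s.len() != WIDTH {
            return false;
        }
        self.bytes.extend_from_slice(s.as_bytes());
        true
    }

    /// Returns the `index`-th entry, re-validating it as UTF-8.
    pub fn get(&self, index: usize) -> Option<&str> {
        let start = index.checked_mul(WIDTH)?;
        let end = start.checked_add(WIDTH)?;
        std::str::from_utf8(self.bytes.get(start..end)?).ok()
    }
}
```

The validation in `get` always succeeds here, since only whole strs are ever pushed; it is kept so the sketch needs no unsafe code.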

I don't have a specific use in mind, since str::len is const, but it seems weird that b"hello" is a narrow &[u8; 5] but "hello" is a wide &str. I can either express in the type system that a string has a particular length in bytes, or that it's valid UTF-8, but not both.
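
The asymmetry can be sketched like this. The constant names are made up for illustration; the loop is just one way to copy bytes in a const context.

```rust
/// The byte-string literal keeps its length in the type...
pub const BYTES: &[u8; 5] = b"hello";
/// ...the string literal does not, though its length is still a constant.
pub const TEXT: &str = "hello";
pub const TEXT_LEN: usize = TEXT.len();

/// Recovers a fixed-size array from the string literal at compile time.
/// The array type no longer records that the bytes are valid UTF-8.
pub const TEXT_BYTES: [u8; TEXT_LEN] = {
    let src = TEXT.as_bytes();
    let mut out = [0u8; TEXT_LEN];
    let mut i = 0;
    while i < TEXT_LEN {
        out[i] = src[i];
        i += 1;
    }
    out
};
```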

That's correct and I don't know what else to add. (You said you thought it would be useful for string literals, that's the only reason I asked.)

When do you realistically have this in practice though, unless it’s because they’re all ASCII-only?

Yeah, never. [ascii::Char; N] would work, stability permitting.

Well, hypothetically, an application might be generating text using a specific hardcoded group of characters that doesn't span multiple UTF-8 lengths and is not ASCII — for example, digits other than the Arabic digits in ASCII, or one of the variations like ⓪①②③… (which is not encoded in numeric sequence!), or any other set of characters that is used in a systematic fashion — most such groups will be all 2, all 3, or all 4 bytes.

These could of course be encoded to UTF-8 from a [char], but one might want to skip the UTF-8 encoding step. It’s unlikely that such a text generator would need to be optimized heavily for speed, but one might want to optimize it for code size, especially if, say, it's a library that will go into WASM applications which often won’t use all of its capabilities, so the overhead of its support for particular outputs should be kept small.

This is a set of requirements that won’t commonly appear together, and certainly not so often as to deserve language syntax support, but none of them is individually unrealistic.


That said, in that special case, it would not be hard to implement by slicing a plain string; it has a requirement about what characters appear in the string that isn't statically checked, but that could be done with a const assertion or unit test.

fn circled_number(i: u8) -> Option<&'static str> {
    // Every character in this string is 3 bytes long.
    let s = "\
        ⓪①②③④⑤⑥⑦⑧⑨\
        ⑩⑪⑫⑬⑭⑮⑯⑰⑱⑲\
        ⑳㉑㉒㉓㉔㉕㉖㉗㉘㉙\
        ㉚㉛㉜㉝㉞㉟㊱㊲㊳㊴\
        ㊵㊶㊷㊸㊹㊺㊻㊼㊽㊾㊿";
    // Once this is tested you can use unsafe { get_unchecked() }
    // and skip the UTF-8 boundary check
    s.get(usize::from(i) * 3..usize::from(i + 1) * 3)
}
fn main() {
    println!(
        "{}{}{}",
        circled_number(0).unwrap(),
        circled_number(34).unwrap(),
        circled_number(50).unwrap()
    );
}

In the get_unchecked form (with an input range check, so only assuming the validity of s, not being unsound), this compiles down to just 44 bytes of x86-64 code and 153 bytes of lookup table, for a total of 197 bytes — which is slightly smaller than the equivalent [char; 51] that also requires UTF-8 encoding when it’s used!
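
For reference, a sketch of what the const assertion and the get_unchecked form could look like together. It uses a shortened table, and the names `CIRCLED`, `ENTRIES`, `all_three_byte_chars` and `circled_digit` are assumptions for this example; it is not the exact code that was measured above.

```rust
/// Shortened table for illustration: every character is 3 bytes in UTF-8.
const CIRCLED: &str = "⓪①②③④⑤⑥⑦⑧⑨";
const ENTRIES: usize = CIRCLED.len() / 3;

/// True if `s` consists solely of 3-byte UTF-8 sequences.
pub const fn all_three_byte_chars(s: &str) -> bool {
    let b = s.as_bytes();
    if b.len() % 3 != 0 {
        return false;
    }
    let mut i = 0;
    while i < b.len() {
        // A 3-byte sequence starts with a lead byte of the form 1110xxxx.
        // `s` is valid UTF-8, so the two bytes after it are continuations.
        if b[i] & 0xF0 != 0xE0 {
            return false;
        }
        i += 3;
    }
    true
}

// Compile-time check: the build fails if the table is edited incorrectly.
const _: () = assert!(all_three_byte_chars(CIRCLED));

pub fn circled_digit(i: u8) -> Option<&'static str> {
    let i = usize::from(i);
    if i >= ENTRIES {
        return None;
    }
    // SAFETY: `i < ENTRIES`, so the range is in bounds, and the const
    // assertion above guarantees it starts and ends on char boundaries.
    Some(unsafe { CIRCLED.get_unchecked(i * 3..i * 3 + 3) })
}
```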


But getting back on topic after that fun tangent on code size micro-optimization: my point is that ASCII isn't the only useful uniform-length subset of Unicode, just the most common one by far.

Well… good thing it isn’t actually that hard to hand-roll your own FixedLengthStr<N>.

#[derive(Copy, Clone)]
pub struct FixedLengthStr<const N: usize>([u8; N]);

impl<const N: usize> FixedLengthStr<N> {
    pub const fn from_bytes(bytes: [u8; N]) -> Option<Self> {
        if std::str::from_utf8(&bytes).is_ok() {
            Some(Self(bytes))
        } else {
            None
        }
    }
}

impl<const N: usize> std::ops::Deref for FixedLengthStr<N> {
    type Target = str;
    fn deref(&self) -> &str {
        unsafe { std::str::from_utf8_unchecked(&self.0) }
    }
}

macro_rules! s {
    ($e:literal) => {
        const {
            const __INPUT_EXPRESSION: &str = $e;
            FixedLengthStr::<{ __INPUT_EXPRESSION.len() }>::from_bytes(
                *__INPUT_EXPRESSION.as_bytes().first_chunk().unwrap(),
            )
            .unwrap()
        }
    };
}

fn _test() {
    let _a = s!("hi"); // inference works
    let _a: FixedLengthStr<2> = s!("hi"); // explicit type
    let _a: &'static FixedLengthStr<2> = &s!("hi"); // static lifetime
    let _a: &'static str = &s!("hi"); // coerces to str
}

Which results in quite nice assembly too, if used e.g. as:

fn circled_number(i: u8) -> Option<&'static str> {
    Some([
        s!("⓪"),
        s!("①"),
        s!("②"),
        s!("③"),
        s!("④"),
        s!("⑤"),
        s!("⑥"),
        s!("⑦"),
        s!("⑧"),
        s!("⑨"),
        s!("⑩"),
        s!("⑪"),
        s!("⑫"),
        s!("⑬"),
        s!("⑭"),
        s!("⑮"),
        s!("⑯"),
        s!("⑰"),
        s!("⑱"),
        s!("⑲"),
        s!("⑳"),
        s!("㉑"),
        s!("㉒"),
        s!("㉓"),
        s!("㉔"),
        s!("㉕"),
        s!("㉖"),
        s!("㉗"),
        s!("㉘"),
        s!("㉙"),
        s!("㉚"),
        s!("㉛"),
        s!("㉜"),
        s!("㉝"),
        s!("㉞"),
        s!("㉟"),
        s!("㊱"),
        s!("㊲"),
        s!("㊳"),
        s!("㊴"),
        s!("㊵"),
        s!("㊶"),
        s!("㊷"),
        s!("㊸"),
        s!("㊹"),
        s!("㊺"),
        s!("㊻"),
        s!("㊼"),
        s!("㊽"),
        s!("㊾"),
        s!("㊿"),
    ].get(i as usize)?)
}
fn main() {
    println!(
        "{}{}{}",
        circled_number(0).unwrap(),
        circled_number(34).unwrap(),
        circled_number(50).unwrap()
    );
}

Edit: I found out how to make (the unsafe version of) your code compile better, too: replacing usize::from(i + 1) * 3 with usize::from(i) * 3 + 3 did the trick!
Edit 2: Oh well, the minimal change is actually to replace usize::from(i + 1) with (usize::from(i) + 1). Why would that even make a difference?

usize::from(i + 1) has to wrap around in release mode if i is u8::MAX, because i + 1 is evaluated as a u8 before the conversion.
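
A small sketch of the difference between the two expressions. The function names are made up, and `wrapping_add` is used to spell out what `i + 1` does on a u8 when overflow checks are off (with overflow checks on, it panics instead).

```rust
/// End index as originally written, with the addition done in `u8`.
pub fn end_narrow(i: u8) -> usize {
    usize::from(i.wrapping_add(1)) * 3
}

/// End index with the addition done after widening to `usize`.
pub fn end_wide(i: u8) -> usize {
    (usize::from(i) + 1) * 3
}
```

The two agree for every input except u8::MAX, where the narrow form yields 0 and the wide form yields 768.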

Yes, but in this case it's only in a i <= 50 branch.

I actually have had this thought myself already, and even tried many variants: all kinds of additional asserts, unreachable_unchecked, even making the >50 case fully UB for the whole function to begin with. None of it seems to have had the same effect that moving the +1 out to the usize had.