str is a primitive type in Rust. String literals in Rust are &str by default.
Is there any way to construct a raw str? Or is a raw str meaningful in Rust?
str is an unsized type, which means you can't hold one directly by value. You can only ever have it behind some kind of pointer (such as &str or Box<str>). The backing data, however, could live anywhere, in sized or unsized storage; the only requirement is that it's a valid sequence of UTF-8 bytes.
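As a small sketch of the point above: the same text can sit behind several pointer types, and each of them is a "fat" pointer carrying an address plus a byte length. The function names here are made up for illustration.

```rust
use std::mem::size_of;
use std::rc::Rc;

/// Returns the same text behind three different pointer types.
fn three_owners(text: &str) -> (&str, Box<str>, Rc<str>) {
    (text, Box::from(text), Rc::from(text))
}

/// A pointer to `str` is "fat": it carries an address and a byte length.
fn str_pointers_are_fat() -> bool {
    size_of::<&str>() == 2 * size_of::<usize>()
        && size_of::<Box<str>>() == 2 * size_of::<usize>()
}
```

None of these types is a bare str value; each one only points at the UTF-8 bytes.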
A metaphor I read somewhere is that str and [T] are a bit like the water in a glass. They can be kept in different containers, but need at least something to hold them. str is typically stored in a String or in static memory, and [T] is typically the content of an array or Vec<T>. They are the sequences of data inside those containers, not containers on their own.
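To make the metaphor concrete, here is a minimal sketch that borrows a str from three different "glasses": static memory, a heap-allocated String, and a stack array. The function name is hypothetical.

```rust
/// Borrows a `str` from three kinds of backing storage and returns the
/// byte length seen through each borrow.
fn lengths_from_three_containers() -> [usize; 3] {
    // 1. Static memory: the literal's bytes live in the binary.
    let from_static: &'static str = "water";

    // 2. Heap: a `String` owns a growable buffer.
    let owned = String::from("water");
    let from_heap: &str = owned.as_str();

    // 3. Stack: a plain byte array, checked for UTF-8 validity when borrowed.
    let buffer: [u8; 5] = *b"water";
    let from_stack: &str = std::str::from_utf8(&buffer).expect("ASCII is valid UTF-8");

    [from_static.len(), from_heap.len(), from_stack.len()]
}
```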
Is there a "sized array of UTF-8-encoded bytes" type, in the analogy "??? is to str as [u8; N] is to [u8]"?
Not natively in std, because it turns out that such a type isn't very useful. Since UTF-8 is a variable-length encoding, even properly generalized uppercasing may need to add or remove characters. You'd have to determine the size of the container in bytes, not characters, which is a little counterintuitive.
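A quick sketch of why a fixed byte size is awkward: uppercasing alone can change both the number of characters and the number of bytes. The helper name is made up for illustration.

```rust
/// Returns `(bytes, chars)` for `s` before and after uppercasing.
fn upper_stats(s: &str) -> ((usize, usize), (usize, usize)) {
    let upper = s.to_uppercase();
    (
        (s.len(), s.chars().count()),
        (upper.len(), upper.chars().count()),
    )
}
```

For example, "ß" uppercases to "SS" (one character becomes two), while the dotless "ı" uppercases to "I" (two bytes become one).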
That said, the arrayvec crate provides an ArrayString type. Others have also done the same, though they aren't as popular.
I think such a type would be useful for string literals
Just adding on to what @Kyllingene said:
Not only is UTF-8 a variable-width encoding, Unicode itself is variable-width for what humans would consider characters (due to combining code points, emoji joiners, and so on).
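As a rough illustration, the sketch below counts bytes and code points for strings that a reader would each perceive as a single character. The function name is hypothetical, and std has no grapheme-cluster API, so only bytes and chars are counted.

```rust
/// Counts the bytes and the code points (`char`s) of a string.
fn byte_and_char_counts(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}
```

Here "\u{e9}" (precomposed é) is 2 bytes and 1 code point, "e\u{301}" (e plus a combining accent) is 3 bytes and 2 code points, and "\u{1F469}\u{200D}\u{1F4BB}" (two emoji joined by a zero-width joiner) is 11 bytes and 3 code points.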
But std will probably get an ASCII analog to char someday, and one could use [ascii::Char; N] where applicable.
What extra utility over &str are you thinking of?
Also note that it can't be [Utf8Byte; N], because then I could overwrite one Utf8Byte with another that results in invalid UTF-8 overall. So it'd have to be some opaque FixedLengthStr<N> that only offers the same sort of access that a str offers (&mut str or &[u8], etc., but no &mut [u8]).
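For comparison, this is the kind of mutation that &mut str already permits safely: in-place changes that cannot alter the byte length or break UTF-8 validity. The function name is made up; make_ascii_uppercase is a std method on str, and str::as_bytes_mut is unsafe for exactly the reason given above.

```rust
/// Uppercases the ASCII letters of `s` in place through `&mut str`.
/// Non-ASCII characters are left untouched, so the byte length and the
/// UTF-8 validity of the string are preserved.
fn shout_ascii(s: &mut str) {
    s.make_ascii_uppercase();
}
```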
True string literals ("xyz") are stored in the binary, so you don't need to declare a fixed-length str for them.
Do you mean you want to store various str values that have a maximum length in fixed-size buffers? If not, please describe what you want to do.
One use is for when you have a Vec<&'static str> with a bunch of short, equal-length strings. Each &'static str is two pointer-sized words (an address and a length), but the elements of a Vec<FixedSizeStr<N>> could be much smaller. They're also contiguous in memory and less indirect.
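A back-of-the-envelope sketch of that size difference, using [u8; 3] as a stand-in for a hypothetical 3-byte FixedSizeStr<3> (the function name is also made up):

```rust
use std::mem::size_of;

/// Bytes used by the element storage of `n` entries, for each representation.
fn element_bytes(n: usize) -> (usize, usize) {
    // Each `&'static str` stores an address and a length.
    let as_refs = n * size_of::<&'static str>();
    // A 3-byte fixed-size string would store only its bytes inline.
    let as_arrays = n * size_of::<[u8; 3]>();
    (as_refs, as_arrays)
}
```

On a 64-bit target, 51 entries take 816 bytes of references (plus the string data they point to) versus 153 bytes inline.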
Yes, and the obvious solution is to use Vec<u8> and do the conversions between &[u8] and &str. But I don't know if that's what the OP needs.
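A minimal sketch of that Vec<u8> approach, assuming all strings are exactly N bytes long. FlatStrings and its methods are hypothetical names, not an existing API.

```rust
/// Hypothetical helper: stores equal-length strings back to back in one buffer.
struct FlatStrings<const N: usize> {
    bytes: Vec<u8>,
}

impl<const N: usize> FlatStrings<N> {
    fn new() -> Self {
        Self { bytes: Vec::new() }
    }

    /// Appends `s` if it is exactly `N` bytes long; returns whether it was stored.
    fn push(&mut self, s: &str) -> bool {
        if s.len() != N {
            return false;
        }
        self.bytes.extend_from_slice(s.as_bytes());
        true
    }

    /// Returns the `i`-th string, re-validating UTF-8 on the way out.
    fn get(&self, i: usize) -> Option<&str> {
        let start = i.checked_mul(N)?;
        let end = start.checked_add(N)?;
        std::str::from_utf8(self.bytes.get(start..end)?).ok()
    }
}
```

Because every stored entry is a complete N-byte string, each N-byte chunk starts and ends on a character boundary, so the validation in get is a safety net rather than a necessity.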
I don't have a specific use in mind, since str::len is const, but it seems weird that b"hello" is a narrow &[u8; 5] but "hello" is a wide &str. I can either express in the type system that a string has a particular length in bytes, or that it's valid UTF-8, but not both.
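A short sketch of the asymmetry: the byte-string literal gets an array type directly, while the string literal's length has to be recovered through the const str::len. The constant and function names are made up for illustration.

```rust
/// The byte length of a string literal is available at compile time...
const HELLO: &str = "hello";
const HELLO_LEN: usize = HELLO.len();

/// ...so its bytes can be copied into an array whose type records that
/// length, but the array type no longer says the bytes are valid UTF-8.
fn hello_bytes() -> [u8; HELLO_LEN] {
    let mut out = [0u8; HELLO_LEN];
    out.copy_from_slice(HELLO.as_bytes());
    out
}

/// A byte-string literal already has an array type.
const HELLO_ARRAY: &[u8; 5] = b"hello";
```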
That's correct and I don't know what else to add. (You said you thought it would be useful for string literals, that's the only reason I asked.)
When do you realistically have this in practice, though, unless it’s because they’re all ASCII-only?
Yeah, never. [ascii::Char; N] would work, stability permitting.
Well, hypothetically, an application might be generating text using a specific hardcoded group of characters that doesn't span multiple UTF-8 lengths and is not ASCII — for example, digits other than the Arabic digits in ASCII, or one of the variations like ⓪①②③… (which is not encoded in numeric sequence!), or any other set of characters that is used in a systematic fashion — most such groups will be all 2, all 3, or all 4 bytes.
These could of course be encoded to UTF-8 from a [char], but one might want to skip the UTF-8 encoding step. It’s unlikely that such a text generator would need to be optimized heavily for speed, but one might want to optimize it for code size, especially if, say, it's a library that will go into WASM applications which often won’t use all of its capabilities, so the overhead of its support for particular outputs should be kept small.
This is a set of requirements that won’t commonly appear together, and certainly not so often as to deserve language syntax support, but none of them is individually unrealistic.
That said, in that special case, it would not be hard to implement by slicing a plain string; it has a requirement about what characters appear in the string that isn't statically checked, but that could be done with a const assertion or unit test.
fn circled_number(i: u8) -> Option<&'static str> {
    // Every character in this string is 3 bytes long.
    let s = "\
        ⓪①②③④⑤⑥⑦⑧⑨\
        ⑩⑪⑫⑬⑭⑮⑯⑰⑱⑲\
        ⑳㉑㉒㉓㉔㉕㉖㉗㉘㉙\
        ㉚㉛㉜㉝㉞㉟㊱㊲㊳㊴\
        ㊵㊶㊷㊸㊹㊺㊻㊼㊽㊾㊿";
    // Once this is tested you can use unsafe { get_unchecked() }
    // and skip the UTF-8 boundary check
    s.get(usize::from(i) * 3..usize::from(i + 1) * 3)
}

fn main() {
    println!(
        "{}{}{}",
        circled_number(0).unwrap(),
        circled_number(34).unwrap(),
        circled_number(50).unwrap()
    );
}
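The const assertion mentioned above could be sketched roughly as follows. The function and constant names are hypothetical, and the table is shortened for brevity. Since a str is always valid UTF-8, it is enough to check that every third byte is the lead byte of a 3-byte sequence.

```rust
/// Returns true if `s` consists solely of 3-byte UTF-8 sequences.
const fn all_three_byte_chars(s: &str) -> bool {
    let bytes = s.as_bytes();
    if bytes.len() % 3 != 0 {
        return false;
    }
    let mut i = 0;
    while i < bytes.len() {
        // The lead byte of a 3-byte sequence has the bit pattern 1110xxxx.
        if bytes[i] & 0xF0 != 0xE0 {
            return false;
        }
        i += 3;
    }
    true
}

// Shortened stand-in for the full lookup string.
const CIRCLED_SAMPLE: &str = "⓪①②③④⑤⑥⑦⑧⑨";

// Compilation fails if the table ever contains a character of another width.
const _: () = assert!(all_three_byte_chars(CIRCLED_SAMPLE));
```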
In the get_unchecked form (with an input range check, so only assuming the validity of s, not being unsound), this compiles down to just 44 bytes of x86-64 code and 153 bytes of lookup table, for a total of 197 bytes, which is slightly smaller than the equivalent [char; 51] that also requires UTF-8 encoding when it’s used!
But getting back on topic after that fun tangent on code size micro-optimization: my point is that ASCII isn't the only useful uniform-length subset of Unicode, just the most common one by far.
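For reference, the get_unchecked form described above might look roughly like this. It is a sketch, not the exact code that was measured: the names are made up, and the end of the range is written as start + 3.

```rust
const CIRCLED: &str = "\
    ⓪①②③④⑤⑥⑦⑧⑨\
    ⑩⑪⑫⑬⑭⑮⑯⑰⑱⑲\
    ⑳㉑㉒㉓㉔㉕㉖㉗㉘㉙\
    ㉚㉛㉜㉝㉞㉟㊱㊲㊳㊴\
    ㊵㊶㊷㊸㊹㊺㊻㊼㊽㊾㊿";

// 51 characters of 3 bytes each.
const _: () = assert!(CIRCLED.len() == 51 * 3);

fn circled_number_unchecked(i: u8) -> Option<&'static str> {
    // The input range check keeps the function sound for every `i`.
    if i > 50 {
        return None;
    }
    let start = usize::from(i) * 3;
    // SAFETY: `i <= 50`, so `start + 3 <= 153 == CIRCLED.len()`, and every
    // character in `CIRCLED` is assumed (and should be tested) to be exactly
    // 3 bytes long, so both ends fall on character boundaries.
    Some(unsafe { CIRCLED.get_unchecked(start..start + 3) })
}
```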
Well… good thing it isn’t actually that hard to hand-roll your own FixedLengthStr<N>.
#[derive(Copy, Clone)]
pub struct FixedLengthStr<const N: usize>([u8; N]);

impl<const N: usize> FixedLengthStr<N> {
    /// Wraps `bytes` if they are valid UTF-8.
    pub const fn from_bytes(bytes: [u8; N]) -> Option<Self> {
        if std::str::from_utf8(&bytes).is_ok() {
            Some(Self(bytes))
        } else {
            None
        }
    }
}

impl<const N: usize> std::ops::Deref for FixedLengthStr<N> {
    type Target = str;
    fn deref(&self) -> &str {
        // SAFETY: `from_bytes` is the only public constructor and it validates UTF-8.
        unsafe { std::str::from_utf8_unchecked(&self.0) }
    }
}

macro_rules! s {
    ($e:literal) => {
        const {
            const __INPUT_EXPRESSION: &str = $e;
            FixedLengthStr::<{ __INPUT_EXPRESSION.len() }>::from_bytes(
                *__INPUT_EXPRESSION.as_bytes().first_chunk().unwrap(),
            )
            .unwrap()
        }
    };
}

fn _test() {
    let _a = s!("hi"); // inference works
    let _a: FixedLengthStr<2> = s!("hi"); // explicit type
    let _a: &'static FixedLengthStr<2> = &s!("hi"); // static lifetime
    let _a: &'static str = &s!("hi"); // coerces to str
}
Which results in quite nice assembly, too, if used e.g. as:
fn circled_number(i: u8) -> Option<&'static str> {
    Some([
        s!("⓪"),
        s!("①"),
        s!("②"),
        s!("③"),
        s!("④"),
        s!("⑤"),
        s!("⑥"),
        s!("⑦"),
        s!("⑧"),
        s!("⑨"),
        s!("⑩"),
        s!("⑪"),
        s!("⑫"),
        s!("⑬"),
        s!("⑭"),
        s!("⑮"),
        s!("⑯"),
        s!("⑰"),
        s!("⑱"),
        s!("⑲"),
        s!("⑳"),
        s!("㉑"),
        s!("㉒"),
        s!("㉓"),
        s!("㉔"),
        s!("㉕"),
        s!("㉖"),
        s!("㉗"),
        s!("㉘"),
        s!("㉙"),
        s!("㉚"),
        s!("㉛"),
        s!("㉜"),
        s!("㉝"),
        s!("㉞"),
        s!("㉟"),
        s!("㊱"),
        s!("㊲"),
        s!("㊳"),
        s!("㊴"),
        s!("㊵"),
        s!("㊶"),
        s!("㊷"),
        s!("㊸"),
        s!("㊹"),
        s!("㊺"),
        s!("㊻"),
        s!("㊼"),
        s!("㊽"),
        s!("㊾"),
        s!("㊿"),
    ].get(i as usize)?)
}

fn main() {
    println!(
        "{}{}{}",
        circled_number(0).unwrap(),
        circled_number(34).unwrap(),
        circled_number(50).unwrap()
    );
}
Edit: I found out how to make (the unsafe version of) your code compile better, too: replacing usize::from(i + 1) * 3 with usize::from(i) * 3 + 3 did the trick!
Edit2: Oh well, the minimal change is actually to replace usize::from(i + 1) with (usize::from(i) + 1). Why would that even make a difference?
usize::from(i + 1) has to wrap around in release mode if i is u8::MAX, because the addition happens in u8 before the conversion.
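A small sketch of the difference between the two orderings. wrapping_add is used here to model what i + 1 does when overflow checks are disabled (with overflow checks enabled, the plain addition would panic instead); the function names are made up.

```rust
/// Models `usize::from(i + 1)` as compiled without overflow checks:
/// the addition happens in `u8` and wraps at 255.
fn add_then_widen(i: u8) -> usize {
    usize::from(i.wrapping_add(1))
}

/// Models `usize::from(i) + 1`: widening first means the addition
/// cannot overflow for any `u8` input.
fn widen_then_add(i: u8) -> usize {
    usize::from(i) + 1
}
```

The two agree for every input except 255, where the first yields 0 and the second yields 256.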
Yes, but in this case it's only in an i <= 50 branch.
I actually have had this thought myself already, and even tried many variants, e.g. all kinds of additional asserts, even unreachable_unchecked, even making the >50 case fully UB for the whole function to begin with, and none of it seems to have had the same effect that moving the +1 out to the usize had.