I was reading through the Rust book on strings, and it says that strings are a `Vec` of `u8` and that is why they can't be indexed. However, there is a UTF `char` type, so I'm wondering: why aren't Rust strings a `Vec` of `char`?
First, `char` is four bytes long, and the most commonly used characters are one or two bytes long. Therefore, for the vast majority of cases, storing strings as `Vec<char>` will be wasteful.
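To put rough numbers on that, here's a quick sketch (it only counts the text payload, not the `Vec`/`String` bookkeeping):

fn main() {
    let s = "hello, world";
    // In UTF-8, ASCII text takes one byte per character.
    println!("as &str:      {} bytes of text", s.len());
    // A `char` is always four bytes, so the same text stored as Vec<char>
    // needs four times as much memory for its elements.
    let v: Vec<char> = s.chars().collect();
    println!("as Vec<char>: {} bytes of text", v.len() * std::mem::size_of::<char>());
}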
Second, `char` is not very often what you want to use as a unit of text - either the byte or the grapheme cluster is, and the latter is probably not something one would want to pull into the standard library anyway.
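If you do need grapheme clusters, the usual route is an external crate; here's a minimal sketch assuming the unicode-segmentation crate is added to Cargo.toml:

use unicode_segmentation::UnicodeSegmentation;

fn main() {
    // A single family emoji made of five codepoints: man + ZWJ + woman + ZWJ + girl.
    let s = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}";
    println!("bytes:     {}", s.len());                   // 18
    println!("chars:     {}", s.chars().count());         // 5
    println!("graphemes: {}", s.graphemes(true).count()); // 1
}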
If you want performance, then you use an abstract type and hide its internal implementation (so you can do a lot of optimizations inside, iteratively).
Because that actually makes a bunch of things slower, as even non-English text is shorter in UTF-8 than in UTF-32 (which is what "Vec of char" would be).
And it doesn't actually solve the problems you're thinking about anyway -- you can see some examples in Some traits to ease the use of Vec of chars as a String - #5 by scottmcm -- because "one `char`" isn't the useful unit to process anyway.
Obligatory link to https://utf8everywhere.org/
`char` is a bad name for what it is. It's a codepoint, and it can represent different things, e.g. pairs of letters (ligatures), special control codes (text direction, symbol style), an accent that modifies the previous letter, or a small fragment of a multi-codepoint emoji.
So having a `Vec` of what are potentially fragments of symbols that don't mean anything in isolation is not useful for processing "characters" either.
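For example (a small sketch), one of the codepoints below is just an accent that modifies the previous letter:

fn main() {
    // 'a' followed by U+0308 COMBINING DIAERESIS: two codepoints, usually displayed as "ä".
    let s = "a\u{0308}";
    println!("{s}: {} chars, {} bytes", s.chars().count(), s.len());
    // The second `char` is only a fragment; on its own it doesn't mean anything.
    for c in s.chars() {
        println!("U+{:04X} -> '{c}'", c as u32);
    }
}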
If I understand it right, then "UTF" is short for "UCS Transformation Format", which, in turn, is short for "Universal Coded Character Set Transformation Format". Thus "UTF" specifies a particular format to encode elements of the character set.
The documentation of the `char` type says:
The `char` type represents a single character. More specifically, since 'character' isn't a well-defined concept in Unicode, `char` is a 'Unicode scalar value'.
If I'm not mistaken, there is no specification on UTF-32 being used. Thus I wouldn't see `char` as a "UTF" type (strictly speaking). Maybe it's kinda implied though? The documentation later mentions that 4 bytes are being used:
`char` is always four bytes in size. This is a different representation than a given character would have as part of a `String`.
So in my understanding, `char` is 4 bytes and somehow represents a Unicode scalar value. Actually, endianness can differ on different platforms (which results in a different representation on different platforms when viewing the memory as bytes). `str`, in contrast, is guaranteed to use a particular representation (UTF-8).
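A small sketch of that difference between a `char` in memory and the same character inside a `str`:

fn main() {
    let c = 'é'; // U+00E9
    // A `char` always occupies four bytes in memory, regardless of the character.
    assert_eq!(std::mem::size_of::<char>(), 4);
    // Inside a `str`, the same character is encoded as two bytes of UTF-8.
    assert_eq!(c.len_utf8(), 2);
    assert_eq!("é".len(), 2);
    // The value of the `char` is simply the Unicode scalar value.
    assert_eq!(c as u32, 0x00E9);
    println!("all assertions passed");
}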
Here is an example of how several codepoints may represent a single glyph:
fn main() {
    // U+1F46E (police officer) + U+200D (zero-width joiner) + U+2642 (male sign)
    let m = "\u{1F46E}\u{200D}\u{2642}";
    // U+1F46E (police officer) on its own
    let x = "\u{1F46E}";
    // U+1F46E (police officer) + U+200D (zero-width joiner) + U+2640 (female sign)
    let f = "\u{1F46E}\u{200D}\u{2640}";
    println!("{m} ({} chars and {} bytes)", m.chars().count(), m.len());
    println!("{x} ({} chars and {} bytes)", x.chars().count(), x.len());
    println!("{f} ({} chars and {} bytes)", f.chars().count(), f.len());
}
Output:
👮‍♂ (3 chars and 10 bytes)
👮 (1 chars and 4 bytes)
👮‍♀ (3 chars and 10 bytes)
(Depending on your font, you might need to look closely to see the difference.)
If we cut one of these strings at the codepoint boundaries, we get this:
fn main() {
    let mut buf = [0u8; 4];
    let m = "\u{1F46E}\u{200D}\u{2642}";
    println!("{m} ({} chars and {} bytes)", m.chars().count(), m.len());
    // Print each codepoint separately by re-encoding it as UTF-8 into `buf`.
    println!("First `char`: '{}'",
        m.chars().next().unwrap().encode_utf8(&mut buf)
    );
    println!("Second `char`: '{}'",
        m.chars().nth(1).unwrap().encode_utf8(&mut buf)
    );
    println!("Third `char`: '{}'",
        m.chars().nth(2).unwrap().encode_utf8(&mut buf)
    );
}
Output:
👮‍♂ (3 chars and 10 bytes)
First `char`: '👮'
Second `char`: '‍'
Third `char`: '♂'
Where the second `char` is the zero-width joiner.
There are other weird characters like that, such as soft hyphens (optional hyphens), byte-order marks (BOM), modifying accents, etc.
My point is: processing Unicode is always difficult and requires a lot of care. Indexing the "Unicode scalar values" or codepoints doesn't make sense in all cases (as you might end up suddenly seeing symbols, like the ♂ in the above example, which are invisible otherwise).
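As a small illustration of one of those, a soft hyphen (U+00AD) is normally invisible but still counts as a codepoint and as bytes:

fn main() {
    // "co" + SOFT HYPHEN (U+00AD) + "operate": normally rendered as "cooperate";
    // the hyphen only shows up if the word is broken across lines.
    let s = "co\u{00AD}operate";
    println!("{s}: {} chars, {} bytes", s.chars().count(), s.len());
    // Comparing against the visually identical string without the soft hyphen fails.
    assert_ne!(s, "cooperate");
}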
More things that can go wrong when cutting Unicode strings into pieces:
fn main() {
    // U+202E is RIGHT-TO-LEFT OVERRIDE, U+202C is POP DIRECTIONAL FORMATTING.
    let s1 = "\u{202E}prime\u{202C}";
    // s2 is just "prime", without the directional control characters.
    let s2 = &s1[
        s1.char_indices().nth(1).unwrap().0..
        s1.char_indices().nth(6).unwrap().0
    ];
    // s3 keeps the right-to-left override but drops the closing control character.
    let s3 = &s1[
        0..
        s1.char_indices().nth(6).unwrap().0
    ];
    println!("s1 = {s1}");
    println!("s2 = {s2}");
    println!("s3 = {s3}");
    println!();
    println!("According to Wikipedia, an {s1} is a {s2} number that results in a different {s2} when its decimal digits are reversed.");
    println!();
    println!("But we must be careful when playing with text directionality:");
    println!("According to Wikipedia, an {s3} is a {s2} number that results in a different {s2} when its decimal digits are reversed.");
}
(Proper display when opening the playground may depend on your browser.)