String implementation

I was reading through the Rust book on strings, and it says that strings are a Vec of u8, which is why they can't be indexed. However, there is a UTF char type, so I'm wondering: why aren't Rust strings a Vec of char?

First, char is four bytes long, while the most commonly used characters are one or two bytes long in UTF-8. Therefore, for the vast majority of cases, storing strings as Vec<char> would be wasteful.

Second, char is rarely the unit of text you actually want to work with; usually either the byte or the grapheme cluster is, and grapheme segmentation is probably not something one would want to pull into the standard library anyway.
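To put rough numbers on the first point, here is a small comparison (my own sketch, not from the original reply): a Vec<char> pays four bytes per codepoint, while UTF-8 pays a single byte for ASCII.

use std::mem::size_of;

fn main() {
    // Mostly-ASCII text: UTF-8 needs 1 byte per character here,
    // while Vec<char> always needs 4 bytes per element.
    let s = "Hello, world!";
    let as_chars: Vec<char> = s.chars().collect();

    println!("UTF-8 bytes in the str:     {}", s.len());                            // 13
    println!("size_of::<char>():          {}", size_of::<char>());                  // 4
    println!("payload bytes in Vec<char>: {}", as_chars.len() * size_of::<char>()); // 52
}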


If you want performance, then you use an abstract type and hide its internal implementation (so you can do a lot of optimizations inside it, iteratively):

Because that actually makes a bunch of things slower, as even non-English text is shorter in UTF-8 than in UTF-32 (which is what "vec of char" would be).
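A quick check of that claim (my addition, not part of the reply) for a few non-English strings:

fn main() {
    // Cyrillic and Japanese need 2-3 bytes per codepoint in UTF-8,
    // which is still less than the 4 bytes per element a Vec<char> would use.
    for s in ["Здравствуйте", "こんにちは", "Grüße"] {
        let utf8_bytes = s.len();                // bytes as stored in a &str (UTF-8)
        let utf32_bytes = s.chars().count() * 4; // bytes a Vec<char> would need
        println!("{s}: {utf8_bytes} bytes in UTF-8 vs {utf32_bytes} bytes as Vec<char>");
    }
}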

And it doesn't actually solve the problems you're thinking about anyway -- you can see some examples in Some traits to ease the use of Vec of chars as a String - #5 by scottmcm -- because "one char" isn't the useful unit to process anyway.

Obligatory link to https://utf8everywhere.org/


char is a bad name for what it is. It's a codepoint, and it can represent different things, e.g. pairs of letters (ligatures), special control codes (text direction, symbol style), an accent that modifies the previous letter, or a small fragment of a multi-codepoint emoji.

So having a Vec of what are potentially fragments of symbols that don't mean anything in isolation isn't useful for processing "characters" either.
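A minimal example (my own) of such a fragment: the "é" below is built from two codepoints, and the second one, a combining accent, means nothing in isolation.

fn main() {
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT renders as "é",
    // but neither codepoint alone is the character the reader perceives.
    let s = "e\u{301}";
    println!("{s} is made of {} chars:", s.chars().count());
    for c in s.chars() {
        println!("  {:?} (U+{:04X})", c, c as u32);
    }
}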

If I understand it right, then "UTF" is short for "UCS Transformation Format", which, in turn, is short for "Universal Coded Character Set Transformation Format". Thus "UTF" specifies a particular format to encode elements of the character set.

The documentation of the char type says:

The char type represents a single character. More specifically, since ‘character’ isn’t a well-defined concept in Unicode, char is a ‘Unicode scalar value’.

If I'm not mistaken, there is no specification that UTF-32 is being used. Thus I wouldn't see char as a "UTF" type (strictly speaking). Maybe it's kind of implied, though? The documentation later mentions that 4 bytes are being used:

char is always four bytes in size. This is a different representation than a given character would have as part of a String.

So in my understanding, char is 4 bytes and somehow represents a Unicode scalar value. Actually, endianness can differ between platforms (which results in a different representation on different platforms when viewing the memory as bytes). str, in contrast, is guaranteed to use a particular representation (UTF-8).
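A small sketch (mine, not from the documentation) that shows both points: the four in-memory bytes of a char, whose order depends on the platform's endianness, versus the UTF-8 bytes the same character has inside a str, which are identical on every platform.

use std::mem::size_of;

fn main() {
    assert_eq!(size_of::<char>(), 4);

    let c = 'é'; // U+00E9
    // In-memory bytes of the scalar value; their order depends on endianness.
    println!("char as u32, native-endian bytes: {:?}", (c as u32).to_ne_bytes());
    // UTF-8 bytes inside a str are the same on every platform.
    println!("UTF-8 bytes in a str:             {:?}", "é".as_bytes());
}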

Here is an example of how several codepoints may represent a single glyph:

fn main() {
    let m = "\u{1F46E}\u{200D}\u{2642}"; // POLICE OFFICER + ZERO WIDTH JOINER + MALE SIGN
    let x = "\u{1F46E}";                 // POLICE OFFICER on its own
    let f = "\u{1F46E}\u{200D}\u{2640}"; // POLICE OFFICER + ZERO WIDTH JOINER + FEMALE SIGN
    println!("{m} ({} chars and {} bytes)", m.chars().count(), m.len());
    println!("{x} ({} chars and {} bytes)", x.chars().count(), x.len());
    println!("{f} ({} chars and {} bytes)", f.chars().count(), f.len());
}

(Playground)

Output:

👮‍♂ (3 chars and 10 bytes)
👮 (1 chars and 4 bytes)
👮‍♀ (3 chars and 10 bytes)

(Depending on your font, you might need to look closely to see the difference.)

👮‍♂️ (man police officer)
👮 (police officer)
👮‍♀️ (woman police officer)


If we cut one of these strings at the codepoint boundaries, we get this:

fn main() {
    let mut buf = [0u8; 4]; // scratch space for encode_utf8 (a char needs at most 4 bytes)
    let m = "\u{1F46E}\u{200D}\u{2642}"; // POLICE OFFICER + ZERO WIDTH JOINER + MALE SIGN
    println!("{m} ({} chars and {} bytes)", m.chars().count(), m.len());
    println!("First `char`: '{}'",
        m.chars().next().unwrap().encode_utf8(&mut buf)
    );
    println!("Second `char`: '{}'",
        m.chars().nth(1).unwrap().encode_utf8(&mut buf)
    );
    println!("Third `char`: '{}'",
        m.chars().nth(2).unwrap().encode_utf8(&mut buf)
    );
}

(Playground)

Output:

👮‍♂ (3 chars and 10 bytes)
First `char`: '👮'
Second `char`: '‍'
Third `char`: '♂'

Where the second "char" is the Zero-width joiner.

There are other weird characters like that, such as soft (optional) hyphens, byte-order marks (BOM), combining accents, and so on.

My point is: processing Unicode is always difficult and requires a lot of care. Indexing by "Unicode scalar values" or codepoints doesn't make sense in all cases (you might suddenly see symbols like ♂, as in the above example, which are invisible otherwise).
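If you do need user-perceived characters, the usual approach is to iterate over grapheme clusters with an external crate. A hedged sketch, assuming the unicode-segmentation crate is added as a dependency:

// Requires the external `unicode-segmentation` crate (not part of std).
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let m = "\u{1F46E}\u{200D}\u{2642}"; // the police-officer ZWJ sequence from above
    println!("chars:     {}", m.chars().count());         // 3 codepoints
    println!("graphemes: {}", m.graphemes(true).count()); // 1 user-perceived character
}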


More things that can go wrong when cutting Unicode strings into pieces:

fn main() {
    // U+202E is RIGHT-TO-LEFT OVERRIDE and U+202C is POP DIRECTIONAL FORMATTING,
    // so s1 contains the letters of "prime" but renders as "emirp".
    let s1 = "\u{202E}prime\u{202C}";
    // s2 drops the override, so it renders left-to-right as "prime".
    let s2 = &s1[
        s1.char_indices().nth(1).unwrap().0..
        s1.char_indices().nth(6).unwrap().0
    ];
    // s3 keeps the override but drops the "pop", so the direction change
    // leaks into the surrounding text.
    let s3 = &s1[
        0..
        s1.char_indices().nth(6).unwrap().0
    ];
    println!("s1 = {s1}");
    println!("s2 = {s2}");
    println!("s3 = {s3}");
    println!();
    println!("According to Wikipedia, an {s1} is a {s2} number that results in a different {s2} when its decimal digits are reversed.");
    println!();
    println!("But we must be careful when playing with text directionality:");
    println!("According to Wikipedia, an {s3} is a {s2} number that results in a different {s2} when its decimal digits are reversed.");
}

(Playground)

(Proper display when opening the playground may depend on your browser.)

