How to return static str?

Yes, I agree with you, but my point was if I decide to allocate 10 characters in utf-8 (not 10 bytes) I would describe it rather as [u32; 10] than [u8; 40]. So the workaround as StringArray<40> is not user/programmer friendly in my opinion. I would prefer StringArrayUtf8<10> instead.

[u32; 10] would be a UTF-32 encoding, not UTF-8. If you want to use UTF-32, you can also use [char; 10].

If you want a stack-allocated UTF-32 encoded string of up to 10 characters, you can also use ArrayVec<char, 10>.
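
For instance, a minimal sketch, assuming the arrayvec crate (ArrayVec comes from that crate, not std):

use arrayvec::ArrayVec;

fn main() {
    // Stack-allocated buffer holding at most 10 code points.
    let mut name: ArrayVec<char, 10> = ArrayVec::new();
    for c in "OSTROŁĘKA".chars() {
        // try_push returns an Err instead of panicking at capacity.
        name.try_push(c).expect("more than 10 code points");
    }
    // Collect back into a heap String when you need to display it.
    let s: String = name.iter().collect();
    println!("{s} ({} code points)", name.len());
}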

Is there a specific reason you want to limit the length of your strings in UTF-32 encoding specifically, rather than the UTF-8 length?

6 Likes

Every character in UTF-8 has a variable length, up to 4 bytes. "A" takes 1 byte, "Ą" takes 2 bytes, "汉" takes 3 bytes. That is why my example (Rust Playground) didn't work: it required 11 bytes:
O - 1 byte
S - 1 byte
T - 1 byte
R - 1 byte
O - 1 byte
Ł - 2 bytes
Ę - 2 bytes
K - 1 byte
A - 1 byte
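
These counts can be checked with std alone:

fn main() {
    assert_eq!("OSTROŁĘKA".len(), 11);          // UTF-8 bytes
    assert_eq!("OSTROŁĘKA".chars().count(), 9); // code points
    assert_eq!('Ł'.len_utf8(), 2);              // per-char UTF-8 width
    assert_eq!('汉'.len_utf8(), 3);
}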

So when I build an app, there is a user who wants to enter their name; it could be "John" or "李祖阳". So in my case I would allocate [u32; 10] for the name field. The field has client-side validation against the length as well (max 10 characters).

So the StringArray should respect the length in characters (not bytes), because it is an array of characters, not bytes, which is what @alice pointed out ([char; 10]).

ArrayString represents strings internally in UTF-8. So the representation is actually an array of UTF-8 bytes, not chars. This works well with the standard library, because the Rust standard library has support for UTF-8 strings: the built-in str type represents strings in UTF-8, and you can easily convert between str, ArrayString, String, etc.
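
For example, a rough sketch with arrayvec's ArrayString (the capacity parameter counts UTF-8 bytes; error handling collapsed to an expect):

use arrayvec::ArrayString;

fn main() {
    // Up to 40 UTF-8 bytes, stored inline on the stack.
    let mut name = ArrayString::<40>::new();
    name.try_push_str("OSTROŁĘKA").expect("does not fit in 40 bytes");

    // ArrayString derefs to str, so std string APIs just work.
    let as_str: &str = &name;
    let as_string: String = name.to_string();
    println!("{as_str} / {as_string}");
}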

You can still perform additional validation if for some reason you want to make sure the number of code points is at most 10, for instance:

if s.chars().count() > 10 {
    // tell user their name is too long
}

40 UTF-8 bytes can store strictly more names than just those limited to 10 code points, so I'm not sure you want such validation. Why reject a name such as "abcdefghijklm" if it fits in your 40 bytes of memory? But you can if you want to.
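
Concretely (std only), a name can pass the 40-byte check while failing a 10-code-point check:

fn main() {
    let name = "abcdefghijklm";
    assert!(name.len() <= 40);          // fits the 40-byte buffer
    assert!(name.chars().count() > 10); // but exceeds 10 code points
}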

But if for some reason you insist your internal representation must be in terms of code points, i.e. UTF-32, then like I said, you can use ArrayVec<char, 10>. You'll get less support from the standard library but you can do this.

3 Likes

I must describe it in the API documentation and explicitly state the max length. A second reason: it could be a UI issue if the name comes in longer than what the design allows.
Sorry, but this goes in a direction where nobody knows how the program actually works, and that always ends in issues and errors. I cannot program that way.

It is, just use String. So far you have not given a proper reason why you're avoiding heap allocation; it has negligible performance cost. Unless you're running your app on bare metal, you should just use String.

I, personally, am not aware of any language other than C/C++ that has stack-allocated strings, and that's because they used ASCII. Not to mention C is way more of a hassle than Rust.

Again, String is what you're looking for.

6 Likes

Strings are complicated because strings are complicated. I wonder how you would express your intention in any other language, C/C++ for example.

2 Likes

It is perfectly possible and reasonable to describe that the input must be such that its UTF-8 encoded length fits into 40 bytes, for example.
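
In Rust that constraint is a one-liner, since str::len() is already the UTF-8 encoded length in bytes (fits_in_40_bytes is just an illustrative name):

fn fits_in_40_bytes(name: &str) -> bool {
    // str::len() counts UTF-8 bytes, not characters.
    name.len() <= 40
}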

Second, you are making a mistake by thinking that UTF-32 or Vec<char> will give you "the number of user-perceived characters" or "the width of the text as it fits into the design".

It won't.

That would be grapheme clusters, and in Rust, a char is merely a code point, not a fully-fledged grapheme cluster.

Now grapheme clusters can be composed of any number of bytes – UTF-8 is a variable-width encoding at two levels, and the actual "bytes to code points" part is easy – you can have at most 4 times as many bytes as code points. But the "grapheme clusters to code points" part isn't easy – you have to parse and interpret the whole string in order to find out how many grapheme clusters it has, and it may be arbitrarily long (in terms of code points or bytes)!
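
To make that concrete, counting grapheme clusters requires a crate; this sketch assumes unicode-segmentation:

use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let s = "नि"; // Devanagari "ni": one user-perceived character
    println!("bytes:       {}", s.len());                   // 6
    println!("code points: {}", s.chars().count());         // 2
    println!("graphemes:   {}", s.graphemes(true).count()); // 1
}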

And then for the purposes of actually displaying those characters, there are the questions of:

  • Unicode width – some grapheme clusters have "double" width, i.e. they take up significantly more space than conventional characters (emoji, for example)
  • The choice of font – fixed-width, variable-width, kerning settings, ligatures, etc.

All of this basically means that if you want to have a bulletproof way of deciding whether your users' input data fits in your designers' text box, you have to actually go and render the text using the font it will be displayed with, and check how long it is. There is no other reliable way. No amount of "limiting the stack space" will get you there.

And for this, you have to understand the issue. You struck a rather accusatory tone with

nobody knows how the program actually works

It seems to me that you are the one who isn't aware of all the issues involved in implementing what you want. This is neither the fault of Rust nor the fault of the people trying to answer your questions here.

The hard truth is that text handling and layout are more complicated than assert!(username.len() <= 40), and Rust can't help you make decisions about things that inherently require more information (and neither can any other language, even if they pretend to do so). So please start by learning more about Unicode and text rendering in general before frustratedly denouncing those who are trying to help you.

12 Likes

I have found an interesting crate (https://crates.io/crates/arraystring). I will try it out later today.
Another interesting one: https://crates.io/crates/string-wrapper

Ok, I decided to go with [u8; N] for ANSI and [char; N] or [u32; N] for UTF-8. I cannot see any better solution for stack allocated sized strings. Everything else is going to String of course.

There is no reason to use String Arrays or String Wrappers, because they are all based on [u8]. The only thing I would consider is a helper to explicitly convert an array from/into str / String.

All the crates I have found so far are just redundant improvements with some limitations (like ANSI only) or extra complexity (like dynamic stack/heap). For me, the code must be short, clean, and straightforward.

I wish the Rust team could improve string support in the future. Rust should be simple and for everyone. E.g.:

// ANSI (Stack allocated)
{
    let mut my_str = str_ansi::new(3); // [u8; 3]
    my_str = "Cat";
    println!("Value: {my_str}, Capacity: {}, Length: {}", my_str.capacity(), my_str.len());
    my_str = "Bear"; // Compile Error
}

// UTF (Stack allocated)
{
    let mut my_str = str_utf::new(3); // [u32; 3]
    my_str = "Bąk";
    println!("Value: {my_str}, Capacity: {}, Length: {}", my_str.capacity(), my_str.len());
    my_str = "Wróbl"; // Compile Error
}
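
For what it's worth, something close to this is already expressible with const generics; a rough std-only sketch (all names here are made up, and the length check happens at run time rather than compile time):

use std::str::FromStr;

// Hypothetical fixed-capacity "UTF-32" string: at most N code points.
struct StrUtf<const N: usize> {
    buf: [char; N],
    len: usize,
}

impl<const N: usize> FromStr for StrUtf<N> {
    type Err = &'static str;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        let mut buf = ['\0'; N];
        let mut len = 0;
        for c in s.chars() {
            if len == N {
                return Err("too many code points");
            }
            buf[len] = c;
            len += 1;
        }
        Ok(Self { buf, len })
    }
}

fn main() {
    let ok: Result<StrUtf<3>, _> = "Bąk".parse();    // 3 code points: fits
    let err: Result<StrUtf<3>, _> = "Wróbl".parse(); // 5 code points: rejected
    assert!(ok.is_ok() && err.is_err());
}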

Related topic: rust - Why can fixed-size arrays be on the stack, but str cannot? - Stack Overflow

Thanks everyone for your incredible help and support. :+1: :smiling_face_with_three_hearts:

:face_with_raised_eyebrow:

2 Likes

You may want bstr or ascii to improve ergonomics for the byte-sized encodings. They can handle stack-allocated arrays.

Between [char; N] and [u32; N], I'd suggest the former as you'll know you have valid Unicode.


[char; N] gives you a simple limit in terms of code points, and maybe that's adequate for your use case. But "no more than 40 code points" isn't going to correspond to anything a non-technical user cares about in the general case, e.g. any sort of visual-based character count. Those more closely correspond to grapheme clusters.

For example, the name in this SO thread has 3 user-perceived characters but consists of 6 code points, is normalized, and passes the suggested name validation checks. [1]

That's part of the travails of Unicode, not anything Rust-specific: characters in the visual sense are variable length, even in UTF-32. Thus, in the general case, there is no meaningful and simple way to describe a length limit in a non-technical setting. "N code points" will likely have less variability over all inputs than "N UTF-8 bytes", but it's still variable as far as visually perceived characters go. Especially if you're not normalizing. [2]
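
For a concrete case of that last point (std only): the same user-perceived "á" can arrive as one code point or two, depending on normalization:

fn main() {
    let precomposed = "\u{00E1}";  // "á" as a single code point
    let decomposed = "a\u{0301}";  // "á" as 'a' + combining acute accent
    assert_eq!(precomposed.chars().count(), 1);
    assert_eq!(decomposed.chars().count(), 2);
    assert_ne!(precomposed, decomposed); // yet they render identically
}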

So if we really are talking about names here, your best bet may be to pick a length high enough that the possibility of a name being that long at all is very low; anyone who hits the limit will probably be annoyed, but used to it.

In a technical setting, any encoding seems sufficient to convey a length limit to me, so I'd consider other factors when choosing how best to back stack-based unicode strings.


  1. And also happens to be a valid Rust identifier. ↩︎

  2. All this variability is probably a big part of why even the stack-based Unicode string libraries mentioned above support a fixed capacity with variable length, in contrast with a fixed length, period. Outside of technical considerations like a storage limit or code optimizations, something having a fixed number of code units just doesn't mean much in Unicode, at least not without adding a bunch of restrictions that exclude some languages. ↩︎

12 Likes

Again and again (I've said the same thing so many times): the main problem is that you measure it in bytes, not in characters. When we declare [char; 10], we mean 10 characters (which is 40 bytes).

The second problem is variable length. It arises because UTF-8 is represented as [u8], not as [u32]. That is why we have all those problems with &str (because we want to fit UTF-8 into an array of u8). If we simply say that str_ansi is [u8; N] and str_utf is [u32; N], all problems with the "unsized" &str disappear.

So, returning to the grapheme clusters issue: I don't care, as long as I can fit those values into the [u8; N] (for ANSI) or [u32/char; N] (for UTF) arrays I declared (from my backend perspective; frontend devs should deal with it on their own).

str_ansi: [u8; 3] // stack allocated (known size)
A - x41
B - x42
C - x43

str_unicode: [u16; 3] // stack allocated (known size)
A - x0041
Ŭ - x016C
Ḉ - x1E08

str_utf: [u32; 3] // stack allocated (known size)
A -  x00000041
Ŭ -  x0000016C
🜆 - x0001F706

BTW, I am not saying that the unsized &str and String are bad; they serve their purpose very well.

Falsehoods programmers believe about names.

Use String, set your database column to utf8 text, don't put egregious length limits. Let your tools work for you.

Maybe this will help you understand better; this article has helped many Rust beginners:

You cannot treat strings as simply a list of codepoints.

3 Likes

No. Despite how convenient it would be, "how many characters are in a string" is not a well-formed question across multiple scripts.

Quoting the Unicode FAQ:

Q: How are characters counted when measuring the length or position of a character in a string?

Computing the length or position of a "character" in a Unicode string can be a little complicated, as there are four different approaches to doing so, plus the potential confusion caused by combining characters. The correct choice of which counting method to use depends on what is being counted and what the count or position is used for.

Each of the four approaches is illustrated below with an example string <U+0061, U+0928, U+093F, U+4E9C, U+10083>. The example string consists of the Latin small letter a, followed by the Devanagari syllable "ni" (which is represented by the syllable "na" and the combining vowel character "i"), followed by a common Han ideograph, and finally a Linear B ideogram for an "equid" (horse):

[image: the example string <U+0061, U+0928, U+093F, U+4E9C, U+10083> rendered as glyphs]

  1. Bytes: how many bytes (what the C or C++ programming languages call a char) are used by the in-memory representation of the string; this is relevant for memory or storage allocation and low-level processing.

Here is how the sample appears in bytes for the encodings UTF-8, UTF-16BE, and UTF-32BE:

Encoding   Byte Count   Byte Sequence
UTF-8      14           61 E0 A4 A8 E0 A4 BF E4 BA 9C F0 90 82 83
UTF-16BE   12           00 61 09 28 09 3F 4E 9C D8 00 DC 83
UTF-32BE   20           00 00 00 61 00 00 09 28 00 00 09 3F 00 00 4E 9C 00 01 00 83

  2. Code units: how many of the code units used by the character encoding form are in the string; this may be relevant, for example, when declaring the size of a character array or locating the character position in a string. It often represents the "length" of the string in APIs.

Here is how the sample appears in code units for the encodings UTF-8, UTF-16, and UTF-32:

Encoding   Code Unit Count   Code Unit Sequence
UTF-8      14                61 E0 A4 A8 E0 A4 BF E4 BA 9C F0 90 82 83
UTF-16     6                 0061 0928 093F 4E9C D800 DC83
UTF-32     5                 00000061 00000928 0000093F 00004E9C 00010083

  3. Code points: how many Unicode code points—the number of encoded characters—that are in the string. The sample consists of 5 code points (U+0061, U+0928, U+093F, U+4E9C, U+10083), regardless of character encoding form. Note that this is equivalent to the UTF-32 code unit count.

  4. Grapheme clusters: how many of what end users might consider "characters". In this example, the Devanagari syllable "ni" must be composed using a base character "na" (न) followed by a combining vowel for the "i" sound ( ि), although end users see and think of the combination of the two "नि" as a single unit of text. In this sense, the example string can be thought of as containing 4 “characters” as end users see them. A default grapheme cluster is specified in UAX #29: Unicode Text Segmentation, as well as in UTS #18: Unicode Regular Expressions.

The choice of which count to use and when depends on the use of the value, as well as the tradeoffs between efficiency and comprehension. For example, Java, Windows, and ICU use UTF-16 code unit counts for low-level string operations, but also supply higher level APIs for counting bytes, characters, or denoting boundaries between grapheme clusters, when circumstances require them. An application might use these to, say, limit user input based on a number of "screen positions" using the user-perceived "character" (grapheme cluster) count. Or the application might have an internal limit based on storage allocation in a database field counted in bytes. This approach allows for efficient low-level processing, with allowance for higher-level usage. However, for a very high-level application, such as word-processing macros, grapheme clusters alone may be sufficient.
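
For reference, the first three of those counts can be reproduced with std alone (grapheme clusters would need a crate such as unicode-segmentation):

fn main() {
    // The FAQ's sample string <U+0061, U+0928, U+093F, U+4E9C, U+10083>.
    let s = "\u{0061}\u{0928}\u{093F}\u{4E9C}\u{10083}";
    assert_eq!(s.len(), 14);                 // UTF-8 bytes / code units
    assert_eq!(s.encode_utf16().count(), 6); // UTF-16 code units
    assert_eq!(s.chars().count(), 5);        // code points
}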

It can get even more fun the more you look into details, because different cultures can look at the exact same glyph which is unified in the Unicode encoding and give you different answers for how many "user perceived characters" it is.


[u8] and [char] are also unsized types. If you want a fixed-size by-value str, you can use ArrayString or a similar custom type. Any such type will be parameterized by the largest string it accepts, and that limit will typically be expressed in Code Units. (For UTF-8, which Rust uses (and you should too; UTF-32 is actively a bad choice for 99% of applications), that's 1–4 u8 Code Units per Code Point.)

The lack of a built-in array type equivalent for str is unfortunate, but not a big issue in practice, because fixed-size strings are exceedingly rare, and when they do exist, they are typically better stored in a different format (because they're e.g. a stringified UUID, which is better represented in its numeric form) or at least are specialized enough that they want a custom newtype around [u8; N] anyway.


And as a tertiary note: eliminate the string size limits. If it's a user-fillable field, it should have some limit to prevent abuse, but it should be set way beyond what anyone is going to run into, e.g. perhaps 1 KiB for single-line text entry, 1 MiB for multi-line. The only reason for fixed-size strings would be things like DB keys, and those should be numeric (e.g. UUIDs), not strings. Anything human-facing changes.

It is common to store non-textual data in a textual format for ease of identification and transferring through systems not designed for binary data. In such a system it can be reasonable to keep the stringified blob as an opaque blob. But it's not a string, despite being textual; it's still whatever data got stringified, just in a different format. If you're moving around stringified UUIDs, you're still working with UUIDs, not strings.


And to belabor the point: don't use UTF-32; don't use [char]. UTF-32 is not a fixed length encoding. á may be one code point or two code points, depending on how it was typed and what input method the typist was using. :family_woman_woman_girl_girl: is seven code points, and even more if you introduce skin tone modifiers.
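
A std-only check of that claim, with the family emoji spelled out as explicit code points:

fn main() {
    // Four people joined by three U+200D zero-width joiners.
    let family = "\u{1F469}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F467}";
    assert_eq!(family.chars().count(), 7); // 4 people + 3 joiners
    assert_eq!(family.len(), 25);          // UTF-8 bytes
}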

To the ire of many developers, there is no silver bullet for a user-friendly maximum length.

14 Likes

Do you think I should use an unlimited TEXT type on all my columns in the DB? I think that's suicide. It is going to be non-clustered and super slow.

Yes. Some people's names are a telling of their life. And they change with time.

(If it's MySQL or a sibling, use utf8mb4, not utf8. In MySQL databases, utf8 is utf8mb3, a legacy nonstandard encoding which should never be used except to migrate old tables using it to utf8mb4. Also use TEXT columns, not fixed size CHAR columns, for them.)

It used to be that DBs needed to use fixed-width columns for performance reasons. That is no longer the case, as every production DB has improved performance working with TEXT columns. (These shouldn't be the index columns; use numeric keys (e.g. UUIDs) for association.) In fact, using fixed-size columns is nowadays often a pessimization, because it requires overallocating space in every row just for the small number of rows which need it. TEXT, or your DB's equivalent, handles these common-small, uncommon-large cases perfectly.

7 Likes

Sorry, I meant [u8; N] and [char; N] in that sentence.

OK, three simple questions for everyone:

  1. Is [u32; 3] a known-size string that can be stack allocated?
  2. Can I fit ANY up to 3 UTF-8 characters into the [u32; 3] variable?
  3. Would [u32; 3] have better performance than String (allocating 1 million variables)?

BTW: I know [u32; 3] would take more memory than str/String, depending on the characters. This is a tradeoff.

[u32; N] is an array of u32, not a string. So no, it's not a string, it's an array of numbers. [char; N] is an array of codepoints, which can be treated as a UTF-32 string (but you should not be using UTF-32).
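
To see why char is preferable to u32 here (std only): char rules out values that aren't valid Unicode scalar values, while u32 accepts anything:

fn main() {
    assert_eq!(char::from_u32(0x0041), Some('A'));
    assert_eq!(char::from_u32(0xD800), None);    // surrogate: not a scalar value
    assert_eq!(char::from_u32(0x11_0000), None); // beyond U+10FFFF
}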

No, there is no such thing as a "UTF-8 character". A Unicode Code Point is 32 bits, and is best stored as char, not u32.

A UTF-8 Code Unit is u8. It takes 1–4 UTF-8 Code Units to encode a Code Point.

Exactly (not up to) 3 Code Points are stored in [char; 3].

The number of "characters" depends on how you define characters. (Fun fact: Twitter is perhaps the most notable site with a low character count limit on posts. They use a custom in-house solution for counting characters which is distinctly not based on the number of Code Points or Code Units. There's some correlation, but it's not consistent.)

If the only purpose is comparing for equality, then yes, using [u8; N] or an array string type will be more efficient to copy around than cloning String.

And if the only purpose is comparison, there's no reason to use UTF-32 rather than UTF-8. So just use UTF-8. If text is user entered, it's not going to be a fixed number of code points long, so there is no benefit to pretending UTF-32 gives you a fixed size. Just use UTF-8.

And if it's a text encoding of numeric data, that's even more reason not to use UTF-32. Either use a newtype around [u8; N] if you want to treat it as an opaque identifier, or actually handle it as the numeric data it actually is.
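
As a sketch of that last option (Tag is a made-up name): a newtype over [u8; N] is Copy, so it moves and compares without touching the heap:

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct Tag([u8; 8]); // opaque fixed-size identifier

fn main() {
    let a = Tag(*b"ABCD0001");
    let b = a; // plain bitwise copy, no allocation
    assert!(a == b);
}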

2 Likes