Question about the Stack memory and Strings

In Rust's official book, it says:

[...]
A String is made up of three parts, shown on the left: a pointer to the memory that holds the contents of the string, a length, and a capacity. This group of data is stored on the stack (see image below).

I'm a bit of a beginner. My question is whether this "allocation trifecta" for Strings is a common representation convention across programming languages, or whether it is Rust's own way to represent Strings.

By "allocation trifecta" I mean this "pointer-length-capacity" trio to represent strings. I guess it has a more formal, technical name so apologies upfront :slight_smile:

Can someone point me to an appropriate source where I can read more on this? I can find hundreds of "memory allocation" articles on the internet, but I had no luck for this particular subject. Thank you.

[image]

It is by no means universal, but it's probably not unique to Rust. Rust does not provide a stable ABI for the actual layout of the three fields; in fact the length currently is after the capacity.
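
To see the three parts for yourself, here is a minimal sketch. `string_parts` is a hypothetical helper name made up for this example; `as_ptr`, `len`, and `capacity` are the standard `String` methods. The three-word size is what current compilers produce, not something the language promises.

```rust
use std::mem::size_of;

/// Returns the (pointer, length, capacity) trio of a `String`.
/// `string_parts` is a hypothetical helper used only for this sketch.
fn string_parts(s: &String) -> (*const u8, usize, usize) {
    (s.as_ptr(), s.len(), s.capacity())
}

fn main() {
    let s = String::from("hello");
    let (ptr, len, cap) = string_parts(&s);
    println!("ptr = {:p}, len = {}, cap = {}", ptr, len, cap);

    // On current compilers a String occupies three machine words, but the
    // field order is an implementation detail, not a stable guarantee.
    println!("size_of::<String>()    = {}", size_of::<String>());
    println!("3 * size_of::<usize>() = {}", 3 * size_of::<usize>());
}
```

The accessors are the only supported way to get at the three values; code should not assume where each field sits inside the `String` itself.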

Thank you Tom. Do you have any external articles you can point me to? I don't know what an "ABI" is either; a quick search on Google tells me it stands for Application Binary Interface. I guess I'll have to read a book on Computer Science before taking on Rust/C :sweat_smile: :joy:

"Stable ABI" just means that something is guaranteed to be stable so that the specified details can be relied upon by other programs. When programs interact, they ultimately do so through an ABI, because at the hardware level everything must be specified without ambiguity. Those specifications are about representations of details as values, layouts, etc.

I probably should have used the acronym API (application programming interface) instead, though with respect to layout ABI is perhaps more correct.

Technically we're talking about a "string buffer" here. This buffer is used to manage the memory that stores an object (in this case a string). Rust somewhat confusingly calls the string buffer type String and the string itself &str. The &str is just the ptr and length without the capacity, so it can't grow like the buffer can.
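
A small sketch of that difference, assuming nothing beyond the standard library. `grow` is a hypothetical helper name; the point is that only the owned `String` can be appended to, while a `&str` is a two-word view (pointer and length) into bytes that something else owns.

```rust
use std::mem::size_of;

/// Appends `suffix` to an owned buffer. Only `String` can grow like this;
/// `grow` is a hypothetical helper used only for this sketch.
fn grow(mut buf: String, suffix: &str) -> String {
    buf.push_str(suffix);
    buf
}

fn main() {
    let buf = grow(String::from("hello"), ", world");

    // A &str borrows part of the buffer: just a pointer and a length.
    let view: &str = &buf[0..5];
    println!("buffer: {buf}");
    println!("view:   {view}");

    // Two machine words for the view, with no capacity field.
    println!("size_of::<&str>() = {}", size_of::<&str>());
}
```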

To answer your question, it's a common enough representation for a buffer. For example, Microsoft's NT kernel (which dates back to 1993) uses something similar for its string buffers:

struct UNICODE_STRING {
  USHORT Length;         // len, in bytes
  USHORT MaximumLength;  // capacity, in bytes
  PWSTR  Buffer;         // ptr
};

However, it's not the only way to do it. For example, the length can be encoded by marking the end of the string with a zero byte (aka "null terminated" strings). The allocation capacity could be stored separately in a lookup table. Or strings could always be immutable so they can never ever grow or shrink.
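
For the null-terminated variant, here is a rough sketch using the standard library's `CString`. `c_strlen` is a hypothetical helper written for this example (it mimics what C's `strlen` does); note that the length is not stored anywhere and has to be found by scanning.

```rust
use std::ffi::CString;

/// Finds the length of a null-terminated byte buffer by scanning for the
/// first zero byte. Returns None if there is no terminator at all.
/// `c_strlen` is a hypothetical helper used only for this sketch.
fn c_strlen(bytes: &[u8]) -> Option<usize> {
    bytes.iter().position(|&b| b == 0)
}

fn main() {
    // CString stores the text followed by a single trailing zero byte.
    let c = CString::new("hello").expect("input has no interior zero bytes");
    let raw = c.as_bytes_with_nul();

    println!("bytes in memory: {:?}", raw);
    println!("scanned length:  {:?}", c_strlen(raw));
}
```

The trade-off is that finding the length costs a walk over the whole string, and the string itself can never contain a zero byte.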

I'm not sure where to find a good introductory article, sorry. Then again I don't think you need to thoroughly understand all this before getting started with Rust. In fact Rust might be a good way to pick up some computer science concepts as you go.

I would strongly suggest you do not do that. Computer Science is for the most part far removed from such practical matters and gets very mathematical. Most programmers do not know most of Computer Science, nor do they need to. No more than bookkeepers know calculus or number theory.

As the famous Edsger Dijkstra said many years ago:

Computer science is no more about computers than astronomy is about telescopes.

It sounds to me like you would benefit from some far more down-to-earth and practical experience. How does a computer work? What are instructions? What is a variable? What is stack space? What is heap space? What actually is memory on a computer?

A little time spent learning how to program in assembler would enlighten all of that. As it did for us back in the day when we started out with BASIC but were expected to be aware of such low level things.

As for the String trifecta:

A string is just a bunch of bytes in memory somewhere. You need to know where it is to use it. That is its memory address. If you have a variable that holds that address you have a "pointer". That is trifecta item 1.

Item 2 is that you need to know where that bunch of bytes in memory ends. The end of the string. A length variable would do to tell you that. With that you know how many bytes up from your pointer are part of your string.

Item 3 is that possibly that bunch of bytes in memory has a lot of free, unused, space, after wherever the length says it ends. That is the capacity. You could add more characters to the string in that free space and adjust the length accordingly.
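
The three items can be watched in action with a short sketch. `push_within_capacity` is a hypothetical helper name for this example; it reserves some spare room up front, appends into it, and reports whether the buffer had to move.

```rust
/// Creates a buffer with room for at least `cap` bytes, appends `extra`,
/// and returns (length, capacity, whether the pointer changed).
/// `push_within_capacity` is a hypothetical helper used only for this sketch.
fn push_within_capacity(cap: usize, extra: &str) -> (usize, usize, bool) {
    let mut s = String::with_capacity(cap);
    let before = s.as_ptr(); // item 1: where the bytes live
    s.push_str(extra); // fills some of the free space
    (s.len(), s.capacity(), s.as_ptr() != before)
}

fn main() {
    let (len, cap, moved) = push_within_capacity(16, "hello");
    println!("len = {len}");       // item 2: bytes actually in use
    println!("capacity = {cap}");  // item 3: bytes available before regrowing
    println!("buffer moved = {moved}");
}
```

As long as the new text fits in the spare capacity, only the length changes. Once it no longer fits, the buffer is reallocated and the pointer may change.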

Note that C has only the first item of your trifecta. C has pointers to strings. But it has no idea of the length. The end of a string is marked by a zero byte.

Actually C has no such data type as string. Only pointers to where characters are supposed to be. You have to make it up as you go along.
