Type alignment (understanding memory layout)

Can someone help me understand the terms "size", "alignment", and "padding"? "Alignment" is of particular interest.

To the best of my ability, I try to read the reference first, and I did so in this case. Unfortunately, the reference format itself is not intended for guidance or learning, and I do not have enough knowledge and experience to understand, or even imagine, what exactly alignment is for.

In particular, I do not understand:

  1. The alignment of a value specifies what addresses are valid to store the value at. A value of alignment n must only be stored at an address that is a multiple of n.

Why are other addresses invalid to store the value at?

  2. For example, a value with an alignment of 2 must be stored at an even address, while a value with an alignment of 1 can be stored at any address. Alignment is measured in bytes, and must be at least 1, and always a power of 2.

Why does alignment need to be a power of 2?


To put it bluntly, I'd like something like a visual representation of memory blocks for different alignment cases within a structure, although I will appreciate any help that makes the topic clearer. If a layout with stricter guarantees (#[repr(C)]) is in some way easier to explain, use it.

Example

struct S {  // size 4, alignment 2; what does this look like?
    u: u8,  // e.g. S: |--|__|----|
    i: u16, //          u8     u16, '_' - padding
}

I want to emphasize that I am not looking for an exhaustive explanation; I would just like to get a basic picture as a beginner. Please do not advise me to read Wikipedia or scholarly papers on the topic, and do not assume C(++) knowledge or anything else beyond basic Rust and the notion of memory as a table of same-sized cells, each with an address, with a pointer being a cell that stores the address of another cell.

2 Likes

When a type has alignment of 2^n, that means that all pointers to that type must have an address where the last n digits are zero (in binary).

  • Many CPUs have instructions that only work if the pointer is sufficiently aligned.
  • If you know that a pointer is 8-byte aligned, then you know that its address ends in three zero bits. You could use that space to store three bits of data (see the sketch below).
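
Here is a minimal runnable sketch of that second point, assuming a typical 64-bit target where Box<u64> is 8-byte aligned (the tag value is made up for illustration):

// A sketch of "pointer tagging": the low 3 bits of an 8-byte-aligned
// address are always zero, so they can carry a small tag.
fn main() {
    let addr = Box::into_raw(Box::new(42u64)) as usize;
    assert_eq!(addr % 8, 0);        // 8-byte aligned: low 3 bits are zero

    let tagged = addr | 0b101;      // stash 3 bits of data in those zeros
    let tag = tagged & 0b111;       // ...and split them back out later
    let untagged = tagged & !0b111;
    assert_eq!((tag, untagged), (0b101, addr));

    // Give the memory back to avoid leaking it.
    drop(unsafe { Box::from_raw(untagged as *mut u64) });
}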

Memory is generally organized into pages, where each page takes up 4096 bytes. If you have a 4-byte integer that is 4-byte aligned, then you know for sure that all 4 bytes are in the same page. For example, an address that would put 2 bytes in one page and 2 bytes in the next page is not 4-byte aligned. Having the entire integer in one page is useful for many reasons, such as performance.
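
A tiny sketch of that last point in plain arithmetic (4096 is just the usual page size, nothing Rust guarantees):

// If a 4-byte value starts at a 4-byte-aligned address, its first and
// last bytes always land in the same 4096-byte page.
fn main() {
    const PAGE: usize = 4096;
    for addr in (0..3 * PAGE).step_by(4) {
        assert_eq!(addr / PAGE, (addr + 3) / PAGE); // same page for all 4 bytes
    }
    // An unaligned counter-example: a 4-byte value at address 4094
    // straddles two pages.
    assert_ne!(4094 / PAGE, (4094 + 3) / PAGE);
}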

12 Likes

To support working with chunks of memory larger than a byte, computers have instructions that operate on multiple bytes at a time. For instance, an ADD instruction might operate on 4 or 8 bytes at a time. Memory is also generally organized in chunks ("words") larger than a byte, because (among other reasons, but on the most basic level) square-ish arrays are more efficient and easier to design with than long skinny ones.

So you generally get more than one byte out of memory at a time and the instructions your CPU supports also deal with more than one byte at a time. That means when you're dealing with a u16, you don't load one byte and then load another byte, you load the whole 16 bits at once, and the hardware is organized to efficiently load those - the first 8 bits into the low part of the register, and the second 8 bits into the high part, or vice versa, depending on endianness.

This doesn't work if the two bytes are split across two memory words, or if they're in the "middle" of the memory word when the efficient instruction expects them to be at the "end". In order to read a u16 from two bytes with arbitrary placement, you do have to read them one at a time and stitch them together on the fly - this is called doing an "unaligned read". Since unaligned access is slow, it is usual to avoid it: always put bytes that will be accessed as a unit at addresses that are properly aligned.
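
Here is roughly what that stitching looks like in Rust terms (a sketch; the buffer contents are arbitrary):

// Reading a u16 that starts at an odd offset in a byte buffer: fetch the
// two bytes separately and combine them, or let read_unaligned do it.
fn main() {
    let buf: [u8; 4] = [0xAA, 0x34, 0x12, 0xBB];

    // By hand: two single-byte loads, stitched together.
    let by_hand = u16::from_le_bytes([buf[1], buf[2]]);
    assert_eq!(by_hand, 0x1234);

    // The same thing through a raw pointer; a plain aligned load would
    // not be allowed here because the address may be odd.
    let ptr = buf[1..].as_ptr() as *const u16;
    let stitched = unsafe { ptr.read_unaligned() };
    assert_eq!(stitched, u16::from_ne_bytes([buf[1], buf[2]]));
}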

Simply because that makes memory design and instruction set design simpler. If you are dealing with binary addresses, your decoder doesn't have to do any math to decode an address that must be divisible by 4: just ignore the last two bits (which, as alice mentioned, may also be useful for keeping track of metadata).
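
In code, "ignore the last two bits" is just a mask; no division required (a small sketch with an arbitrary address):

fn main() {
    let addr: usize = 47;           // 0b10_1111
    assert_eq!(addr & 0b11, 0b11);  // low two bits set: not divisible by 4
    let chunk_start = addr & !0b11; // round down to a 4-byte boundary
    assert_eq!(chunk_start, 44);
    // With a 7-byte "chunk" you'd need a real division instead:
    assert_eq!((addr / 7) * 7, 42);
}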

Making an efficient decoder for a 7-byte-word memory array sounds like a trick question from the world's worst job interview. :thinking:

6 Likes

Thanks for the examples, they look reasonable, but I most likely need to think about each of them. Let me reply with what I currently did not understand from your examples:

  1. Obviously, the concept of "sufficiently aligned" requires clarification, if that is possible without digging too deeply. Otherwise, it is not clear why instructions were designed to work with such addresses. How do they use the fact that there are zeros at the end?
  2. If I use something like #[repr(packed)] as in this example, setting the alignment to 1, how will these instructions work, and why are a raw pointer and methods such as write_unaligned() required to refer to such values? What if the type lies across two pages?

The simplest possible example is: literally omitting the wires that would carry those low bits of the address. (Practical cases are more complex than that.)

When code operates (correctly) on unaligned values, they have to be copied byte-by-byte (or at least, using a copy operation that is valid for unaligned values) to an aligned location or a register before operating on them.
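
A minimal sketch of what that looks like with #[repr(packed)]: the u16 field lands at offset 1, references to it are rejected by the compiler, and a raw pointer plus read_unaligned does the byte-wise copy:

use std::mem::{align_of, size_of};
use std::ptr;

// With packed layout the u16 field sits at offset 1, which is not
// 2-byte aligned, so the compiler may not emit a plain u16 load for it.
#[repr(C, packed)]
struct Packed {
    u: u8,
    i: u16,
}

fn main() {
    assert_eq!((size_of::<Packed>(), align_of::<Packed>()), (3, 1));

    let p = Packed { u: 1, i: 0x1234 };
    // `&p.i` is rejected (references promise alignment); a raw pointer
    // makes no such promise, so this is the sanctioned way in:
    let field = ptr::addr_of!(p.i);
    let value = unsafe { field.read_unaligned() }; // copied out byte-wise
    assert_eq!(value, 0x1234);
}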

3 Likes

They don't. That's precisely the issue. Look at a u8: eight zeros and ones. Now look at a DDR5 DIMM: 288 connectors. And they are used in pairs.

That means that if your CPU asks for byte number 137, memory readily responds… and sends bytes 128 to 143 back… sure, byte 137 is in there, but it would be nice to make use of the fact that you get so many bytes “for free”.

Different CPUs and different systems have different “cache line” sizes (although today it's usually 16 bytes; in the past we had 8 bytes, 4 bytes, even 2 bytes… very long ago).

The compiler handles all of that. But this may require a few additional instructions and manipulations of values in registers, which could be slow.

2 Likes

One other thing to keep in mind: if a value is N bytes large and N-byte aligned, it won't cross the boundary of a larger alignment N*M. Alignment means you don't get values that need two larger aligned accesses to fetch the whole value. (This is why primitive values generally have the same alignment as their size.)
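
That last parenthetical can be checked directly (a quick sketch; true on typical targets, though u64 can have a smaller alignment on some 32-bit platforms):

use std::mem::{align_of, size_of};

fn main() {
    assert_eq!((size_of::<u8>(), align_of::<u8>()), (1, 1));
    assert_eq!((size_of::<u16>(), align_of::<u16>()), (2, 2));
    assert_eq!((size_of::<u32>(), align_of::<u32>()), (4, 4));
    // A u32 at a 4-byte boundary therefore never straddles an 8- or
    // 16-byte boundary either, since those are multiples of 4.
}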

There are also atomic instructions, which must not partially update the bytes of a value when multiple CPU cores are modifying it at the same time: it's a lot easier to keep track of which core has changed what if they're not doing it for literally every byte.
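
For completeness, a small sketch of such an atomic type in use; the only thing that matters here is that each update touches all 4 bytes of the AtomicU32 as one unit:

use std::sync::atomic::{AtomicU32, Ordering};
use std::thread;

static COUNTER: AtomicU32 = AtomicU32::new(0);

fn main() {
    // Four threads bump the counter; no thread ever observes a
    // half-written value, because each update is a single 4-byte unit.
    let handles: Vec<_> = (0..4)
        .map(|_| thread::spawn(|| {
            for _ in 0..1_000 {
                COUNTER.fetch_add(1, Ordering::Relaxed);
            }
        }))
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    assert_eq!(COUNTER.load(Ordering::Relaxed), 4_000);
}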

2 Likes

Thanks for the clarification. From your answer, I conclude that there are (or can be, depending on the platform) instructions allowing one to read an arbitrary byte, but that this is inefficient. Right?

I also understand that the specification requires pointers to have addresses that are multiples of 2^n, and that the implementation uses internal optimizations and/or special instructions for such pointers. Raw pointers (&raw const / &raw mut) and the _unaligned() methods do not have such a restriction, but use less efficient instructions and stitch values together on the fly, as you mention. Right?
This observation need not be 100% accurate or implemented in exactly this way, but it should follow similar logic.

Could you also clarify what kind of work the decoder does and why it may need an address that is divisible by 4? Suppose the decoder needs such an address and obtains it by ignoring the last two bits. Why might the decoder need such an address? What does the decoder do with it? It is a part of the hardware, but is it important?

Looks like a pretty clear example: memory sends bytes back in blocks, and it's desirable for a type to be within the bounds of one block so that its value can be retrieved with fewer operations.

What about the fact that there are 288 connectors (in pairs)? Why is that important?

This is a definition remarkable in its conciseness, similar in style to the textbooks I am used to, but one which I could not find on my own. Thanks a lot!

I got the impression that at the level of instructions it is not about values, and that instructions rather work with blocks; is that not so? In any case, if we are talking about the alignment and size of Rust structures, then I think we should talk about instructions that allow one to atomically operate on an arbitrary 2^n-sized block of memory in order to work with the entire structure. Please correct me if I'm wrong.

For example, a pseudo-instruction GET_4_BYTES_BLOCK 0b001011 treats 0b001011 as 0b00101100 internally. That is, the last two bits are not used, which gives possible advantages that are not the topic of this discussion. What is important is that there is a set of such instructions and that the compiler makes use of them. Is it something like this?

What about this type with size 4 ≠ alignment 2?

struct S {  // size 4, alignment 2; what does this look like?
    u: u8,  // e.g. S: |--|__|----|
    i: u16, //          u8     u16, '_' - padding
}

Atomic instructions are rather specialized and are required only for communication across threads; in Rust they are provided by the types in std::sync::atomic - but they are mostly just "do this normal thing, but work properly when multiple threads are doing it".

Broadly, instructions can be rather general "copy these bits around", yes, but most instructions that actually "do things" are fairly direct representations of the basic math operators for each relevant type: 32-bit integer add, 64-bit floating point divide, etc. There are also a few control flow instructions to support branches, loops, calling functions, and so on, and together these three groups are most of what a CPU does. (Not all: there are a bunch of "writing an operating system" instructions, and lots of "do this thing but faster" instructions added over time, for example.)

But while an instruction might only move 4 bytes, that doesn't mean the actual hardware only moves 4 bytes. In practice they operate on registers (e.g. of 64 bits) and cache lines. That might be what you're thinking of.

It doesn't matter as much if a structure crosses a cache line, since each member it contains will still be aligned and operated on individually, but in some rare cases you may want to increase the alignment to avoid that, yes. Only when you can show that it actually increases performance, though; making memory use less dense by adding padding is normally worse for performance.
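
To tie this back to the S from the question, the numbers can be checked directly; this is a sketch assuming #[repr(C)] for a fixed field order (offset_of! needs a reasonably recent Rust), with align(64) shown as one way of opting into a larger alignment:

use std::mem::{align_of, offset_of, size_of};

// #[repr(C)] fixes the field order: u at offset 0, one byte of padding,
// then i at offset 2, so the u16 sits at an even address.
#[repr(C)]
struct S {
    u: u8,
    i: u16,
}

// Opting into a larger alignment (e.g. a cache line) looks like this:
#[repr(C, align(64))]
struct Padded {
    u: u8,
    i: u16,
}

fn main() {
    assert_eq!((size_of::<S>(), align_of::<S>()), (4, 2));
    assert_eq!((offset_of!(S, u), offset_of!(S, i)), (0, 2));

    // The size is rounded up to a multiple of the alignment.
    assert_eq!((size_of::<Padded>(), align_of::<Padded>()), (64, 64));
}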

1 Like

Another example is virtual memory. If you try to load a u16 from an odd address, in a very unfortunate case the address happens to be the last address of a memory page. In this case the 2 bytes of the u16 are stored in separate pages, which means they're stored at 2 random-ish physical addresses and need separate circuits to load the 2 separate bytes. It keeps getting more complex as the size grows - with u32 you need to handle the cases "load 4 bytes at once", "load 1 byte here and 3 bytes there", "load 2 bytes here and 2 bytes there", and "load 3 bytes here and 1 byte there" just to cover the unfortunate edge cases. It vastly complicates designing CPU circuits and makes compiler optimization cost modeling nearly unpredictable.
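
Those cases, spelled out for a 4-byte load near a 4096-byte page boundary (plain arithmetic, nothing unsafe):

fn main() {
    const PAGE: usize = 4096;
    for addr in 4092..=4095 {
        let page_end = (addr / PAGE + 1) * PAGE; // end of the page holding byte 0
        let in_first = (page_end - addr).min(4);
        println!("addr {addr}: {} + {} bytes", in_first, 4 - in_first);
    }
    // Prints:
    // addr 4092: 4 + 0 bytes  (aligned: no split)
    // addr 4093: 3 + 1 bytes
    // addr 4094: 2 + 2 bytes
    // addr 4095: 1 + 3 bytes
}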

4 Likes

Suppose you have a bunch of data stored in a spreadsheet. You could put all the data in the first column, but then you have to scroll a lot to find what you want. So you pick a number of columns that fits on the screen and you put data in all of them.

How would you pick what number makes sense? Suppose you pick 7. The cells are numbered starting at 0, so the first column contains cells 0-6, the second contains cells 7-13, the third cells 14-20, etc.

Now someone asks you "What's in cell 11923?" To know what row it's in, you divide by 7, which tells you it's in row 1703 (the remainder gives you the column).

Dividing by 7 isn't easy. What if you picked 10 instead? Now the first row contains cells 0-9, the second row 10-19, the third 20-29, etc. Cell 11923 is in row 1192, column 3 (counting both from 0). You didn't have to divide anything because the decimal system we use to write numbers already did the work for you.

Computers use binary, not decimal, so instead of powers of 10, the representation is based on powers of 2. That's why it's easy to divide by 2, 4, 8, 16, but not 3, 5, 7, etc.
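
The same arithmetic in code, if it helps (the cell number is the one from the analogy):

fn main() {
    let cell: usize = 11923;

    // 7 columns: a genuine division and remainder.
    assert_eq!((cell / 7, cell % 7), (1703, 2));

    // 8 columns (a power of two): a shift and a mask do the same job,
    // because the binary representation already "did the work".
    assert_eq!((cell >> 3, cell & 0b111), (cell / 8, cell % 8));
}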

4 Likes

I feel the need to emphasize that while this can get really complicated (this is a very deep well of technical detail!) - you can go pretty much forever not thinking about this. It is handy to know there is something to know if you happen to really need to get a bunch of performance, but honestly cache line concerns are pretty far down the list!

5 Likes

Math, control flow, and what is the third? Or do you mean "branches, loops, calling functions"?

As I understand it, the third requirement here makes it possible to read any field using the same instruction, requesting blocks of size equal to the maximum alignment of the structure's fields. And this is more efficient than reading with a lesser alignment, which raises the possibility that more than one instruction is needed to read a particular field, plus instructions to merge the pieces into a single field value.

  1. The alignment of the type is at least the maximum alignment of its fields.

The "moving bits around" group at the start of the paragraph. Sorry, I had some trouble grammering that part!

Any field is the important part there: instructions don't know about your structures, they only operate on the primitive types.

1 Like

It's not that “the last two bits are not used”. It's simpler: the last two bits simply don't exist.

Memory works with chunks larger than a byte. Usually 16 bytes today, but in the past 8 bytes were common. Perhaps tomorrow we'll go to 32.

Operations with bytes are an illusion built on top of that. Sometimes it's even done in the CPU, not in the compiler, but that doesn't change the fact that it's still an illusion.

You can't read or write just one byte[1].

What's important is that creating the illusion that you work with 4 bytes is easier when you know they are all in one 16-byte chunk. All the other requirements follow from that.


  1. Most of the time; there are 8-bit microcontrollers, but they are rare today even in the embedded world. ↩︎

1 Like

Thanks for the virtual memory example. With a higher level of abstraction, things tend to be clearer for me.

It seems that the analysis of the 4 separate cases you are talking about in the context of virtual memory relates not to the design of the circuits but to the writing of the operating system that implements virtual memory, keeping in mind that such non-aligned accesses also complicate that part of virtual memory design. Right?