Memory alignment for vectorized code

In C and Rust we can rely on the compiler to allocate memory in a layout that "works" for the type and platform at hand. Works means the starting point, i.e., the memory to which the pointer refers, starts at the same location the CPU architecture expects without having to perform shift operations. Works also means, that in context of a collection, the pointer's stride will land similarly when advanced. Per an example I took note of a while back:

Note: This example is not intending to represent how Rust (or other) might layout the memory per se, but rather just a sketch to "level-set" my understanding of the benefit of, and requirements for, aligned data.

struct {
  a: char;
  b: i32; 
  c: i16; 
}

the compiler will layout an instance on a 32-bit machine in a way that aligns with the platform's memory layout (it's stride):

 -----
 | a | 0x0000
 |   |
 |   |
 |   |
 -----
 | b | 0x0004
 | b |
 | b |
 | b |
 -----
 | c | 0x0008
 | c |
 -----

But say we transferred the structure over a network that compressed the data:

 -----
 | a | 0x0000
 -----
 | b | 0x0001 🚫 
 | b |
 | b |
 | b |
 -----
 | c | 0x0005 🚫
 | c |
 -----

This memory is no longer aligned. In order to read b there are now several steps:

  • go to 0x0000 (the memory tables don't include 0x0001)
  • read 0x0000 into an output register (=> abbb)
  • shift left 1 byte (abbb => bbb0)
  • read 0x0004 into a tmp register (=> bccx)
  • shift right 3 bytes (bccx => 000b)
  • combine with bitwise or: bbb0 | 000b => bbbb

When I have allocated memory for a concrete instance of a slice &[T], the compiler will layout the memory so it "works" (at the expense of wasting bits to ensure the stride lands aligned with what needs to be read without needing to perform shifts).

The question: What is it about using vectors (in SIMD code) that might somehow

  1. "corrupt" the sequential layout
  2. put into doubt that the starting position of the pointer is not aligned with what the platform expects (to read without shifts)

Scenarios
Vertical computations (computations that merge lanes from different vectors)?
aligned, aligned -> aligned

Horizontal computations
aligned first element, aligned second element... -> aligned result in the first element position

If I take bytes from a buffer fed by a file or stream
ptr = &[0..] aligned -> ptr = &[n..] aligned

Finally, if I ask Rust to allocate for Vec<MyU8>, on a 64-bit machine (memory layout), will it use 64-bits for each of the MyU8 (8-bits)? To minimize "the waste", in this case I could create another type with 8 x MyU8 as the element/item in my Vec; is that required? But nothing that I can imagine so far throws off or changes the value of ptr = &[0..].

Thank you in advance to anyone that can help clarify the alignment "gotcha" when working with vectorized code.

- E

1 Like

To clarify: How are you defining the type MyU8?

1 Like

Thank you @mbrubeck The type would have 8-bits of information and be sufficiently typed (if you will) for the compiler to index the array such that each unit/element starts on aligned memory.

struct MyU8(u8);

Would each element in the following consume 64-bits on a 64-bit platform (so total 64x10 bits)? My guess would be yes to ensure alignment when incrementing the pointer along the Index.

[MyU8; 10] 

By default, that type will have 1-byte alignment, and [MyU8; 10] will be 10 bytes wide. But if you use a #[repr] attribute to modify its alignment:

#[repr(align(8))]
struct MyU8(u8);

Then the type [MyU8; 10] will be 80 bytes (640 bits) wide: Playground.

2 Likes

Wow. So given that MyU8 has a 1-byte alignment by default, how does a 64-bit platform not have to perform shifts "under the hood" to access, for instance, the second element?

Most modern ISAs (instruction set architectures) provide direct byte accessing of data. They do so because so many communication and storage protocols pack data as sequences of bytes. On such platforms memory access efficiency is provided by multi-level memory cache hierarchies.

1 Like

Thank you. That helps explain a few things. So, assuming the multi-level cache approach addresses any performance hits, what makes alignment such a concern by many credible teams (pack_simd, stdsimd and others), so meaningful? The Intel and journal articles also go on about the issue; non-trivial hits to performance. And to my original question what in particular is the cause of misaligned data inside the bounds/scope of what you have described?

SIMD instructions have higher alignment requirements than scalar instructions—typically 16-byte (128-bit) or higher. So while non-SIMD instructions are able to operate on a [u32] slice with only 4-byte alignment, the same slice might not be aligned correctly when treated as a SIMD vector.

Misalignment can also happen when doing things like raw pointer casts to reintepret the same block of memory as a different type. For example, you might read some data from the network or filesystem into a [u8] buffer (1-byte alignment). If you then use a pointer cast to treat this as a [u16] slice (2-byte alignment), then it might cause unaligned reads.

Rust provides align_to as a helper function for use cases like "I have a vector of u16s but I want to use SIMD to the extent possible". If you like computer math you'll probably also like the making-of blog post.

2 Likes

A couple of observations on this:

  • char is 4 bytes in rust because it represents an Unicode scalar value, not 1 byte representing an ASCII character as in C/C++
  • The compiler is not guaranteed to produce this layout, you need #[repr(C)] for guarantees on how the struct will be layed out in memory. For example if a was a u8 the current compiler (note that this is unspecified behaviour, don't rely on it) would move it at the end of the struct, reducing the need of padding bytes and bringing the struct size down to just 8 bytes (vs the 12 bytes if you used #[repr(C)])
4 Likes

Becase "alignment" means exactly that. If a type T is aligned to N bytes, it means that it can only be read (efficiently or at all) from addresses which are multiples of N.

I think you are confusing size and alignment. It would be possible to imagine a 64-bit platform that only allowed reading from addresses of the form 8 * N. However, in this case, we would say that bytes have a size of 1 but an alignment of 8.

In general, on most platforms Rust runs on, "simple" primitive types have a size of N and an alignment of N as well, so bytes can be read from any address, u16 and i16 can be read from even addresses, etc.

1 Like

Perfect.
:white_check_mark: Conclusion: when using vectors, I need to consider memory alignment for the vector (as a whole) as I would an element in an array if I want to process a collection of vectors. This requires manual signaling to the compiler to treat it (vector) accordingly.

:white_check_mark: I can see how reinterpreting the memory within a vector needs care; furthermore, that reading elements within a vector is not going to be optimized (because it's optimized for the vector).

:white_large_square: where I'm unclear is how it's possible to lose the aligned attribution ptr aligned -> ptr misaligned of the first element in the first vector. The address to which the zero index borrow for &[T] should be the same value as that (address) in the return value in &[T] -> &[Vector].

Question: Can the mere act of having the type system treat a series of bits one way to another, by definition, invalidate the alignment guarantee of the start address as described? i.e., the invariant must subsequently be checked.

The so what:
In the scenario where the starting address invariant holds, then a math exists to ensure that changes in representation be done in a way that remain optimal for the vector unit of computation (that includes loading, transforming etc..). As long as the stride that matchs the vector remains constant, I'm good to go. I can safely compute "unchecked" as long as these assertions hold.

In the scenario where somehow the initial pointer gets misaligned, we can expect to have to adjust (shift) the memory with every load of a vector. Otherwise, there are intrinsics that process unaligned memory, but at a performance cost.

The information here is helping me to pinpoint how to avoid this scenario and otherwise know when it's ok to safely compute with "unchecked" memory.

Thank you @SkiFire13. I added a note to the original post to qualify the scope of what I'm representing. Your point and clarification is well-taken. I don't want to generate "noise"; simplify yes, generate noise, no. The details of how #[repr(C)] are also helpful for the implementation of what I'm working on. So thank you on both counts.

The problem is if you convert from &[T] to &[U] where U has a higher alignment than T. The starting address for the &[T] slice is guaranteed to be properly aligned for T. But the same adress is not necessarily properly aligned for U.

2 Likes

The explicit distinction between size and alignment is a solid contribution to rectifying and describing how I'm thinking about it.

:white_check_mark: At the level of the compiler, think in terms of "size of N and an alignment of N as well". And, that relationship is what is meant by "aligned memory".

Where I get "pulled back in"... to the mud, is in having to think about the size of the registers when working with SIMD and the platform-dependent intrinsics. However, both you @H2CO3 and @mbrubeck have stated that Rust will align the data without having to resort to padding (the "wasted" bits in the original sketch) to accommodate the platform. @TomP suggested why the padding is not required to compensate for the inherent stride of the platform. This simplifies the issue a lot.

Got it: a starting address of 0x1004 set for T of size 32-bit, is aligned. But if I cast [u32; 2] to u64 the address is no longer aligned. (So the multi-level cache mechanism can't just start from 0x1004 and perform 64-bit strides instead? mostly rhetorical but good for drawing the picture).

So, if I'm understanding correctly, I can avoid using misaligned memory by first initializing the memory with the larger stride (size of the vector). As long as the vector stride fits some number of T I can subsequently cast the memory to a collection of T and be assured the memory remains aligned. Thus, the design need only anticipate the use of the larger vector to initialize the memory accordingly. As long as I keep using that memory, and cast only to a T, where the size of n x T = the size of the vector, it's safe and consumable by intrinsics that require aligned memory. If this is correct, I have my answer.

Yes, this is correct.

Thank you, to all. I'm blown away by your contributions and patience!

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.