First, most SIMD implementations are quite picky about alignment, and will only perform at their maximum speed if vector data is loaded contiguously from an address that is a multiple of the vector width (itself a power of 2). A practical consequence is that 3D vectors must be padded into 4D vectors even if you don’t need the fourth component, wasting memory bandwidth and vector lanes.
For example, here’s how you would load two 3D position vectors into a vector register of size 8:
| X1 | Y1 | Z1 | ## | X2 | Y2 | Z2 | ## | (## is padding)
With one array per coordinate, you can fill 3 vector registers perfectly instead:
| X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 |
| Y1 | Y2 | Y3 | Y4 | Y5 | Y6 | Y7 | Y8 |
| Z1 | Z2 | Z3 | Z4 | Z5 | Z6 | Z7 | Z8 |
This first concern is specific to 3D vectors; 2D and 4D vectors are fine.
Then, when it comes to actually doing stuff with the vectors, you will find that most “scalar algorithms” translate transparently into a vectorized version when using array-per-coordinate storage, whereas the array-of-vector storage will often require a specific, and sometimes less performant, algorithm.
For example, let’s assume that you want to compute the norm of 3D vectors:
n = sqrt(x² + y² + z²).
With array-per-coordinate storage, the implementation is very simple:
Load X in R1: | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 |
Square R1: | X1² | X2² | X3² | X4² | X5² | X6² | X7² | X8² |
Load Y in R2: | Y1 | Y2 | Y3 | Y4 | Y5 | Y6 | Y7 | Y8 |
Square R2: | Y1² | Y2² | Y3² | Y4² | Y5² | Y6² | Y7² | Y8² |
Add R2 to R1: | X1² + Y1² | X2² + Y2² | X3² + Y3² | X4² + Y4² | X5² + Y5² | X6² + Y6² | X7² + Y7² | X8² + Y8² |
Load Z in R2: | Z1 | Z2 | Z3 | Z4 | Z5 | Z6 | Z7 | Z8 |
Square R2: | Z1² | Z2² | Z3² | Z4² | Z5² | Z6² | Z7² | Z8² |
Add R2 to R1: | N1² | N2² | N3² | N4² | N5² | N6² | N7² | N8² |
Square root R1: | N1 | N2 | N3 | N4 | N5 | N6 | N7 | N8 |
That’s pretty much the code you’d have written for the scalar algorithm in assembly. It takes 3 vector loads, 5 arithmetic operations and 1 square root to do the computation on 8 three-dimensional vectors, and it’s likely to be optimal on pretty much every architecture (save for possible use of FMA, when it’s available and its different precision characteristics do not matter). Now, let’s see what kind of algorithm we would need when using the array-of-vector layout:
Load vectors: | X1 | Y1 | Z1 | ## | X2 | Y2 | Z2 | ## |
Set padding to 0: | X1 | Y1 | Z1 | 0 | X2 | Y2 | Z2 | 0 |
Square: | X1² | Y1² | Z1² | 0 | X2² | Y2² | Z2² | 0 |
Add pairwise and pack: | X1² + Y1² | Z1² | X2² + Y2² | Z2² | $$ | $$ | $$ | $$ | ($$ = irrelevant)
Add pairwise and pack: | N1² | N2² | $$ | $$ | $$ | $$ | $$ | $$ |
Square root: | N1 | N2 | $$ | $$ | $$ | $$ | $$ | $$ |
Here, we used 1 vector load, 1 masked operation, 1 arithmetic operation, 2 special-purpose operations, and 1 square root, in order to compute the norm of two vectors. So for 8 vectors, we would need 4 vector loads, 4 masked operations, 4 arithmetic operations, 8 special-purpose operations, and 4 square roots, unless we would be ready to unroll the resulting loop and get clever with the pairwise adds and square roots (it’s quite intuitive that given enough inter-register data shuffling, we can get down to 1 single square root).
Moreover, the special-purpose operations may not be available on all SIMD architectures, and may have much lower throughput than basic arithmetic SIMD operations. On x86, for example, SSE3 does provide a “horizontal add” for floats (`haddps`), but on most microarchitectures it decodes into several shuffle and add micro-operations, so it is hardly faster than building the horizontal reduction yourself out of shuffles and regular adds.
Now, I don’t have very high faith in my SIMD coding skills, and this may not be the best possible algorithm. For example, depending on the architecture, zeroing the vector then doing a masked load may be better than loading and then doing a masked assignment. But you can already see that this approach requires more CPU instructions, depends more on the characteristics of the underlying SIMD instruction set and implementation, and uses a much more situational approach. It is thus much less likely to be automatically generated from a scalar loop “for free” by an optimizing compiler.
If you want to have some fun, you may now ponder how you would implement a vector cross product in the array-of-vector approach, whereas it is again as easy as a scalar implementation in the array-per-coordinate approach…