Ah yes.
That Intel loop-blocking discussion is perhaps an extreme case.
Typically, if you want to process the elements of a big 2D array, it's better to work through the rows and columns in such a way that you are always stepping up in memory address, so your cache always has the data you need on hand.
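To make that concrete, here is a minimal sketch of the two traversal orders in C. It assumes the array is stored row-major in one flat buffer; the function names are made up for illustration.

```c
#include <stddef.h>

/* Row-major walk: the inner loop runs along a row, so each step moves
   to the next address in memory and every fetched cache line is used
   in full. */
long sum_row_major(const int *a, size_t rows, size_t cols) {
    long total = 0;
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            total += a[i * cols + j];
    return total;
}

/* Column-first walk: the inner loop runs down a column, so each step
   jumps cols elements ahead. Same result, but on a big array most
   accesses land on a different cache line. */
long sum_column_first(const int *a, size_t rows, size_t cols) {
    long total = 0;
    for (size_t j = 0; j < cols; j++)
        for (size_t i = 0; i < rows; i++)
            total += a[i * cols + j];
    return total;
}
```

Both functions return the same sum. Any speed difference only shows up once the array is much bigger than the cache, and how big the difference is depends on the machine.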
But that Intel example is sneaky. It accesses one array with (i, j) and the other with (j, i).
Clearly walking up memory for one is optimal, but it is a disaster for the other as it has to jump around memory trying to access the correct row and column.
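The access pattern looks roughly like this. This is my own sketch in the spirit of that example, not Intel's code; the function name and the flat row-major n x n layout are assumptions.

```c
#include <stddef.h>

/* Adds the transpose of b into a. Both are n x n, row-major.
   a is walked in address order (i, j), while b is read as (j, i),
   so consecutive reads of b are n elements apart. */
void add_transposed_naive(double *a, const double *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            a[i * n + j] += b[j * n + i];
}
```

Swapping the two loops does not help: it just moves the strided access from b to a.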
Loop-blocking is then a compromise to try and keep the caches useful for accessing both arrays at the same time. One has to tune the block size to suit one's cache size.
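A blocked version of that kind of loop might look like the sketch below. Again this is an illustration under my own assumptions (flat row-major n x n arrays, a hypothetical function name, and a `block` parameter you would tune by measurement), not a definitive implementation.

```c
#include <stddef.h>

/* Adds the transpose of b into a, one block x block tile at a time.
   Within a tile, the rows of a and the rows of b that get touched are
   few enough to stay in cache together if block is chosen well. */
void add_transposed_blocked(double *a, const double *b,
                            size_t n, size_t block) {
    if (block == 0)
        block = 1; /* guard: a zero step would never advance */

    for (size_t i = 0; i < n; i += block) {
        /* Clamp the tile edge so n need not be a multiple of block. */
        size_t i_end = (i + block < n) ? i + block : n;
        for (size_t j = 0; j < n; j += block) {
            size_t j_end = (j + block < n) ? j + block : n;
            for (size_t ii = i; ii < i_end; ii++)
                for (size_t jj = j; jj < j_end; jj++)
                    a[ii * n + jj] += b[jj * n + ii];
        }
    }
}
```

The result is identical to the plain double loop; only the order of the accesses changes. The right `block` depends on the cache size and element size, so it is best found by timing a few values on the target machine.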
Such loop blocking is likely very sub-optimal for traversing a single array, or even two arrays in the normal order, since a plain sequential walk already uses every cache line it fetches.