Why do 64byte cache lines matter?

  1. I understand that on x86_64, we have 16 general purpose 64-bit registers, and 16 xmm registers.

  2. I understand that the L1 cache is 4kb.

Is there some magical "L0 cache"?

If so, what is it / how does it work?

If not, why do we care about 64 byte cache line alignment?

The question here is: cache lines are 64 bytes, but general purpose registers are 64-bit (8 bytes). Where does the remaining 56 bytes go? thrown away (if so, sounds wasteful), cached somewhere? (existence of a 'L0' cache) ?

1 Like

Because it's the smallest unit that can be cached. So it doesn't matter if you acces 1 byte or 64 bytes, if it's a cache hit, they will be found equally fast. But it might matter if you access 64 vs 65 bytes, because the latter might be a lot slower if the 65th byte is not cached – the CPU will now need to pull in an additional cache line worth of data.

So it has nothing to do with how much total cache you have – it has to do with the minimum amount that can be manipulated at once.

The question here is: cache lines are 64 bytes, but general purpose registers are 64-bit (8 bytes). Where does the remaining 56 bytes go?

I don't understand what's being asked here. Why would you subtract the size of a register from the size of a cache line? They are two separate systems.

10 Likes

Cached where? Is there a cache that sits in between L1 cache and registers ?

L1 cache consists of contiguous pages of 4KB right?

So what is the cache that comes in units of 64 bytes?

Let N be a multiple of 4096.

Case A:
read N, N+8, N+16, N+32 into 4 different 64-bit registers.

Case B:
read N, N+64, N+128, N+192 into 4 different 64-bit registers

In both cases we read 8 * 4 = 32 bytes total.

In both cases, all reads are L1 cache hits.

Question: Is A and B equally fast, or is A faster because all 4 reads are from 1 64-byte cache line ? Now, if A is faster, is there some cache that sits between L1 and registers ?

No, L1 and L2 caches on x86 and x86_64 CPUs have cache entries / cache lines of just 64 bytes.

(At least some aarch64 CPUs, like the Apple M1, have 128-byte cache lines.)

Cache lines and pages are not the same thing. Cache lines relate to the physical architecture of the CPU caches, while pages relate to memory mapping and virtual memory.

8 Likes

The cache is larger than 64 bytes, but each individual item in the cache is 64 bytes. Whenever you read something you don't have in the cache, the entire cache line containing that thing is loaded to the cache, meaning that adjacent memory locations are also fast afterwards.

3 Likes

@alice @mbrubeck : caching - Line size of L1 and L2 caches - Stack Overflow agrees with you.

Thanks for the correction, I can't believe for all these years I thought L1/L2/L3 caches were in blocks of 4kb

1 Like

If cache lines / entries are 64 bytes. Then, outside of databases, where we have to flush to disk in 4kb page chunks at a time, is there any advantage to squeezing in-memory data structures into 4kb blocks?

Suppose we have a binary tree:

case A: all nodes are stuffed in some 8MB contiguous block of memory
case B: nodes are scattered all over the address space

Assuming that nothing in case B hits swap space, does case A offer any advantage? (If L1/L2/L3 caches were in blocks of 4KB, then having these nodes clustered together may be useful; but if it's 64 byte entries ...)

Something else I am confused about now is:

  1. for mapping over a sequence, why does Vector beat Linked List
  2. in the context of Entity component system - Wikipedia , why is struct-of-Array beneficial ?

In both cases, if L1/L2/L3 caches were in blocks of 4kb, the explanation would be: for linked list, you're likely getting all cache misses; for Vector, because it is contiguous in memory, after a cache miss, the next batch of reads is going to be in L1 cache.

However, since L1/L2/L3 cache entries are 64 bytes, this argument no longer holds.

Thus, in this world, what is the mechanism that makes Vector better than LinkedList when iterating through a sequence ?

This was due to a silly mis understanding on my part. My (incorrect) mental model was:

register is 8 bytes
L1 cache is blocks of 4kb (incorrect)
but when reading from L1 cache we read in lines of 64 bytes

so when we load a registers, we need 8 bytes, but we read 64 bytes -- where does the remaining 64 - 8 = 56 bytes go ?

Accessing a page for the first time can cause a page fault even if the page is not mapped to a file on disk. For example, a page that was newly requested from the OS by the allocator may be initially mapped to a copy-on-write page of all zeroes. Only when you write to the page does it get copied and mapped to a unique block of physical memory. So, at least when you are first allocating and initializing memory, you will see fewer page faults if you and/or your allocator can minimize the number of pages you write to.

(Also note that databases and swap files are not the only types of file-backed memory. Code and static data inside your program are also mapped from disk.)

When reading from memory, hardware or software prefetching may cause more than one adjacent cache line to be cached at once, which provides additional advantages for accessing data that is adjacent in memory.

In addition to the above effects, Vec<T> can fit multiple adjacent elements in a single cache line as long as T has size and alignment less than 64 (or 128 on some architectures). For example, iterating over a Vec<i32> will cause a cache miss once every 16 elements (compared to LinkedList<i32> which may cause a cache miss on every element).

8 Likes

Adding to this that a CPU can prefetch the next cache line if it suspects you're going to need it.
This feature is not that useful for linked list, but might still prove useful if you use a vector of 64B items.

1 Like

Prefetching is just icing on the cake. The biggest problem of lists are what they do to pipelining.
Once upon time, when computers were big yet had very few transistors execution of command took many, many, CPU ticks.
This is because it's not that easy to execute an instruction: first you need to read it, then you need to decode it, read it's arguments, execute it, then store results.
If you make transistors smaller (and today they are tiny, indeed) you can turn these steps into a separate blocks.
Today there are typically 15-20 such blocks in sequence and they are duplicated 4 times (Pentium 4 tried to make almost 40 blocks but that turned out to be a mistake) — so you may execute about 60-80 (sic!) instructions simultaneously (but only specially crafted programs achieve that in practice). But this only works well with arrays because you can calculate address of 2nd, 3rd, 4th element of array before you finished processing the first element.
With list it's impossible: before you would know where 2nd node of list is placed you have to fully finish working with first one. Instead of 40 instructions you execute 5 or 10 instructions.
Also: while CPU can execute 4 instructions simultaneously usually only one or two of these can access memory. With vector compiler often can optimize execution and skip some read or writes. With list it's usually not possible.

4 Likes

If you are the kind of person who watches videos of people talking about programming topics, and you can suffer through a bit of C++*, I can recommend Mike Acton: "Data Oriented Design and C++" He begins speaking about cache at about the 30 minute mark but the entire thing is worth watching in my opinion.

* You don't need to understand C++, although it probably helps

4 Likes

As I know, it is mostly defined not by CPU caches, but by the memory subsystem: how DDR3/DDR4, external buses, PCIe mapped devices, and memory controller works. Burst read/write into memory with larger blocks has fewer drawbacks, so 64-byte blocks combined with caches look the most effective not blocking memory buses at the same time.

At Apache Arrow we have an allocator that aligns with 64 bytes. I ran

pub fn random_sum(data: &[u64], indices: &[u64]) -> u64 {
    indices.iter().map(|x| data[*x as usize]).sum()
}

where data is allocated aligned (arrow2::buffer::MutableBuffer) and unaligned (std::Vec), and there is no difference between the two.

I also tried

const LANES: usize = 8;

pub fn sum(values: &[u64]) -> u64 {
    let chunks = values.chunks_exact(LANES);
    let remainder = chunks.remainder();

    let sum = chunks.fold([0; LANES], |mut acc, chunk| {
        let chunk: [u64; LANES] = chunk.try_into().unwrap();
        for i in 0..LANES {
            acc[i] += chunk[i];
        }
        acc
    });

    let remainder: u64 = remainder.iter().copied().sum();

    let mut reduced = 0;
    for i in 0..LANES {
        reduced += sum[i];
    }
    reduced + remainder
}

with equivalent results from 2^10 to 2^25 entries.

Yes. Off the top of my head, several reasons the page size matters:

  • There's a less-discussed cache in the CPU called the TLB. It caches the mappings of virtual memory pages into physical memory pages. I've worked on systems where 15+% of the CPU time was wasted waiting for TLB misses. To address that, the fewer pages your program accesses, the better. Pages are often 4 KiB on x86-64. Although Linux can use "huge pages" of 2 MiB or even 1 GiB, and obviously taking advantage of this can be much more effective (reducing page counts 512X or even 262,144X) than just shrinking data structures a little bit.
  • When the kernel has to page memory out or in (to/from SSD/disk or compressed RAM), it does so in units of 4 KiB (maybe more, see the vm.page-cluster sysctl on linux). If the system is under enough memory pressure to do much paging (bad news anyway), poor page-level locality can make it even worse.
  • Similarly, a page is the minimum unit of physical memory allocation. Touching 1 byte in a fresh page costs at least 4 KiB of actual RAM.
  • It's also the minimum granularity for controlling memory protection.
3 Likes

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.