I'm interested in data-oriented design and I'm devoting a lot of the time I spend on personal projects to micro-optimization.
Something I'm working on now requires pulling 4 chunks of a chunked 2D grid into the cache at a time. Each chunk is bigger than a single cache line, but if I can prefetch the entire contents of all 4 chunks, I don't expect any cache misses in the operations that work on elements within that subgrid.
I see some interesting intrinsics in nightly, though the documentation isn't yet complete enough for a non-expert to get started.
What I most want is to hand over a range of memory and have it fetched all at once. Memory is laid out linearly within each chunk, but not necessarily from one chunk to the next, so I would like to provide one range per chunk as I glide across the overall 2D grid, working on one 4-chunk block at a time.
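To make the intent concrete, here is a rough sketch of the kind of thing I have in mind. It is only an illustration under several assumptions: it targets x86_64, it assumes 64-byte cache lines, and it uses the stable `_mm_prefetch` from `std::arch` rather than the nightly intrinsics. The names `CACHE_LINE`, `lines_spanned`, `prefetch_slice`, and `prefetch_block` are made up for this example. As far as I can tell there is no single "prefetch this range" call, so the sketch issues one hint per cache line.

```rust
/// Assumed cache-line size in bytes; 64 is typical on x86_64 but not guaranteed.
const CACHE_LINE: usize = 64;

/// Number of cache lines overlapped by `len` bytes starting at address `addr`.
fn lines_spanned(addr: usize, len: usize) -> usize {
    if len == 0 {
        return 0;
    }
    let first = addr / CACHE_LINE;
    let last = (addr + len - 1) / CACHE_LINE;
    last - first + 1
}

/// Hint that every cache line covering `chunk` will be read soon.
/// Returns the number of cache lines covered. Hints are only issued on
/// x86_64, and they are advisory: the CPU is free to ignore them.
fn prefetch_slice<T>(chunk: &[T]) -> usize {
    let addr = chunk.as_ptr() as usize;
    let len = std::mem::size_of_val(chunk);
    let n = lines_spanned(addr, len);

    #[cfg(target_arch = "x86_64")]
    {
        use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
        let p = chunk.as_ptr().cast::<i8>();
        for i in 0..n {
            // `p + i * CACHE_LINE` always falls in the i-th covered line.
            // The pointer is only used as a hint and is never dereferenced.
            unsafe { _mm_prefetch::<_MM_HINT_T0>(p.wrapping_add(i * CACHE_LINE)) };
        }
    }

    n
}

/// Prefetch the four chunks of one block; returns the total line count.
fn prefetch_block<T>(chunks: [&[T]; 4]) -> usize {
    chunks.iter().map(|c| prefetch_slice(c)).sum()
}
```

The idea would be to call `prefetch_block` on the next block while still working on the current one, so the loads have time to complete. Whether that actually helps, or whether the hardware prefetcher already handles the linear access within each chunk, is something I would have to measure.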
I welcome answers to my particular question, suggestions that I'm going about this the wrong way entirely, or general books I should read for more background in this area.