How to deal with uncached accesses? (volatile is NOT uncached)

As much as I hate this, most C compilers (including Clang/LLVM) allow "volatile" variables to be cached. Whatever you declare as "volatile" can easily end up in your L1 cache, and when you re-read it, the underlying HW will happily give you the value from L1, even though it might already have changed in external memory (if it's a memory-mapped register, or another core or a DMA controller is writing to it, etc). The only thing that can help in this case is hardware cache coherency machinery, but many embedded systems don't have it, and even when they do, it may not cover DMA and memory-mapped peripherals.

Going back to Rust. What I have is a processor with an L1 data cache, a DMA controller, and some external memory. The DMA controller does not go through the processor's L1 cache. DMA descriptors and some other auxiliary DMA data structures are located in the external memory: written by the processor, read by the DMA controller, and sometimes written by the DMA controller too.

The ISA of my processor has uncached load and uncached store instructions, and also allows me to flush and invalidate cache lines individually. The question I have is how to do this in Rust reliably. Reliability is key. I know many ugly ways to do this in C, and sooner or later most of them caused me, my colleagues, and my clients real-world cache coherency issues which took MONTHS to debug.

I would appreciate it if someone could share the approaches they use to make this reliable. Here are my thoughts:

  1. The problem with flushing/invalidating cache lines is that if data structures are small enough to share the same line, the processor can change one of them while the DMA controller changes the other, and data corruption is inevitable. This would only work for data structures whose size is a multiple of the cache line size, OR for data structures which are properly aligned and padded to ensure that nothing else can end up in the same line (but it won't work for arrays whose layout is dictated by HW). If these constraints can be satisfied, I can create something similar to the read_volatile() and write_volatile() functions from core::ptr. But can I guarantee the constraints at compile time? I don't know.

  2. I can use inline assembly to perform word-sized uncached loads and stores, and read/write larger structures word-by-word (this is slow in terms of HW speed, but at least it would work with uncached arrays). But I need to ensure that such variables are never accidentally cached, because that will also cause data corruption. For example, suppose a u32 at address 0x0 is accessed as uncached and an i32 at address 0x4 is accessed as usual. If you access the i32 first, both addresses 0x0 and 0x4 get into L1 because they belong to the same cache line. Later you do an uncached write of 0xdeadbeef to the u32, so its copy in L1 is no longer coherent, and finally you do a normal write to the i32, making the line dirty. Sooner or later this dirty line will likely get evicted, overwriting your 0xdeadbeef with the initial value of the u32. Perhaps this can be solved by putting all uncached variables in a separate linker section, but what if I forget to do so? This needs to be detectable at compile time too. And what if I access some of the variables in this section the usual way, so they get cached, and I'm back to square one?
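For item 1, one way the alignment and padding constraint could be checked at compile time is sketched below. This is only a sketch under assumptions: the 64-byte line size, the `LineAligned` wrapper, and the `DmaDescriptor` layout are all made up for illustration and are not taken from any real target. It does not help with arrays whose stride is dictated by HW, because wrapping each element changes the stride.

```rust
use core::mem::{align_of, size_of};

/// Assumed L1 data cache line size in bytes (hypothetical; use the target's real value).
pub const CACHE_LINE: usize = 64;

/// Wrapper that places its contents on a cache-line boundary.
/// Rust rounds the size of a type up to a multiple of its alignment,
/// so the wrapper also occupies whole lines and nothing else can
/// share its last line.
#[repr(C, align(64))]
pub struct LineAligned<T>(pub T);

impl<T> LineAligned<T> {
    /// Compile-time check, evaluated for each concrete `T` that calls `new`.
    /// `repr(align)` only accepts a literal, so this mainly catches the
    /// literal above drifting away from `CACHE_LINE`.
    const LAYOUT_OK: () = {
        assert!(align_of::<Self>() % CACHE_LINE == 0);
        assert!(size_of::<Self>() % CACHE_LINE == 0);
    };

    pub const fn new(value: T) -> Self {
        // Referencing the constant forces the assertions to be evaluated.
        let () = Self::LAYOUT_OK;
        LineAligned(value)
    }
}

/// Hypothetical descriptor layout, for illustration only.
#[repr(C)]
#[derive(Default, Clone, Copy)]
pub struct DmaDescriptor {
    pub src: u32,
    pub dst: u32,
    pub len: u32,
    pub flags: u32,
}
```

With this, a `LineAligned<DmaDescriptor>` can be flushed or invalidated as a unit without touching a neighbouring object, at the cost of padding every descriptor out to a full line.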

I need a reliable solution for at least Item 2, and ideally for both 1 and 2. For now my only option is to disable D-cache :smiley:

Thanks

If C's volatile reads aren't good enough, then Rust's won't be either, because while the exact semantics are still up in the air, they will likely be the same as C11's.

You might be able to make atomics work: among other things, they have global memory ordering and visibility constraints, meaning that anything operated on atomically can't go into any per-core cache. They're not necessarily truly uncached though, as storing them in a shared cache (L3 in most modern CPUs) is fine, and under certain circumstances they can be elided[1].

If that isn't good enough, writing a type-safe wrapper around inline assembly is the only way to go.


[1] If the compiler can prove that any atomic effects are entirely local, they can be elided, but given the way atomics are typically used, that's very rarely the case.

To bypass the cache you need platform-specific mechanisms such as marking the physical memory access ranges as uncached or using cache-bypassing instructions.

> The ISA of my processor has uncached load and uncached store instructions,

You'll have to write wrappers for those.
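A rough sketch of what such a wrapper might look like is below. Everything in it is hypothetical: `uncached_load_u32`, `uncached_store_u32`, and `UncachedWords` are names invented here, and the two primitives use volatile accesses purely as a host-side stand-in. On the real target their bodies would be `core::arch::asm!` blocks emitting the ISA's uncached load and store.

```rust
use core::cell::UnsafeCell;
use core::ptr;

/// Hypothetical hook for the ISA's uncached word load.
/// ASSUMPTION: the volatile read is only a stand-in so the sketch runs on a
/// host; it does NOT bypass the cache. Replace the body with inline assembly.
#[inline(always)]
unsafe fn uncached_load_u32(addr: *const u32) -> u32 {
    unsafe { ptr::read_volatile(addr) }
}

/// Hypothetical hook for the ISA's uncached word store (same caveat as above).
#[inline(always)]
unsafe fn uncached_store_u32(addr: *mut u32, value: u32) {
    unsafe { ptr::write_volatile(addr, value) }
}

/// A block of `N` words whose storage is private, so code outside this
/// module can only reach it through the uncached primitives.
#[repr(transparent)]
pub struct UncachedWords<const N: usize> {
    words: UnsafeCell<[u32; N]>,
}

impl<const N: usize> UncachedWords<N> {
    /// Note: the initial value is written with ordinary stores when this
    /// runs at run time, so it suits statics or pre-DMA setup only.
    pub const fn new(init: [u32; N]) -> Self {
        Self { words: UnsafeCell::new(init) }
    }

    /// Uncached read of one word; panics if `index` is out of range.
    pub fn read(&self, index: usize) -> u32 {
        assert!(index < N);
        unsafe { uncached_load_u32((self.words.get() as *const u32).add(index)) }
    }

    /// Uncached write of one word; panics if `index` is out of range.
    pub fn write(&self, index: usize, value: u32) {
        assert!(index < N);
        unsafe { uncached_store_u32((self.words.get() as *mut u32).add(index), value) }
    }

    /// Reads the whole block word by word.
    pub fn read_all(&self) -> [u32; N] {
        let mut out = [0u32; N];
        for (i, slot) in out.iter_mut().enumerate() {
            *slot = self.read(i);
        }
        out
    }
}
```

Because the field is private and there is no method handing out a `&[u32; N]`, ordinary cached loads and stores to the contents cannot be written by accident outside the module. It does not by itself stop a cached neighbour from sharing the line; for that the value would still need to live in a dedicated region, for example a `static` with a `#[link_section = "..."]` attribute (which would also need an `unsafe impl Sync`).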

Volatile only promises that the compiler won't elide or reorder the access. It doesn't say anything about what the hardware will do, so yeah, volatile accesses are generally pretty useless here. Atomics are better, but they only promise visibility with respect to other accesses by your program, not what the hardware sees.
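To make the compiler-side guarantee concrete, here is a small sketch using `core::ptr::read_volatile`. The `poll_done` helper and its parameters are invented for illustration. The point is that every loop iteration is guaranteed to issue a load, yet nothing here decides whether that load is served from L1 or from external memory.

```rust
use core::ptr;

/// Polls `*status` up to `max_polls` times and reports whether `done_bit`
/// was ever observed set.
///
/// # Safety
/// `status` must be valid for reads and properly aligned.
pub unsafe fn poll_done(status: *const u32, done_bit: u32, max_polls: usize) -> bool {
    for _ in 0..max_polls {
        // The compiler must emit one load per iteration and may not hoist it
        // out of the loop. Where the hardware gets the value from (cache or
        // memory) is outside what volatile controls.
        if unsafe { ptr::read_volatile(status) } & done_bit != 0 {
            return true;
        }
    }
    false
}
```

On a target without coherent DMA, a loop like this can spin on a stale cached copy, which is exactly the failure mode described in the question.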

The closest language-provided options to what you're asking for are either assembly, as mentioned, or possibly intrinsics or arch intrinsics (more likely the latter).

That's not actually true. Atomics often do move data into the L1 cache of whatever core is running the atomic operation. Atomics don't work by preventing data from being cached, but by ensuring that no other core can read or write the same location while the atomic operation is executing. This is often done by moving the memory the variable lives in into the L1 cache and making other cores wait until the operation completes before the memory can move to another core's L1 cache.

The above applies to atomic operations that write to memory. Atomic loads usually copy the variable into the L1 caches of all cores that access it, so all cores can read it quickly without having to go to other cores or caches. They can get away with that because the cache line is marked as being in the shared state, and any core that attempts to write must first get all other cores to invalidate their copies.
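As a small illustration of that point, the sketch below (the `publish_and_consume` function is made up for this post, and it uses `std` threads rather than anything embedded) passes a value between two threads with a Release store and an Acquire load. It works because the cores' caches are kept coherent with each other, not because anything bypasses the cache, so a DMA controller sitting outside the coherency domain gets no benefit from it.

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};
use std::thread;

/// One thread publishes a value, another waits for it and returns it.
pub fn publish_and_consume() -> u32 {
    let data = AtomicU32::new(0);
    let ready = AtomicBool::new(false);

    thread::scope(|s| {
        // Producer: write the payload, then publish it with a Release store.
        s.spawn(|| {
            data.store(42, Ordering::Relaxed);
            ready.store(true, Ordering::Release);
        });

        // Consumer: spin on an Acquire load; once it observes `true`, the
        // earlier store to `data` is guaranteed to be visible as well.
        let consumer = s.spawn(|| {
            while !ready.load(Ordering::Acquire) {
                std::hint::spin_loop();
            }
            data.load(Ordering::Relaxed)
        });

        consumer.join().unwrap()
    })
}
```

Both `data` and `ready` are free to sit in each core's L1 the whole time; the ordering guarantee comes from the coherency protocol between cores.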