Compiler fence + DMA

I'm trying to follow the discussion in "Pre-Pre-RFC: `core::arch::{load, store}` and stricter volatile semantics" (rust-lang/unsafe-code-guidelines issue #321) from the perspective of a Rust user, not a compiler contributor. And to be honest, it seems to be plainly over my head.

More specifically I gave a concrete example in some comment to that issue. I understand that the issue is not the right place to ask for clarification by someone who doesn't have the necessary compiler background to follow the discussion properly. So I hope for some help from this forum.

Repeating the example I gave there for convenience:

a) allocate some largish zero-copy buffer - possibly on the stack or statically (array of bytes, say a buffer that can hold a full IP packet including driver headroom/tailroom).
b) write to buffer across application code, network layers and libraries (e.g. application, rtos, smoltcp, soc-specific driver).
c) save pointer to the first byte of the buffer cast to u32 to MMIO register (volatile write)
d) wait for DMA to finish (interrupt)
e) deallocate or re-use buffer

And the inverse for an inbound packet:

a) allocate buffer
c) save pointer as u32 to MMIO register (volatile write)
d) wait for DMA to finish
b) parse buffer across application and network layers
e) de-allocate buffer

The typical solution for that in the embedded ecosystem seems to be to place compiler fences or hardware/memory fences around the DMA access to synchronize it with prior/subsequent memory accesses to the DMA buffer.

However, the issue I'm pointing to seems to suggest that this is not the proper way of synchronizing access, at least not after the change proposed in that issue. They seem to propose a different approach with macros, volatile accesses, assembly with proper clobbers, fences, etc.

But the abbreviated form in which those solutions are proposed, assuming compiler and/or language knowledge I don't have, is not accessible to me. Can anyone try to explain this in language that assumes less prior knowledge? I'd love to transform their comments into real code in the context of drivers similar to the ones linked above as examples.

Note: I am aware of and comfortable with the C++20 memory model (Acquire, Release, ...) and I brought up the issue in the first place because the current use of fences seems to be invalid in the cited sources. But I'm not acquainted in detail with the fine semantic details of bringing assembly with specifically crafted "clobbers" and macros into the picture.

UPDATE: I marked the proposed future official Rust approach to DMA as the solution. But be aware that current drivers need to rely on fences, as @diondokter has pointed out. So his answer is to be considered the "unofficial" solution for now.

To make my assumptions more explicit, I'd assume that the following piece of code should do the right thing. Currently it seems to do so in practice, although volatile is not officially related to the memory concurrency model. And in the future, with volatile officially becoming "atomic", it would do so explicitly:

Outbound:

use core::sync::atomic::{compiler_fence, Ordering};
use core::ptr::write_volatile;

// Placeholder MMIO register address (must be 4-byte aligned for a u32 write)
const DMA_START_REGISTER: *mut u32 = 0xDEAD_BEE0 as *mut u32;

fn send_packet(buffer: &mut [u8]) {
    buffer[0] = 0x42;
    buffer[1] = 0x24;
    // ... fill in the buffer

    compiler_fence(Ordering::Release);

    let ptr = buffer.as_ptr() as u32;
    unsafe {
        write_volatile(DMA_START_REGISTER, ptr);
    }

    // DMA hardware begins transferring the buffer...
}

Inbound:

fn receive_packet(buffer: &mut [u8]) {
    let ptr = buffer.as_mut_ptr() as u32;
    unsafe {
        write_volatile(DMA_START_REGISTER, ptr);
    }

    // Wait for DMA completion (interrupt, polling, etc.)...

    compiler_fence(Ordering::Acquire);
    let header = buffer[0];
    let payload = &buffer[1..];
    // process payload...
}

Assuming the MMIO register is ARM strongly ordered memory, a hardware/memory fence or DMB instruction doesn't seem to be required.

I don't see the need for any macros or assembly here, tbh. Nor would I say that references must be avoided here.

In my understanding, working with DMA can be highly platform dependent. You have to consult with your target platform docs to understand what is required for proper memory synchronization. In some cases (e.g. on simple embedded devices) simply observing the interrupt is sufficient to observe the DMA memory, in others you have to execute a memory synchronization instruction, in the worst case you may have to explicitly flush CPU caches (see "noncoherent DMA").

DMA is currently outside of the language model, so your best bet is to perform the necessary synchronization inside an asm! block. It would act as a "black box" for the compiler, and you would fully control the instruction sequence which performs the memory synchronization.

Agreed. But my question is not about the platform specifics but about telling the compiler not to mess with my buffer after I start DMA and before I end DMA. This must be done in a platform-independent way by definition as Rust's abstract machine / memory concurrency model doesn't know anything about the platform.

The details of actual DMA after releasing the buffer or before acquiring it may differ of course and even require platform-specific assembly instructions. That's a black box to the compiler.

You can do it by passing a pointer to the buffer to the asm! block which initiates DMA, or receiving it from the asm! block which handles notification of DMA completion. This is what I meant by "it would act as a 'black box'". Since asm! is opaque to the compiler, it cannot make any assumptions about what happens to the pointed-to memory.

I am not sure whether write_volatile(DMA_START_REGISTER, ptr) is sufficient as a "black box" for ptr.

In the above real-world example with an nRF52840, an asm!() statement is not required. Do you propose to insert a "fake" (empty) asm! statement referring to a pointer to the buffer to make the compiler assume the right thing? That doesn't seem very ergonomic.

Even inserting asm!("dmb") - if at all required - would not require a pointer argument. And of course I'd still like to be able to use PAC abstractions of registers (including those writing pointers) in client code without having to resort to assembly.

Hm, that's the question I'm trying to answer. That's how I understand that RFC: compiler_fence + volatile access will in the future officially provide ordering guarantees, including for "normal" memory accesses prior to (release) or after (acquire) the fence. Assembly would only be required (with or without a reference to DMA memory) if actually required by the platform. See Ralf's response. I'm just seeking confirmation that this understanding is correct with a specific example, as recommended in that thread.

I propose to replace write_volatile with asm! which does the same. On the completion side, if you don't need any hardware memory synchronization, asm! may indeed be empty and act only as a "black box" for the compiler. compiler_fence may be sufficient in the latter case, but personally I am not confident enough to bet on it.

Yes, it's a bit unergonomic, but in my opinion it's currently the most reliable solution until the language model is sufficiently developed to properly and unambiguously cover this area (though I highly doubt that noncoherent DMA will be covered by it in the near/mid-term future). And as an embedded developer, you should not be scared of a tiny asm! snippet.

Nothing prevents you from passing an unused argument to asm! and returning it unmodified. (You may need to use the // {} trick here to silence the compiler complaints.)

Then we have to summon @RalfJung here. :slight_smile:

In a hypothetical future where we define volatile operations to be atomic, a compiler fence is not sufficient for this, but an inline asm block is.

Yeah, that's fine. A release fence also does not require a pointer argument, and the compiler has to make release fences work.

So let's replace the volatile by atomic and ignore the DMA part for a second, to understand the basic pattern we are using:

    buffer[0] = 0x42;
    buffer[1] = 0x24;
    // ... fill in the buffer

    core::sync::atomic::fence(Ordering::Release);

    let ptr = buffer.as_ptr() as u32;
    unsafe {
        relaxed_atomic_write(DMA_START_REGISTER, ptr);
    }

Here, some other thread could spin until it sees the write to DMA_START_REGISTER, do an acquire fence, and subsequently do normal non-atomic loads from the buffer. This is just standard intra-process concurrency, nothing weird here.

If you replace the atomic write by a volatile atomic write, then:

  • the basic pattern also works with observers outside the AM, such as a DMA device
  • however, if your hardware needs a specific instruction for the fence, then you need to replace core::sync::atomic::fence(Ordering::Release) by an inline asm block emitting that instruction. If your hardware needs no instruction, you can use a regular fence or an empty inline asm block.

It is basically never correct to use a compiler_fence; see the docs for the very few cases where that is the operation you want. (Yes that operation is terribly misnamed, I assume the people who added it were not aware of how little it actually does.)

Why does an empty inline asm block work? Well it'd take me hours to write down a full explanation of asm blocks unfortunately, but the short version is that you get to make up a story for what an inline asm block does in terms of Rust. See here for a slightly longer but still terse version. One day I'll have the time to write that blog post...

What is "ARM strongly ordered memory"? Does it guarantee that concurrent interactions on that memory work similar to x86 TSO, without needing any fences?

But then why does the Compiler Explorer example not work? I feel so silly, sorry. I seem to be missing some central point.

Inserting an acquire fence before returning buf[0] makes the compiler correctly re-load the value. At least in current Rust.

Because you told it that nobody else accesses this memory -- no other thread, and no external hardware. That's what &mut means. Mutable references in Rust are unique, no aliases permitted. This includes aliases that exist in DMA devices -- they are forbidden. Therefore, LLVM can assume that the buffer didn't change in your example.

If you use a raw pointer instead (https://godbolt.org/z/YToco4rva), then I see the assembly change. No idea if that's the right change though. :wink:

Inserting an asm!("dmb"); there also has that effect, no? I can't read assembly so I don't know what I am looking at.

But either way, this is an accident, not a guarantee. Any other thread (including hardware via DMA) mutating that memory would race with the buf[0] read, and such data races are UB. So even the raw pointer version could be optimized to re-use the previous read.

Unfortunately, it's not. See the above-mentioned noncoherent DMA. IIUC, core::sync::atomic::fence only works for memory synchronization between CPU cores; synchronization with hardware may require additional instructions.

What about the DMA receiver side? We cannot use fence(Acquire) since there is no fence(Release) paired with it. Am I correct that asm!("// {}", inout(reg) dma_ptr) is the only tool we can use here?

I was talking about my example. My example is standard intra-process concurrency. We have to all understand my example and why it works first before we can talk about the DMA case.

I know. I said exactly that in the thread that has been referenced above. Please read this. :slight_smile:

"For (2), one has to put a suitable fence between the DMA accesses and the final MMIO access. This will, at least for now, require inline assembly: RISC-V, for instance, has separate fences for communicating with regular cached memory and with MMIO memory. The logical justification of the inline assembly block will be that it forks off a thread that asynchronously copies the DMA memory, modeling what the hardware does. So there's no new API required here from our side."

It does not. dma_incorrectly_cached in the godbolt example does that and the compiler still hardcodes the return value of 1:


        movs    r0, #1
        dmb     sy
        dmb     sy
        pop     {r7, pc}

My understanding is it moves a literal 1 to the return register (r0) and then does the 2 dmb's, hence it "caches" the value / hardcodes it. The pop acts as a ret since it pops the program counter.

The Compiler Explorer example doesn't have a dmb before returning buf[0]; it has one before even doing the write. @fg-cfh was talking about some other place where they wanted to put the fence. I think. They just used English instead of a link to an example, so we can only guess.

As Ralf already pointed out, you're missing the part where you tell the device the DMA address in your example. Maybe you can try adding that and see.

You should use asm!("// {}", in(reg) buf.as_mut_ptr());.

That should be similar to asm!("") + write_volatile().

I was looking at the Compiler Explorer example from the GitHub issue; I assumed the one linked here was the same, apologies for my confusion.

That one has the following implementation:

use core::arch::asm;

#[no_mangle]
pub unsafe fn dma_incorrectly_cached(buf: &mut [u8;3]) -> u8 {
    buf[0] = 1;
    asm!("dmb");
    // do DMA
    asm!("dmb");
    buf[0]
}

Even changing it to:

#[no_mangle]
pub unsafe fn dma_incorrectly_cached_many_dmbs(buf: &mut [u8;3]) -> u8 {
    asm!("dmb");
    buf[0] = 1;
    asm!("dmb");
    // do DMA
    asm!("dmb");
    let ret = buf[0];
    asm!("dmb");
    ret
}

does not prevent the compiler from assuming the buffer is unchanged:

        movs    r0, #1
        dmb     sy
        dmb     sy
        dmb     sy
        pop     {r7, pc}

No, that might change the output but doesn't fix anything. As I explained above, if hardware mutated buf[0], there'd be a data race with the buf[0] load at the end of the function, causing UB. Therefore LLVM may assume that buf[0] does not get written to.

And conversely, passing the pointer to the asm is entirely unnecessary when doing everything else right. After all, atomic::fence also does not take a pointer argument.

So, what you suggest is never the right answer.

Yes, you have an &mut reference here that you are not sharing with anyone. Those can't have aliases, so it is impossible for any other thread to read or write this memory while the function runs (or else UB). I explained this above.

Everybody please slow down and remember the basic sources of UB in Rust, which include aliasing violations and data races. Just because you talk to hardware doesn't mean you get out of UB-jail for free.
