How to correctly use asm! as memory barrier (LLVM question)

This is in some respects a follow-up to Compiler fence + DMA - just focused on the actual inner workings of asm!(). To strengthen my intuition about asm! as a memory barrier, I kept playing around with it for a while.

It seems that the analogy to atomic fences from the cited topic cannot fully explain the behavior of the following sample program - see the inline comments:

#![no_std]
use core::arch::asm;

#[unsafe(no_mangle)]
pub unsafe fn do_dma(dma_ptr: *mut *mut u32, dma_start_cmd: *mut bool, dma_done_ptr: *const bool) {
    let mut buf: [u8;3] = [0;3];
    buf[0] = 42; // this is eliminated unless one of the fixes is implemented

    pass_buf_to_hardware(dma_ptr, dma_start_cmd, &mut buf);
    wait_for_hardware(dma_done_ptr);
    
    // lifetime of buffer ends here in LLVM IR - no matter the position or variant of the asm! statements.
}

#[unsafe(no_mangle)]
unsafe fn pass_buf_to_hardware(dma_ptr: *mut *mut u32, dma_start_cmd: *mut bool, buf_ptr: *mut [u8]) {
    asm!(""); // doesn't work - eliminates array initialization despite asm "~memory" clobber
    // asm!("/* {} */", in(reg) buf_ptr.cast::<*mut u32>()); // potential fix #1
    dma_ptr.write_volatile(buf_ptr.cast()); // Passes a ptr to an uninit array unless one of the fixes is implemented.
    // asm!(""); // potential fix #2
    dma_start_cmd.write_volatile(true);
}

#[unsafe(no_mangle)]
#[inline(never)]
unsafe fn wait_for_hardware(dma_done_ptr: *const bool) {
    while !dma_done_ptr.read_volatile() {}
    // asm!(""); // potential fix #3
}

In an attempt to understand what's going on, I was looking at the LLVM IR. Result: Any empty asm! boils down to asm sideeffect alignstack inteldialect "", "~{dirflag},~{fpsr},~{flags},~{memory}"() - no matter where I place it.

Here are my specific questions:

  • Why does LLVM assume that the contents of the stack-allocated buffer will not be accessed despite an asm ... ~memory statement placed after it? According to the LLVM docs, ~memory should force LLVM to assume that any memory could be accessed in the asm! block. This is more than just placing a barrier against reordering optimizations. Shouldn't it force LLVM to assume that the stack-allocated, initialized buffer could also be read?
  • Fix #1: Why does an additional in operand to the asm! statement in the exact same location, carrying just the pointer to (but not the contents of!) the buffer, make a difference?
  • Fix #2: Why does a second empty asm!("") statement placed anywhere after the pointer to (not the contents of) the buffer has escaped via a volatile write make a difference? Shouldn't the ~memory clobber behave the same no matter where it is placed, as long as it comes anywhere after the initialization of the buffer?
  • Fix #3: How can an asm! statement in a different function spookily influence optimization of the parent function without any obvious change to the generated IR of the sub-function? Note: No inlining - confirmed in IR and assembly.

Obviously my understanding is still incomplete. Can anyone help me understand this?


@newpavlov @boqun @alice - might be of interest to you... Or would this be better asked on some LLVM forum?

Fundamentally, we cannot analyze your code without also understanding what the hardware does. Based on your code, I believe that the hardware behaves like another OS thread that has this code:

// hardware
while !dma_start_cmd.read_volatile() {}
asm!(""); // acquire fence

let buf = dma_ptr.read_volatile();

// Uses a mutable reference since the hardware has
// exclusive access to the buffer right now.
hardware_uses_buffer(&mut *buf);

asm!(""); // release fence
dma_done_ptr.write_volatile(true);

For it to be correct that the hardware behaves like this, there are a few things that must be true:

  • The dma_ptr.read_volatile() call must read the value written by dma_ptr.write_volatile() in pass_buf_to_hardware.
  • The buf[0] = 42 write must happen before the hardware_uses_buffer(&mut buf) call.
  • The hardware_uses_buffer(&mut buf) call must happen before buf goes out of scope.

If any of these three things is violated, then the access in the other thread is illegal.

To write code that satisfies all three requirements, you must change your code to this:

#![no_std]
use core::arch::asm;

#[unsafe(no_mangle)]
pub unsafe fn do_dma(dma_ptr: *mut *mut u32, dma_start_cmd: *mut bool, dma_done_ptr: *const bool) {
    let mut buf: [u8;3] = [0;3];
    buf[0] = 42;

    pass_buf_to_hardware(dma_ptr, dma_start_cmd, &mut buf);
    wait_for_hardware(dma_done_ptr);
    
    // lifetime of buffer ends here
}

#[unsafe(no_mangle)]
unsafe fn pass_buf_to_hardware(dma_ptr: *mut *mut u32, dma_start_cmd: *mut bool, buf_ptr: *mut [u8]) {
    dma_ptr.write_volatile(buf_ptr.cast());
    asm!(""); // release fence
    dma_start_cmd.write_volatile(true);
}

#[unsafe(no_mangle)]
#[inline(never)]
unsafe fn wait_for_hardware(dma_done_ptr: *const bool) {
    while !dma_done_ptr.read_volatile() {}
    asm!(""); // acquire fence
}

Note that I applied both fixes #2 and #3, but removed the original asm!("") statement.

To see why this satisfies all three requirements, we consider each one:

  • dma_start_cmd.read_volatile() reads the value written by dma_start_cmd.write_volatile(). Therefore, everything that happens before the release fence is visible to operations that come after the acquire fence. The dma_ptr.write_volatile(...) operation comes before the release fence, and dma_ptr.read_volatile() comes after the acquire fence, so the read is guaranteed to see the value from the write.
  • dma_start_cmd.read_volatile() reads the value written by dma_start_cmd.write_volatile(). Therefore, everything that happens before the release fence is visible to operations that come after the acquire fence. The buf[0] = 42 operation comes before the release fence, and hardware_uses_buffer() comes after the acquire fence, so hardware_uses_buffer() is guaranteed to see the value from the write.
  • dma_done_ptr.read_volatile() reads the value written by dma_done_ptr.write_volatile(). Therefore, everything that comes before the release fence is visible to operations that come after the acquire fence. The hardware_uses_buffer() operation comes before the release fence, and buf going out of scope is after the acquire fence, so everything hardware_uses_buffer() did is visible when buf goes out of scope.

No. When it comes to memory that ordinary Rust code could access, inline asm is only allowed to do things that ordinary Rust code could do. It's impossible to replace the asm!("") with a piece of Rust code that modifies buf, so asm isn't allowed to do so either.

(Unless your piece of Rust code uses buf_ptr, but since it's not an input to the asm block, you're not allowed to use it in the code you replace the asm block with.)

Once you add buf_ptr as an input to the asm block, it becomes possible to replace it with a piece of Rust code that accesses buf. Therefore, the asm is allowed to do so too.

But this is not the correct fix, because it just allows you to access buf during that specific asm block. What we actually want is for the hardware to access it between dma_start_cmd and wait_for_hardware, rather than during the asm block. This fix alone is not enough, as you still violate the third requirement: everything the hardware does must come before buf goes out of scope.
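(For illustration, a minimal sketch of the pattern fix #1 relies on - the helper name is made up, not part of the original example:)

use core::arch::asm;

// Passing the pointer as an operand makes `buf` reachable from the asm
// block, so its initialization is no longer a dead store - but only for
// the duration of this one block.
unsafe fn touch_buffer(buf: *mut u8) {
    asm!("/* {0} */", in(reg) buf);
}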

No, it's different. After the volatile write, it's possible that buf_ptr is now stored in some sort of global variable somewhere, and the memory clobber allows the asm!("") block to access such global variables. Therefore, it's possible that the asm itself could access buf.

But it's not the correct fix because this also just allows the asm block itself to access buf. The actual accesses happen between dma_start_cmd and exiting the dma_done_ptr loop, and it's still not legal for buf to be accessed by another thread between those two commands because the third requirement is still violated.
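(Again as a sketch with hypothetical names: after the volatile write, buf must be assumed reachable from "somewhere", so a later memory clobber may observe it.)

// The pointer escapes via the volatile write; the subsequent empty asm!
// with its implicit ~memory clobber may then read or write through it.
unsafe fn escape_then_clobber(slot: *mut *mut u8, buf: *mut u8) {
    slot.write_volatile(buf); // `buf` escapes here
    core::arch::asm!("");     // clobber may now access `*buf`
}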

This is almost the correct fix, but it's not quite enough on its own. With just this fix, I would say that we violate the first requirement. The write of buf_ptr no longer comes before the release fence, so we cannot argue that the write is guaranteed visible when the hardware reads dma_ptr. It's possible that the volatile read returns whatever pointer dma_ptr happened to hold prior to the write of buf_ptr.

As for how it can influence the parent function: I guess LLVM understands that without the asm!("") statement, the function accesses nothing but the provided pointer. When the asm!("") block is added, the function might also access anything the memory clobber permits, such as global variables; since buf_ptr might have escaped to a global variable, the inline asm itself could now access buf through that global variable.


@alice Thanks for writing up this detailed answer. I really appreciate that you spent so much time and effort. I also follow your argument 100% from the abstract perspective in Rust. Assuming that asm! conceptually behaved like an AcqRel compiler memory barrier, your argument is exactly what I'd expect - in line with the cited original topic.

This question was not about the expected semantics of the barrier in Rust, though, but exclusively about the validity of its current implementation in LLVM. I'm questioning whether the LLVM asm ... ~memory instruction correctly implements the release side of the fence, due to its behavior with respect to dead-store elimination.

Most notably, if the LLVM asm instruction were an adequate implementation of a release fence, and assuming that hardware behavior should be conceived the way you describe it (i.e. as a "black box" transition on the Rust AM, similar to inline assembly or an FFI call), there should be no need for an additional acquire fence in the source thread.

Note that in my example I'm only trying to pass a synchronized message from the source thread (release) to the imagined hardware target thread (acquire) but not the other way round.

The question is in essence: Why does asm ... ~memory not keep the compiler from eliminating a write to live memory?

The reason may be that given ~memory, LLVM only assumes arbitrary writes to memory, in which case eliminating a prior write would be justified:

a clobber string of "~{memory}" indicates that the assembly writes to arbitrary undeclared memory locations

But then asm! would be too weak to serve as a proper release barrier, IMHO. It would have to assume arbitrary reads after releasing prior writes. If it did so, it wouldn't be able to eliminate prior writes. This also explains why it works correctly on the acquire side - but only there.

GCC includes reads in the definition of the memory clobber btw:

The "memory" clobber tells the compiler that the assembly code performs memory reads or writes to items other than those listed in the input and output operands.

Unlike asm!, the off-label use of a Rust compiler fence behaves correctly on the release side. I haven't tested yet what it translates to in the IR. But in any case this justifies the recommendation from the Rust embedded team to use a compiler fence rather than asm! for the time being when preparing for DMA.

Update: Rust compiler fences translate to a fence instruction in LLVM. These are definitely not specified to be used stand-alone, but they work in practice, i.e. "off label", until we have a better official solution.
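(For reference, a minimal sketch of the compiler-fence variant of the example above - same signatures as before, with the orderings chosen to mirror the release/acquire pairing:)

use core::sync::atomic::{compiler_fence, Ordering};

unsafe fn pass_buf_to_hardware(dma_ptr: *mut *mut u32, dma_start_cmd: *mut bool, buf_ptr: *mut [u8]) {
    dma_ptr.write_volatile(buf_ptr.cast());
    compiler_fence(Ordering::Release); // keep the buffer writes before the start command
    dma_start_cmd.write_volatile(true);
}

unsafe fn wait_for_hardware(dma_done_ptr: *const bool) {
    while !dma_done_ptr.read_volatile() {}
    compiler_fence(Ordering::Acquire); // keep the buffer's end of life after completion
}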

@RalfJung I'm throwing around with terms here (Rust AM, ...) that I learned only very recently from the opsem threads. Would you mind checking that what I'm claiming here is correct from the formal specification side at least?

Please explain why you think that you don't need an acquire fence in the source thread.

There needs to be synchronization between the hardware using the memory and the memory being freed.

I think I simply reached the same conclusion as Amanieu in Does the concept of a compiler fence make any sense? · Issue #347 · rust-lang/unsafe-code-guidelines · GitHub.

I'd not say this is an issue with LLVM, though; it simply works as advertised, assuming that LLVM assumes only writes for its asm ~memory instruction.

It only means that an empty asm! is not a valid implementation of a release fence, unless LLVM can be convinced to change its definition of the ~memory clobber to assume arbitrary reads as well (as GCC does).

Sure, I'll try to make this more explicit:

The reason is that I'm only releasing memory from the source thread into the "virtual" DMA thread which then somehow needs to acquire it.

How exactly the DMA "thread" acquires memory in practice cannot and doesn't have to be known to Rust/the compiler. This is entirely platform-dependent. In ARMv7-M, for example, the DMA pointer register will typically be implemented as strongly ordered memory. If it is not, you'd need to emit a dmb instruction before starting DMA, as sketched below.
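(A minimal sketch, assuming ARMv7-M; dmb is the real instruction, the wrapper name is hypothetical:)

#[cfg(target_arch = "arm")]
unsafe fn dma_publish_barrier() {
    // Data memory barrier: make the buffer writes observable to the DMA
    // engine before the subsequent start command.
    core::arch::asm!("dmb sy");
}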

Note, though, that my argument nowhere depends on these details. The Rust/LLVM side only needs to assume that the DMA thread will synchronize with the fence and acquire memory "somehow" to do the right thing. More specifically: It needs to assume that the DMA thread may observe arbitrary writes to memory that "happened before" the fence, i.e. that were released by the fence.

In the specific example I gave above, this is the end of the story: I'm not intending to read the contents of the DMA buffer back into the source thread. Therefore the "virtual" DMA thread doesn't have to release it and the source thread doesn't have to re-acquire it.

I agree, though, that if I wanted to read something back from that DMA buffer in the source thread, then I'd have to re-acquire its memory and synchronize with some imagined release fence on the DMA side which again might have to be implemented in a platform specific way, e.g. by another dmb instruction or whatever.

This reasoning is a bit off topic, though, as my argument works without referring to any Rust-specific memory or synchronization model. I'm arguing exclusively inside LLVM IR.

I don't really know what you mean by this. If it can write, then it can also read.

This seems to be the incorrect assumption. Freeing memory counts as a write when it comes to data races, so the hardware must release the buffer back to the source thread before the source thread frees it by letting it go out of scope.

Whereas in the example linked by Amanieu, the function ends with a loop {} statement that prevents the variable from ever going out of scope, so this is unnecessary in that example.
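(A sketch of that pattern with made-up names, assuming the same hardware model as above:)

unsafe fn do_dma_forever(dma_ptr: *mut *mut u32, dma_start_cmd: *mut bool) {
    let mut buf: [u8; 3] = [42, 0, 0];
    dma_ptr.write_volatile(buf.as_mut_ptr().cast());
    core::arch::asm!(""); // release fence after the pointer has escaped
    dma_start_cmd.write_volatile(true);
    loop {} // buf never goes out of scope, so the third requirement is vacuous
}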


Because asm blocks may only access memory that Rust code invoked in the same place would be able to access. If you write:

let mut x = 5;
func();
assert_eq!(x, 5);

then you know the assertion can never fail. LLVM knows that, too, and will optimize accordingly. And this remains true if you replace func() with asm!(""). Asm blocks don't completely bypass the language; this is crucial, since Rust code that uses asm anywhere would otherwise be completely impossible to optimize.

Again, let's consider what happens with regular Rust code:

let mut x = 5;
func(&raw mut x);
assert_eq!(x, 5);

Can the assertion fail now? Yes! The exact same reasoning applies to asm blocks. LLVM understands this of course, and so it doesn't optimize away the comparison any more.
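(The asm analogue, as a sketch: with the pointer passed as an operand, LLVM must assume the block may write through it.)

let mut x = 5;
unsafe { core::arch::asm!("/* {0} */", in(reg) &raw mut x) };
// The empty template doesn't actually touch x, but LLVM can no longer
// prove that, so the comparison below is not optimized away.
assert_eq!(x, 5);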

Even without inlining, LLVM will analyze all functions and propagate that information upwards to callers of those functions. So it will know that your function doesn't do any writes and also doesn't do any synchronization, and then it can optimize the caller based on that.

If you want to ensure that LLVM truly does not know what happens in the callee, you need to do something like this:

fn test(func: fn(*mut i32)) {
    let mut x = 5;
    func(&raw mut x); // a truly unknowable function call
    assert_eq!(x, 5);
}

You've said, asked, and claimed more things than I have the time to check, sorry. Do you have a concrete self-contained question?

Exactly. As I keep saying, interpreting the behavior of LLVM on concrete examples requires knowing exactly what you are doing. All details matter.


This makes a lot of sense. Thanks for pointing that out.

No, all my questions and doubts have been clarified by your answer and that of @alice.

Also thanks for addressing the original issue in Potentially-observable store gets elided: asm block does not act as a compiler fence · Issue #144351 · rust-lang/rust · GitHub.

I've learned a lot over the last few days thanks to this community's impressive commitment and patience. Thank you all very much. Since these threads are public, I hope others will benefit from them, too.

It's a pity that I can only mark a single response as the solution. If I could, I'd choose several. I'll take Ralf's, though, as it addresses the original question most directly IMO.


Happy that we got to the bottom of this. :slight_smile:

Now we just need to figure out how to improve our docs so that the next person with the same question doesn't need multiple people conversing over days to have their questions answered.^^


This thread is an important piece of documentation that we all just contributed to the Rust community, don't you think? I propose you point the next person here. But more probably she'll find it herself through search. :slight_smile: So: well done. The investment will pay off many times over.

Additionally:

  • The questions around DMA would probably warrant a dedicated chapter in the Rust Embedded Book. Examples of how to use compiler fences to synchronize with interrupt handlers would probably be helpful, too.
  • Atomic synchronization in general is documented in the Nomicon, but I think questions around fences, especially examples and the proper use of the different kinds of fences (atomic fence, "compiler" fence, and asm!), could be covered in more detail there. Documentation and examples around footguns like "implicit dropping requires re-acquisition of memory", as well as some of the specific synchronization patterns explored here, could be added as well.
  • Documentation of asm! & friends in the Rust Reference is already very good. A paragraph about "asm! as a fence" could maybe be added there, in the Nomicon, or in the API docs, though. The examples discussed here could also be added to "Rust by Example"; they are not yet there, although the API documentation of asm! points to it as its main documentation.

I think it would already be a huge benefit if someone walked through the two threads we created and just moved the examples and discussion over to the corresponding docs.

I'd love to contribute that, but I think someone else with a bit more experience in Rust would do a much better job with much less effort. My contribution is currently limited to asking the dumb questions others also have in mind but don't dare to ask. :wink: But if you tell me where to file the corresponding documentation issues, I'd happily do that.


It is a somewhat long and confusing pair of threads. It would be good to write down a summary of some sort. Perhaps a blog post or similar would be a good start - something you could draft at this point; I think you are underselling your newly gained knowledge.


FWIW: DMA - The Embedonomicon and the embedded_dma crate.

But also see The proposed DMA API is unsafe · Issue #64 · rust-embedded/embedonomicon · GitHub.
