[AArch64, Cortex-A72] core::sync::atomic::AtomicU64 functions cause infinite loop

My bare-metal AArch64 application uses a global static variable

static MYSTATIC: AtomicBitmap64 = AtomicBitmap64::new();

where AtomicBitmap64 is

mod mymod {
    use core::sync::atomic::{AtomicU64, Ordering};

    #[repr(transparent)]
    pub struct AtomicBitmap64(AtomicU64);

    impl AtomicBitmap64 {
        pub const fn new() -> Self {
            AtomicBitmap64(AtomicU64::new(0))
        }

        pub fn set_bit(&self, n: u8) -> bool {
            let b = 1u64 << n;
            self.0.fetch_or(b, Ordering::AcqRel) & b != 0
        }

        pub fn clear_bit(&self, n: u8) -> bool {
            let b = 1u64 << n;
            self.0.fetch_and(!b, Ordering::AcqRel) & b != 0
        }

        pub fn invert_bit(&self, n: u8) -> bool {
            let b = 1u64 << n;
            self.0.fetch_xor(b, Ordering::AcqRel) & b != 0
        }
    }
}
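For comparison, here is a minimal sketch of how this wrapper behaves on a hosted target where exclusive accesses work. The type is repeated so the snippet compiles on its own; the `example` function and the bit numbers are only illustrative and are not part of the original application.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

#[repr(transparent)]
pub struct AtomicBitmap64(AtomicU64);

impl AtomicBitmap64 {
    pub const fn new() -> Self {
        AtomicBitmap64(AtomicU64::new(0))
    }

    /// Sets bit `n` and returns whether it was already set.
    pub fn set_bit(&self, n: u8) -> bool {
        let b = 1u64 << n;
        self.0.fetch_or(b, Ordering::AcqRel) & b != 0
    }

    /// Clears bit `n` and returns whether it was set before.
    pub fn clear_bit(&self, n: u8) -> bool {
        let b = 1u64 << n;
        self.0.fetch_and(!b, Ordering::AcqRel) & b != 0
    }

    /// Flips bit `n` and returns whether it was set before.
    pub fn invert_bit(&self, n: u8) -> bool {
        let b = 1u64 << n;
        self.0.fetch_xor(b, Ordering::AcqRel) & b != 0
    }
}

/// Illustrative usage (hypothetical helper, not from the original post).
pub fn example() {
    let bitmap = AtomicBitmap64::new();
    assert!(!bitmap.set_bit(14)); // bit was clear before the call
    assert!(bitmap.set_bit(14)); // now it reports "already set"
    assert!(bitmap.clear_bit(14)); // was set, is clear afterwards
    assert!(!bitmap.invert_bit(14)); // was clear, is set afterwards
}
```

Each method returns the previous state of the bit, so the code itself is fine on a hosted target; the hang described below is specific to the bare-metal environment.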

The application disassembles to the following:

// halt cores 1..3, setup stack and go to main
 0:   d53800a9    mrs x9, mpidr_el1
 4:   92400529    and x9, x9, #0x3
 8:   b4000069    cbz x9, 0x14
 c:   d503205f    wfe
10:   17ffffff    b   0xc
14:   10ffff69    adr x9, 0x0
18:   9100013f    mov sp, x9

// MYSTATIC.set_bit(14);
1c:   90000008    adrp    x8, 0x0
20:   91016108    add x8, x8, #0x58
24:   c85ffd09    ldaxr   x9, [x8]
28:   b2720129    orr x9, x9, #0x4000
2c:   c80afd09    stlxr   w10, x9, [x8]
30:   35ffffaa    cbnz    w10, 0x24

// Turn on LED <-- this code is never executed
34:   d2bfc404    mov x4, #0xfe200000
38:   b9401080    ldr w0, [x4, #16]
3c:   12177000    and w0, w0, #0xfffffe3f
40:   321a0000    orr w0, w0, #0x40
44:   b9001080    str w0, [x4, #16]
48:   52808000    mov w0, #0x400
4c:   f9001080    str x0, [x4, #32]
50:   14000000    b   0x50

It looks like the code spins forever in the ldaxr/stlxr loop, since nothing after offset 0x30 is ever executed. The loop accesses offset 0x58, which is in the .bss section of the ELF object but not in the resulting binary image, because objcopy truncates it.
I have no idea why this is happening, and I hope someone can suggest a solution or point out what I might be doing wrong. Thanks in advance!
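To make the loop at offsets 0x24 to 0x30 easier to reason about, here is a rough Rust model of it. On AArch64 without single-instruction atomics, `fetch_or` becomes a load-exclusive (ldaxr), a modify, and a store-exclusive (stlxr) that writes 0 to its status register on success and 1 on failure; cbnz retries on failure. The sketch below uses `compare_exchange_weak`, which is allowed to fail in the same way. The function name and the `max_attempts` bound are my own additions for illustration; the real `fetch_or` has no bound, which is why it spins forever if the store never succeeds.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Model of the ldaxr/stlxr retry loop with an artificial retry limit.
/// Returns the previous value, or `None` if every attempt failed.
/// (`max_attempts` is hypothetical; the real loop retries without limit.)
pub fn fetch_or_bounded(cell: &AtomicU64, bits: u64, max_attempts: u32) -> Option<u64> {
    let mut old = cell.load(Ordering::Relaxed);
    for _ in 0..max_attempts {
        // Corresponds to: orr x9, x9, #bits ; stlxr w10, x9, [x8]
        match cell.compare_exchange_weak(old, old | bits, Ordering::AcqRel, Ordering::Relaxed) {
            Ok(previous) => return Some(previous), // status == 0: store succeeded
            Err(current) => old = current,         // status != 0: reload and retry
        }
    }
    None
}
```

On hardware where the exclusive store can never succeed, this bounded version would return `None`, whereas the generated code keeps branching back to 0x24.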

Then that sounds like a binutils bug -- what version are you using?

Really? I thought uninitialized data should not appear in the image as it makes it bigger.

Anyway, it is GNU objcopy (GNU Binutils for Debian) 2.31.1, and this is my linker script:

SECTIONS
{
    . = 0x00100000;
    .text :
    {
        *(.text.init) *(.text)
    }
    .rodata :
    {
        *(.rodata)
    }
    .data :
    {
        *(.data)
    }
    .bss ALIGN(8) :
    {
        bss_start = .;
        *(.bss)
    }
    bss_size = SIZEOF(.bss);

    /DISCARD/ :
    {
        *(.comment)
    }
}

I guess it depends on what you mean by it being truncated. There still needs to be a .bss section with the appropriate size, so that address space is properly mapped and zeroed. It will usually have the NOBITS type though, so it doesn't require any space in the binary.
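Since a raw binary produced by objcopy carries no NOBITS data, bare-metal startup code normally zeroes .bss itself before any static is touched. A minimal sketch of that step follows, assuming the `bss_start` and `bss_size` symbols from the linker script above would supply the bounds on the real target. The `zero_region` function is a hypothetical helper, written so that it can also be exercised on an ordinary buffer.

```rust
use core::ptr;

/// Zeroes `len` bytes starting at `start`.
///
/// # Safety
/// `start..start + len` must be valid for writes and not otherwise in use.
pub unsafe fn zero_region(start: *mut u8, len: usize) {
    for i in 0..len {
        // Volatile stores keep the compiler from replacing the loop with a
        // call to memset, which may not be available this early in boot.
        unsafe { ptr::write_volatile(start.add(i), 0) };
    }
}

// On the target, the bounds would come from the linker script, roughly:
//
//     extern "C" {
//         static mut bss_start: u8;
//         static bss_size: u8; // absolute symbol: its *address* is the size
//     }
//
// This part is an assumption about how the symbols would be consumed; the
// original post does not show the startup code.
```

Byte-wise stores are slow but make no alignment assumptions; a real startup routine would likely clear 8 bytes at a time, given the `ALIGN(8)` on `.bss`.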

This doesn't seem to be an issue with missing zeroes in the binary image. I just tried padding the image with zeros manually and it still freezes. (And the LED on my RPi4 board does turn on if I remove the atomic write.)

What does readelf -lS look like for the initial ELF object, compared to the linked/copied binary?

Not directly related to the issue at hand, but shouldn't you be using volatile writes to do MMIO rather than atomic writes?

@FenrirWolf It's not for MMIO access; it's for storing a global variable that could be accessed by multiple threads (currently there is only one thread).

cuviper,

There are 6 section headers, starting at offset 0x10278:

Section Headers:
  [Nr] Name              Type             Address           Offset
       Size              EntSize          Flags  Link  Info  Align
  [ 0]                   NULL             0000000000000000  00000000
       0000000000000000  0000000000000000           0     0     0
  [ 1] .text             PROGBITS         0000000000100000  00010000
       0000000000000054  0000000000000000  AX       0     0     16
  [ 2] .bss              NOBITS           0000000000100058  00010058
       0000000000000008  0000000000000000  WA       0     0     8
  [ 3] .symtab           SYMTAB           0000000000000000  00010058
       0000000000000120  0000000000000018           5     9     8
  [ 4] .shstrtab         STRTAB           0000000000000000  00010178
       0000000000000026  0000000000000000           0     0     1
  [ 5] .strtab           STRTAB           0000000000000000  0001019e
       00000000000000d3  0000000000000000           0     0     1
Key to Flags:
  W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
  L (link order), O (extra OS processing required), G (group), T (TLS),
  C (compressed), x (unknown), o (OS specific), E (exclude),
  p (processor specific)

Elf file type is EXEC (Executable file)
Entry point 0x100000
There are 3 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  LOAD           0x0000000000010000 0x0000000000100000 0x0000000000100000
                 0x0000000000000054 0x0000000000000054  R E    0x10000
  LOAD           0x0000000000010058 0x0000000000100058 0x0000000000100058
                 0x0000000000000000 0x0000000000000008  RW     0x10000
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000  RW     0x0

 Section to Segment mapping:
  Segment Sections...
   00     .text 
   01     .bss 
   02     

Everything starts working as expected as soon as I replace ldaxr/stlxr with ldr/str, so the problem is definitely not in how the binary is made, but in the atomic write.

The behavior you're describing would not be a bug in Rust -- it's a sign that the processor is not succeeding in the atomic write.

The memory location at offset 0x58 is almost certainly MYSTATIC.

I haven't used the A72 specifically, but here are some questions:

You mentioned that your code is running on the bare metal. Is there something you have to do to enable or initialize the A72's "exclusive monitor" (ARM jargon for the hardware that detects conflicts in atomic accesses)?

What type of RAM are you interacting with? TCM, SDRAM...? Check that it can be used for atomic accesses at all. (For example, on most ARMv7-M cores, there's a section of address space that acts as Device memory where atomic operations have undefined behavior.)

Is it possible you've got an interrupt repeatedly firing, perhaps because it hasn't been cleared correctly by its ISR? On most ARM cores that's enough to cause you to lose the exclusive monitor.

Are there other cores? What are they doing? Accessing the same memory you're using (not the same address, but the same bus) is likely to cause you to lose the exclusive monitor.

Added: [this page](http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.100095_0003_06_en/Chunk905102933.html) has reminded me that, on an A-class core, you have control over the memory attributes. So if you've marked the page as non-shareable and L1 cacheable in the page table, you're only dealing with the core's local monitor, which would be the simplest case to debug; the memory system's global monitor might require initialization.

Hmm, I don't have interrupts or other threads/cores running. The code runs in RAM. I haven't enabled the memory management unit, so no virtual-to-physical address translation is set up. I also haven't initialized the exclusive access monitor manually (unless the bootloader firmware has done something). Do I need to?
I've read in the ARMv8 Reference Manual that the exclusive access monitor can check for a correct virtual-to-physical address match. Interesting, could that be the case?
cbiffle, thanks for the link, I'll check it out.

Unfortunately that link is to ARM docs, so it mostly just names other docs without linking them. :roll_eyes: But maybe it'll have the answer.

It's possible you need to set up page tables for load/store exclusive to work. I should get some A72 hardware, this sounds like fun.

Hi there,
as you are running the code on an RPi, you need to configure the MMU and enable the caches to get the atomic operations to work.
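The thread does not show the eventual fix, so the following is only a sketch of the register values involved, not the poster's code. As far as I understand the architecture, with the stage 1 MMU disabled all data accesses are treated as Device-nGnRnE memory, and exclusive accesses to such memory are not guaranteed to succeed. The fix is to map RAM as Normal cacheable memory and then set the M, C and I bits in SCTLR_EL1. The constant and function names below are my own; the bit positions and MAIR encodings are the ARMv8-A ones. Building the translation tables and the actual `msr` writes are omitted.

```rust
/// SCTLR_EL1 control bits (ARMv8-A).
pub const SCTLR_M: u64 = 1 << 0; // enable stage 1 address translation (MMU)
pub const SCTLR_C: u64 = 1 << 2; // enable data and unified caches
pub const SCTLR_I: u64 = 1 << 12; // enable instruction cache

/// MAIR_EL1 attribute bytes.
pub const MAIR_DEVICE_NGNRNE: u64 = 0x00; // for MMIO such as the GPIO block
pub const MAIR_NORMAL_WB: u64 = 0xFF; // Normal memory, write-back cacheable

/// Returns `current` with the MMU and both caches enabled.
pub const fn sctlr_with_mmu_and_caches(current: u64) -> u64 {
    current | SCTLR_M | SCTLR_C | SCTLR_I
}

/// MAIR_EL1 value with attribute index 0 = Device, index 1 = Normal.
/// (The choice of indices is arbitrary; page table entries must match it.)
pub const fn mair_value() -> u64 {
    MAIR_DEVICE_NGNRNE | (MAIR_NORMAL_WB << 8)
}
```

The page table entries covering RAM would then select attribute index 1, and the entries covering the peripheral range at 0xfe200000 would select index 0, so that the LED writes stay uncached.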

Yes, it's true. I already figured that out. Thank you!
