Hard fault due to invalid stack pointer

Hi!

I'm getting a hard fault on nRF52840 due to Rust generating a stack pointer that points into the FLASH memory region. This happens with Rust 1.85.1 as well as 1.84. A proper memory.x file is present in the build target.

I tracked down the fault to the following assembly instructions:

; SP: 0x2000 34E0
0000C46A   SUB.W          SP, SP, #0x025C00
; SP: 0x1FFD D8E0 - this is already wrong because RAM >= 0x2000 0000
0000C46E   SUB            SP, SP, #32
; SP: 0x1FFD D8C0
0000C470   STR            R1, [SP, #76]   ; Triggers the Hard Fault

These instructions are being generated within the preamble of an async method call in the embassy framework:

impl<R, Rng, U, TIMER> Device<R, Rng, U, TIMER>
where
    R: Radio,
    for<'a> R::RadioFrame<&'a mut [u8]>: RadioFrameMut<&'a mut [u8]>,
    for<'a> R::TxToken<'a>: From<&'a mut [u8]>,
    Rng: RngCore,
    U: UpperLayer,
    TIMER: DelayNs + Clone,
{
    // The preamble to the following method fails!
    pub async fn run(&mut self) -> ! {
        let (mut tx, mut rx) = (Channel::new(), Channel::new());
        let (tx_send, tx_recv) = tx.split();
        let (rx_send, rx_recv) = rx.split();

        let mut tx_done = Channel::new();
        let (tx_done_send, tx_done_recv) = tx_done.split();

        let mut phy_service = PhyService::new(&mut self.radio, tx_recv, rx_send, tx_done_send);
        let mut mac_service = MacService::<'_, Rng, U, TIMER, R>::new(
            &mut self.rng,
            &mut self.upper_layer,
            self.timer.clone(),
            rx_recv,
            tx_send,
            tx_done_recv,
        );

        match select::select(mac_service.run(), phy_service.run()).await {
            Either::First(_) => panic!("Tasks should never terminate, MAC service just did"),
            Either::Second(_) => panic!("Tasks should never terminate, PHY service just did"),
        }
    }
}

I've no idea at all how I could further debug that. Help is very much appreciated!

This looks like an oversized stack allocation. You must have created very large objects, probably indirectly through some async function's synthesized Future type, though it's also possible that you simply created a large struct or array by accident.

Note that it's not just named variables: temporary values are allocated on the stack too.
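To illustrate how a large object can hide inside a synthesized Future type, here is a minimal sketch. The `inner` and `outer` functions are hypothetical and not taken from the code above; the point is only that a parent future embeds the state of every future it awaits, so a buffer deep in the call tree can inflate the top-level future.

```rust
use core::future::ready;
use core::mem::size_of_val;

// Hypothetical leaf task: `buf` is used after the await, so it has to be
// stored inside the future's state rather than in a short-lived stack slot.
async fn inner() -> u8 {
    let buf = [0u8; 4096];
    ready(()).await;
    buf[0]
}

// Hypothetical parent task: it declares no large local of its own, but
// awaiting `inner()` embeds that future's whole state in this one.
async fn outer() -> u8 {
    inner().await
}

// Creating a future does not run it, so measuring it is cheap.
fn inner_future_size() -> usize {
    size_of_val(&inner())
}

fn outer_future_size() -> usize {
    size_of_val(&outer())
}

fn main() {
    println!("inner future: {} bytes", inner_future_size());
    println!("outer future: {} bytes", outer_future_size());
}
```

Wherever such a future is created as a local or temporary, its full size has to be reserved in that function's stack frame.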


Thanks @nerditation ,

Yes, must be an oversized future. Sounds like a probable hypothesis. But...

  1. How can I track this down? Does it make sense to look into the ELF file? See update below.
  2. This is a hard fault in safe code... Isn't Rust supposed to catch that kind of error? Nope - not for stack overflows.

UPDATE: Confirmed. Since the stack allocations in each function's preamble always follow the same pattern, it's easy to see which function allocates what. By looking at the SP offsets I can estimate the size of each variable or future. That works.
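For reference, the arithmetic behind the disassembly in the first post can be written out as a small sketch. The constants are copied from that listing; the helper names are only for illustration, and the "available" figure is an upper bound that assumes nothing else sits between the stack pointer and the start of RAM.

```rust
// Values copied from the disassembly in the first post.
const SP_ON_ENTRY: u32 = 0x2000_34E0;
const FIRST_SUB: u32 = 0x0002_5C00; // SUB.W SP, SP, #0x025C00
const SECOND_SUB: u32 = 32; // SUB SP, SP, #32
const RAM_START: u32 = 0x2000_0000; // RAM begins here, as noted in the listing

// Total number of bytes the preamble reserves for this frame.
fn frame_size() -> u32 {
    FIRST_SUB + SECOND_SUB
}

// Stack pointer after both SUB instructions have executed.
fn sp_after_preamble() -> u32 {
    SP_ON_ENTRY - frame_size()
}

// Upper bound on the stack space left before the call (illustrative helper).
fn stack_available() -> u32 {
    SP_ON_ENTRY - RAM_START
}

fn main() {
    println!("frame size:      {} bytes", frame_size());
    println!("stack available: {} bytes at most", stack_available());
    println!("SP after:        {:#010X}", sp_after_preamble());
    println!("below RAM start: {}", sp_after_preamble() < RAM_START);
}
```

So this one frame asks for roughly 151 KiB while at most about 13 KiB of stack remain, which is why the first store through SP faults.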

You can use cargo +nightly rustc -- -Zprint-type-sizes to get a dump of the size of every data type, including async block types. It contains enough variable name and type info that you can see what the async block is doing at each step.[1]

One thing that can be useful is to make sure that all variables are dropped as soon as possible (by restricting their scope or explicitly drop()ping them) — particularly, if you create them and drop them with zero intervening awaits, then they don't have to be stored in the async block type at all. But -Zprint-type-sizes will let you see exactly what is kept and whether your changes had the desired effect.
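A minimal sketch of the scoping idea, using two hypothetical async functions that do the same work. In the first, the buffer is still in use after the await; in the second, it is confined to a block that ends before the await, so only the small result has to live in the future.

```rust
use core::future::ready;
use core::mem::size_of_val;

// Illustrative helper: sums all bytes of a buffer.
fn checksum(buf: &[u8]) -> u32 {
    buf.iter().map(|&b| u32::from(b)).sum()
}

// `buf` is used again after the await, so the future must store all of it.
async fn held_across_await() -> u32 {
    let buf = [1u8; 2048];
    let sum = checksum(&buf);
    ready(()).await;
    sum + u32::from(buf[0])
}

// `buf` lives in an inner block that ends before the await; only `sum`
// needs to survive the suspension point.
async fn scoped_before_await() -> u32 {
    let sum = {
        let buf = [1u8; 2048];
        checksum(&buf)
    };
    ready(()).await;
    sum
}

fn held_size() -> usize {
    size_of_val(&held_across_await())
}

fn scoped_size() -> usize {
    size_of_val(&scoped_before_await())
}

fn main() {
    println!("held across await:   {} bytes", held_size());
    println!("scoped before await: {} bytes", scoped_size());
}
```

One caveat for Copy types such as plain arrays: `drop(buf)` only drops a copy of the value, so a block scope is probably the more dependable way to end the original's lifetime before the await.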

On whether Rust is supposed to catch this kind of error: yes and no. Where possible, Rust uses stack probes and guard pages to protect against stack overflows. This is a mechanism where:

  • At the end of the available stack space, there is a special page set to be unwritable (the guard page).
  • Whenever a stack frame is allocated, the compiler will generate code that writes at least one byte to each page (probes) between the old stack pointer and the new, so that the guard page will be hit before any memory entirely outside of the stack is. (In the case where the expansion of the stack is not greater than one page, no probe code is needed at all.)
  • A segfault handler is installed to attempt to give you a nice “stack overflow” error message. But, that is just a nicety for debugging; the actual safety mechanism is the segfault.
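The probing step can be sketched as a small host-side simulation. This is not the code the compiler emits; `probe_addresses` and `hits_guard` are hypothetical helpers, and the addresses and page size in `main` are made up. It only shows why touching one address per page guarantees that a large frame cannot jump over the guard page.

```rust
// Simulation only: returns the addresses a probing function entry would
// touch while growing the stack downwards by `frame_size` bytes.
fn probe_addresses(old_sp: usize, frame_size: usize, page_size: usize) -> Vec<usize> {
    let new_sp = old_sp - frame_size;
    let mut probes = Vec::new();
    let mut addr = old_sp;
    // Step down one page at a time so that no page between the old and the
    // new stack pointer is skipped. Frames of at most one page need no probe.
    while addr - new_sp > page_size {
        addr -= page_size;
        probes.push(addr);
    }
    probes
}

// True if any probe lands in the guard page [guard_lo, guard_hi).
fn hits_guard(probes: &[usize], guard_lo: usize, guard_hi: usize) -> bool {
    probes.iter().any(|&a| a >= guard_lo && a < guard_hi)
}

fn main() {
    let page = 0x1000;
    // Made-up layout: guard page at 0x10000..0x11000, stack just above it.
    let probes = probe_addresses(0x1_1800, 3 * page, page);
    println!("probes: {:X?}", probes);
    println!("guard page hit: {}", hits_guard(&probes, 0x1_0000, 0x1_1000));
}
```

In the made-up layout the new stack pointer would end up entirely below the guard page, yet the first probe already lands inside it, which is where a real program would fault.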

But for these mechanisms to work automatically as intended, there needs to be an OS that Rust knows how to make system calls to, to set up the guard page and the segfault handler. In a no_std environment, Rust doesn’t know what to do, so to get the full protection in the same way, you would need to set up a guard page yourself (and your hardware must have an MMU or, on Cortex-M parts, an MPU).

But perhaps this has already been done for you statically, by arranging so that the stack is next to always-read-only flash address space, with nothing else to stomp on in between? If so, that's just as good as an explicit guard page.


[1] I added some of it.


This is a great answer. Thank you so much for taking the time to write it. :)