Debugging: program immediately crashes on embedded target

I'm writing and testing library code (with no_std) and can currently build it for x86 and nRF52840 (ARM Cortex M4).

Everything works fine for x86 builds -- I can use the library's API functions and the code works.

Basically there are three API calls -- keygen(), sign() and verify().

I can also build the example code and library for the nRF52840 target. Some of the library's API functions work fine.
But there's at least one function (sign()) that causes problems on the target -- just having the function call in main() causes the program to crash immediately, i.e. I can't even use any breakpoints to start debugging, and even the previous function call (before the "trouble function") seems not to get executed.

Error message

ERROR probe_rs::architecture::arm::core::armv7m: The core is in locked up status as a result of an unrecoverable exception

opt-level = "z"

With opt-level = "z" I get a continuously running program and when I stop with CTRL-C in gdb, it reads:

^C
Program received signal SIGINT, Interrupt.
lms_demo_nrf52::__cortex_m_rt_HardFault_trampoline (warning: Could not fetch required XPSR content.  Further unwinding is impossible.
frame=0x1fff0058) at examples/lms_demo_nrf52.rs:19
19	#[exception]

opt-level = 0

With opt-level = 0 I can stop at a breakpoint and then continue. From there on it's similar: I have to stop via CTRL-C and then:

^C
Program received signal SIGINT, Interrupt.
lms_demo_nrf52::__cortex_m_rt_HardFault_trampoline (frame=<error reading variable: Cannot access memory at address 0x1ffffb04>) at examples/demo_nrf52.rs:19
19	#[exception]

At line 19:

#[exception]
unsafe fn HardFault(_ef: &cortex_m_rt::ExceptionFrame) -> ! {
    dk::fail();
}

Noteworthy

There's a configuration option that can be changed and then allows the sign() function to run. It's not a fix because it's a wrong configuration, and our library code reports an error (which is the expected behaviour in that case).

It's originally not my code, but I'll try to review this to see what changes with this configuration option.

At first I assumed problems with stack allocation, but afaik the stack should only cause trouble at runtime.

Questions

  • What could possibly make the program crash right at the start?
  • And are there ways to debug this?

Update

Using the workaround for main() from this post slightly changes the situation:

  • when just flashing and executing the program, I get some debug output right after entering main() via defmt::println!("main()"); (which wasn't there before)
  • when debugging, I also don't get the crash immediately, but at the push()
  • but the keygen(), which gets executed successfully if there's no call to sign(), does not get executed anymore (although the code before sign() is unchanged)

let mut vec_hss_params: ArrayVec<[_; REF_IMPL_MAX_ALLOWED_HSS_LEVELS]> = Default::default();
let hss_parameters = HssParameter::new(LmotsAlgorithm::LmotsW1, LmsAlgorithm::LmsH5);
vec_hss_params.push(hss_parameters);

So I'll dig into that with the debugger.
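Since the crash now shows up around the push(), one thing I can check on the host is how many bytes such a fixed-capacity vector occupies inline. The following is only a sketch: InlineVec and FakeParams are hypothetical stand-ins I made up (the real ArrayVec and HssParameter layouts are not shown here), and the 512-byte field is an arbitrary illustration.

```rust
use core::mem::size_of;

/// Hypothetical stand-in for a fixed-capacity vector that keeps its
/// elements inline (no heap), so the whole buffer lives wherever the
/// value lives, e.g. in the stack frame of main().
#[allow(dead_code)]
struct InlineVec<T, const N: usize> {
    len: usize,
    data: [T; N],
}

/// Hypothetical parameter type; the size is an assumption for illustration.
#[allow(dead_code)]
struct FakeParams {
    scratch: [u8; 512],
}

fn main() {
    // The inline size grows linearly with the capacity N.
    println!("one element: {} bytes", size_of::<FakeParams>());
    println!("8 slots:     {} bytes", size_of::<InlineVec<FakeParams, 8>>());
}
```

The same size_of calls can be made on the real types to see whether any of them comes close to the stack size available on the target.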

apparently, you had a hard fault somewhere in your code. however, this warning is concerning:

warning: Could not fetch required XPSR content.  Further unwinding is impossible.

I'd never seen this warning before so I'm not sure how corrupted the program state is. but you can try to inspect the CPU registers to get some clue. if any register value looks like a memory address, try to inspect that memory location as well.
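one register worth inspecting is the Configurable Fault Status Register (CFSR, at 0xE000ED28 on ARMv7-M), which you can read in gdb with x/wx 0xE000ED28. below is a rough host-side sketch for decoding it. the bit positions follow the ARMv7-M architecture, but the function name decode_cfsr and the example value 0x00001000 are made up for illustration, they are not taken from your session.

```rust
/// (bit position, name, meaning) for the CFSR bits defined on ARMv7-M.
const CFSR_BITS: &[(u32, &str, &str)] = &[
    (0, "IACCVIOL", "MemManage: instruction access violation"),
    (1, "DACCVIOL", "MemManage: data access violation"),
    (3, "MUNSTKERR", "MemManage: fault while unstacking on exception return"),
    (4, "MSTKERR", "MemManage: fault while stacking on exception entry"),
    (5, "MLSPERR", "MemManage: fault during lazy FP state preservation"),
    (7, "MMARVALID", "MMFAR holds the faulting address"),
    (8, "IBUSERR", "BusFault: instruction bus error"),
    (9, "PRECISERR", "BusFault: precise data bus error"),
    (10, "IMPRECISERR", "BusFault: imprecise data bus error"),
    (11, "UNSTKERR", "BusFault: fault while unstacking on exception return"),
    (12, "STKERR", "BusFault: fault while stacking on exception entry"),
    (13, "LSPERR", "BusFault: fault during lazy FP state preservation"),
    (15, "BFARVALID", "BFAR holds the faulting address"),
    (16, "UNDEFINSTR", "UsageFault: undefined instruction"),
    (17, "INVSTATE", "UsageFault: invalid execution state"),
    (18, "INVPC", "UsageFault: invalid PC load on exception return"),
    (19, "NOCP", "UsageFault: coprocessor access not permitted"),
    (24, "UNALIGNED", "UsageFault: unaligned access"),
    (25, "DIVBYZERO", "UsageFault: division by zero"),
];

/// Returns the names of all CFSR flags that are set in `cfsr`.
fn decode_cfsr(cfsr: u32) -> Vec<&'static str> {
    CFSR_BITS
        .iter()
        .filter(|entry| cfsr & (1u32 << entry.0) != 0)
        .map(|entry| entry.1)
        .collect()
}

fn main() {
    // Hypothetical value: replace it with what gdb reports.
    let cfsr = 0x0000_1000;
    for name in decode_cfsr(cfsr) {
        println!("{name}");
    }
}
```

a stacking error (MSTKERR or STKERR) would fit a stack pointer that has run outside of RAM, but that is only a guess until you read the real value.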

many people have written about their experiences debugging hard faults. you can search for and read them; maybe you'll get some inspiration.

as a general rule to diagnose this kind of crash (very likely caused by UB), the first step is to find unsoundness in unsafe code.

Thanks, I'll consider those hints.

The value of XPSR is 0x61000003 and confirms the HardFault (exception number 3).

XPSR

I'm wondering what this message means:

Could not fetch required XPSR content. Further unwinding is impossible

What does it try to fetch? The content of the register is readable via gdb and i r => XPSR = 0x61000003, it confirms the HardFault.
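To double-check my reading of the register, here is a small host-side sketch that splits an xPSR value into its fields. The bit positions (exception number in bits 8:0, Thumb bit 24, flags in bits 31:28) are the ARMv7-M layout; the names Xpsr, decode_xpsr and exception_name are my own and not from any crate.

```rust
/// The xPSR fields that matter when triaging a fault on Cortex-M.
#[derive(Debug, PartialEq)]
struct Xpsr {
    exception_number: u32, // IPSR, bits 8:0
    thumb: bool,           // EPSR T bit, bit 24
    negative: bool,        // APSR N, bit 31
    zero: bool,            // APSR Z, bit 30
    carry: bool,           // APSR C, bit 29
    overflow: bool,        // APSR V, bit 28
}

/// True if bit `n` of `raw` is set.
fn bit(raw: u32, n: u32) -> bool {
    (raw >> n) & 1 == 1
}

fn decode_xpsr(raw: u32) -> Xpsr {
    Xpsr {
        exception_number: raw & 0x1FF,
        thumb: bit(raw, 24),
        negative: bit(raw, 31),
        zero: bit(raw, 30),
        carry: bit(raw, 29),
        overflow: bit(raw, 28),
    }
}

/// Maps an ARMv7-M exception number to its conventional name.
fn exception_name(n: u32) -> &'static str {
    match n {
        0 => "Thread mode",
        1 => "Reset",
        2 => "NMI",
        3 => "HardFault",
        4 => "MemManage",
        5 => "BusFault",
        6 => "UsageFault",
        11 => "SVCall",
        12 => "DebugMonitor",
        14 => "PendSV",
        15 => "SysTick",
        7..=10 | 13 => "reserved",
        _ => "external interrupt",
    }
}

fn main() {
    let xpsr = decode_xpsr(0x6100_0003);
    println!("{:?}", xpsr);
    println!(
        "active exception: {} ({})",
        xpsr.exception_number,
        exception_name(xpsr.exception_number)
    );
}
```

For 0x61000003 this gives exception number 3 (HardFault) with the Thumb, Z and C bits set, so the register content itself looks plausible.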

Weird memory location

Assuming this is what the memory layout looks like:

MEMORY
{
  FLASH : ORIGIN = 0x00000000, LENGTH = 1024K
  RAM   : ORIGIN = 0x20000000, LENGTH = 256K
}

then (frame=<error reading variable: Cannot access memory at address 0x1ffffb04>) makes sense.

This should be unused memory. It's 0x4FC below the first RAM address 0x20000000, and because of flip-link (which is supposed to move the stack to the lowest addresses in RAM) and the "negative" value, to me it seems like a stack overflow issue: only the stack would grow to lower addresses and thus could reach a value below 0x20000000.
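The arithmetic can be checked with a short host-side sketch. The RAM origin and length come from the memory layout assumed above; Location and locate are names I made up for this check. I've also included the frame address 0x1fff0058 from the opt-level = "z" run.

```rust
const RAM_ORIGIN: u32 = 0x2000_0000;
const RAM_LENGTH: u32 = 256 * 1024;

/// Where an address lies relative to the RAM region.
#[derive(Debug, PartialEq)]
enum Location {
    BelowRam { bytes_below: u32 },
    InRam { offset: u32 },
    AboveRam,
}

fn locate(addr: u32) -> Location {
    if addr < RAM_ORIGIN {
        Location::BelowRam { bytes_below: RAM_ORIGIN - addr }
    } else if addr - RAM_ORIGIN < RAM_LENGTH {
        Location::InRam { offset: addr - RAM_ORIGIN }
    } else {
        Location::AboveRam
    }
}

fn main() {
    // Frame addresses reported by gdb in the two builds.
    for addr in [0x1fff_fb04u32, 0x1fff_0058] {
        println!("{:#010x}: {:x?}", addr, locate(addr));
    }
}
```

Both frame addresses end up below the start of RAM (0x4FC and 0xFFA8 bytes respectively), which is consistent with a stack that grew downwards past the lowest RAM address.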

you are probably correct. the address 0x1ffffb04 looks pretty much like a stack overflow to me. maybe the hard fault is raised due to a double fault (for example maybe the MPU MemoryManage handler is misbehaving, or entirely missing).

I'm not sure how your low level support library is implemented, for cortex-m-rt based runtime, it should be named MemoryManagement and annotated with the exception attribute. it can return, or it can diverge, depending on the implementation.

use cortex_m_rt::exception;

#[exception]
unsafe fn MemoryManagement() -> ! {
    // inspect the fault status registers here, then halt
    loop {}
}

if it is missing, the exception is handled by the DefaultHandler.

either way, you can try to set a break point at the handler, to see if it's possible to catch the first occurrence of the stack overflow. if you succeed, you have a higher chance to get a full stack trace of the call stack.

From that I understand that, in case of a stack overflow, I should expect a MemoryManagementFault instead of a HardFault.

I've checked with grep -iIrn HardFault ~/.cargo/registry:

pub(crate) enum ExceptionReason {
    /// No exception is active.
    ThreadMode,
    /// A reset has been triggered.
    Reset,
    /// A non-maskable interrupt has been triggered.
    NonMaskableInterrupt,
    /// A hard fault has been triggered.
    HardFault,
    /// A memory management fault has been triggered.
    MemoryManagementFault,
    /* ... */

which supports that thought. grep -iIrn MemoryManagement ~/.cargo/registry/ yields what I understand to be a declaration of a function from some C library:

./registry/src/index.crates.io-6f17d22bba15001f/cortex-m-rt-0.7.3/src/lib.rs:

extern "C" {
    fn Reset() -> !;

    fn NonMaskableInt();

    fn HardFaultTrampoline();

    #[cfg(not(armv6m))]
    fn MemoryManagement();
    /* ... */

For handling the HardFault, I've used the piece of code provided by the rust-exercise's nrf52-code/radio-app/src/lib.rs:

/// The default HardFault handler just spins, so replace it.
/// probe-run used to set a hardfault breakpoint but probe-rs doesn't, so make
/// the HardFault handler quit out of probe-rs with a breakpoint.
#[exception]
unsafe fn HardFault(_ef: &cortex_m_rt::ExceptionFrame) -> ! {
    dk::fail();
}

I'll look for the DefaultHandler and will try setting a breakpoint there.

yes, the MPU violation should raise a MemoryManagementFault, which by default has priority 0. but the original MemoryManagementFault can escalate into HardFault under certain conditions, for example, due to mis-configured exception vector table, or the exception is not enabled, or other similar conditions.
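to tell whether the HardFault was escalated, you can read HFSR (0xE000ED2C) and SHCSR (0xE000ED24) in gdb and decode them. here is a minimal host-side sketch: the bit positions are the ARMv7-M ones, while the names hard_fault_cause and enabled_fault_handlers and the example values are only assumptions for illustration.

```rust
const HFSR_VECTTBL: u32 = 1 << 1; // fault on a vector table read
const HFSR_FORCED: u32 = 1 << 30; // configurable fault escalated to HardFault

const SHCSR_MEMFAULTENA: u32 = 1 << 16;
const SHCSR_BUSFAULTENA: u32 = 1 << 17;
const SHCSR_USGFAULTENA: u32 = 1 << 18;

#[derive(Debug, PartialEq)]
enum HardFaultCause {
    Escalated,
    VectorTableRead,
    Unknown,
}

/// Classifies a HardFault from the HFSR value.
fn hard_fault_cause(hfsr: u32) -> HardFaultCause {
    if hfsr & HFSR_FORCED != 0 {
        HardFaultCause::Escalated
    } else if hfsr & HFSR_VECTTBL != 0 {
        HardFaultCause::VectorTableRead
    } else {
        HardFaultCause::Unknown
    }
}

/// Lists the configurable fault handlers that are enabled in SHCSR.
fn enabled_fault_handlers(shcsr: u32) -> Vec<&'static str> {
    let mut enabled = Vec::new();
    if shcsr & SHCSR_MEMFAULTENA != 0 {
        enabled.push("MemManage");
    }
    if shcsr & SHCSR_BUSFAULTENA != 0 {
        enabled.push("BusFault");
    }
    if shcsr & SHCSR_USGFAULTENA != 0 {
        enabled.push("UsageFault");
    }
    enabled
}

fn main() {
    // Hypothetical values: replace them with what gdb reports.
    let (hfsr, shcsr) = (0x4000_0000u32, 0x0000_0000u32);
    println!("cause: {:?}", hard_fault_cause(hfsr));
    println!("enabled handlers: {:?}", enabled_fault_handlers(shcsr));
}
```

if FORCED is set and none of the handlers is enabled, the original fault never had a chance to reach its own handler, and CFSR should still tell you what it was.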

and that's why it is important to know how the system's low level initialization is done.

extern "C" is just declaring a function with C ABI. extern means the function is only resolved at link time (as opposed to importing a regular pub function, which must be resolved at compile time). it doesn't necessarily need to be written in C (and in this case it shouldn't be). the reason to use extern is the inverted dependency on downstream crates. in theory extern "Rust" can also be used for this.

if the provided code of the exercise didn't properly set up MemoryManagementFault, that could be the reason why it escalated to HardFault. if that's the case, you can try to set up the MemoryManagementFault handler in your app code instead of relying on the exercise template, which would hopefully help you find out the exact location where the stack overflow happened.

Thanks for the support and sticking with me.

For now (and maybe for good...) I unfortunately have to stop working on this issue -- because it's a task in a time-limited project. I'm gonna "lose" the target hardware for testing soon, but someone else hopefully will take over and benefit from my documentation and those discussions.

In the worst case, this issue probably can be understood much better by digging more and more into the program (with breakpoints) to see where the stack pointer performs this huge "jump" into nowhere.

Remark:

I'll write a more general question, which addresses the scenario (as already described at the start of this thread), because I'm curious how the following can happen:

  • calling one library function ("first") is fine
  • if another library function call is added after the "first", even the function called first will not finish anymore (and instead crash with what's most likely a stack overflow)
