Release build slower than debug on ARM (Yocto cross-compile)

Hey everybody,

I have written an application for some networking tasks that runs in user space on my Yocto Linux system.
I noticed a drop in performance when running the --release build.
Since then, I’ve tried different opt-levels and explicitly setting target-cpu, but nothing improved.

I realise it is probably a "me problem", since I have a hard time finding any sources that discuss problems with the optimization. If you have any pointers on how to pinpoint the issue, I would greatly appreciate the help. Thanks in advance!

Context:

  • Target: Rock Pi E (Arm Cortex‑A53 64‑bit)
  • Build system: Yocto, using a sourced SDK
  • Cross-compiling Rust (not building natively on the device)
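Because the build goes through a sourced Yocto SDK rather than plain cargo, it may be worth first ruling out that the binary on the device was built differently than you expect. A minimal sketch, assuming a standard Rust toolchain: the program below uses the `cfg!` macro to print a few compile-time settings baked into the binary. Running both the debug and the release binary on the Rock Pi E would show whether debug assertions are really off in release and whether NEON is enabled for aarch64. The choice of settings to check is an assumption; add any others you expect target-cpu to turn on.

```rust
/// Reports compile-time settings that were baked into this binary.
/// The list of settings is illustrative; extend it as needed.
fn build_info() -> Vec<(&'static str, bool)> {
    vec![
        // true in a default debug build, false in a default release build
        ("debug_assertions", cfg!(debug_assertions)),
        // should be true when cross-compiling for the Cortex-A53
        ("target_arch = aarch64", cfg!(target_arch = "aarch64")),
        // SIMD extension available on the Cortex-A53
        ("target_feature = neon", cfg!(target_feature = "neon")),
    ]
}

fn main() {
    for (name, enabled) in build_info() {
        println!("{name}: {enabled}");
    }
}
```

If the "release" binary reports `debug_assertions: true`, the SDK's build setup is probably not using the profile you think it is.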

Can you share your code? Have you tried profiling to find out which parts of the code run faster without --release?
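One way to start narrowing this down, before reaching for a full profiler, is to time the same hot function under both profiles. The sketch below uses a hypothetical hot path, `internet_checksum` (the RFC 1071 Internet checksum, chosen only because it is typical networking work); swap in a function from your own application. `time_per_call` is also a hypothetical helper, and `std::hint::black_box` (stable since Rust 1.66) keeps the optimizer from deleting the measured work. Build with `cargo build` and `cargo build --release`, run both binaries on the target, and compare the numbers.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical hot path: RFC 1071 Internet checksum over a byte buffer.
fn internet_checksum(data: &[u8]) -> u16 {
    // u64 accumulator so large buffers cannot overflow before folding
    let mut sum: u64 = 0;
    let mut chunks = data.chunks_exact(2);
    // Sum 16-bit big-endian words.
    for pair in &mut chunks {
        sum += u64::from(u16::from_be_bytes([pair[0], pair[1]]));
    }
    // Pad an odd trailing byte with a zero low byte.
    if let [last] = chunks.remainder() {
        sum += u64::from(*last) << 8;
    }
    // Fold carries back into the low 16 bits.
    while sum > 0xFFFF {
        sum = (sum & 0xFFFF) + (sum >> 16);
    }
    !(sum as u16)
}

/// Runs `f` `iters` times and returns the mean wall-clock time per call.
/// `iters` must be non-zero.
fn time_per_call<F: FnMut()>(iters: u32, mut f: F) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        f();
    }
    start.elapsed() / iters
}

fn main() {
    // 1500 bytes of synthetic data, roughly one Ethernet MTU.
    let packet: Vec<u8> = (0..1500u32).map(|i| (i % 251) as u8).collect();
    let per_call = time_per_call(100_000, || {
        // black_box hides the input and output from the optimizer.
        black_box(internet_checksum(black_box(&packet)));
    });
    let profile = if cfg!(debug_assertions) { "debug" } else { "release" };
    println!("{profile}: {per_call:?} per checksum");
}
```

Timing one function like this is coarse; if the whole program is slower in release but an isolated function is not, a sampling profiler on the device is the more reliable next step.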

I do find it interesting that this is the second case reported here in a short time (you'd have to find the other post); is it possible there's been a regression?

Do you mean this?

No, it was more recent than that one. I don't believe it's been closed yet... let me try and find it.

Edit: here it is: "Chess engine is faster with debug-assertions = true"

For the chess engine, it finally turned out to be just an -O3 vs -O2 issue. But the OP said they had already tested different opt-levels. Of course, it is not really nice that -O3 can be 20 percent slower than -O2 for a few apps, but I currently don't have the skills, time, or motivation to debug that.