So we distribute a service written in Rust to our clients, who run it on their IoT devices. One of the clients, who runs it on embedded Android, is seeing crashes in this service because of memory corruption. We don't have any unsafe code in our codebase, so if it is happening because of a bug in our program and not because of a hardware error (unlikely, because they saw the crash on many devices), it has to be unsafe code in one of our 400 dependencies. Either that or a bug in the compiler.
I tried using cargo-geiger, but it panics when processing some of the dependencies because they have Rust files with syntax errors in their source tree, apparently for some kind of testing, and geiger doesn't like that.
Get in contact with your clients and try to reproduce the crash. Once you can reproduce it reliably, you're most of the way there, because there are many ways to debug it from that point: you can check the call stack to find the last function that was called, or you can narrow it down manually (depending on the project).
Your best option would be to get a core dump of the crashed process, or to reproduce the crash while you have a debugger attached. Then, if the corruption did not mess up too much, you'd be able to investigate what the program was doing, and what got overwritten with what.
Check whether the crashes are limited to specific devices, or to devices with a specific OS version or other software running. Sometimes the hardware is broken, or the operating system libraries are corrupted/buggy/incompatible.
Another thing you can do is log what your process is doing, especially before calling dependencies that you suspect may be the cause, and then ask for logs from the crashing devices, and try to deduce from the logs where things go wrong.
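To make the logging idea concrete, here is a minimal sketch using only the standard library. The key point is flushing after every line: a memory-corruption crash usually kills the process with a signal, so anything still sitting in a buffer is lost. The function names (`breadcrumb`, `traced`, `unfinished_calls`) and the labels like "dep_a::parse" are made up for illustration, not part of any real crate.

```rust
use std::io::{self, Write};
use std::time::{SystemTime, UNIX_EPOCH};

/// Writes one log line and flushes immediately, so the line survives
/// even if the process is killed by a signal right afterwards.
fn breadcrumb<W: Write>(log: &mut W, event: &str, label: &str) -> io::Result<()> {
    let millis = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_millis())
        .unwrap_or(0);
    writeln!(log, "{} {} {}", millis, event, label)?;
    log.flush()
}

/// Runs `f` between an "enter" line and an "exit" line.
/// If the process dies inside `f`, the "exit" line is never written.
fn traced<W: Write, T>(log: &mut W, label: &str, f: impl FnOnce() -> T) -> io::Result<T> {
    breadcrumb(&mut *log, "enter", label)?;
    let result = f();
    breadcrumb(&mut *log, "exit", label)?;
    Ok(result)
}

/// Given the text of a log from a crashed device, returns the labels that
/// were entered but never exited, outermost first.
fn unfinished_calls(log_text: &str) -> Vec<String> {
    let mut open: Vec<String> = Vec::new();
    for line in log_text.lines() {
        // Expected line shape: "<millis> <event> <label>"
        let mut parts = line.splitn(3, ' ');
        let (event, label) = match (parts.next(), parts.next(), parts.next()) {
            (Some(_millis), Some(event), Some(label)) => (event, label),
            _ => continue, // skip lines that do not match the shape
        };
        match event {
            "enter" => open.push(label.to_string()),
            "exit" => {
                // Close the most recent matching "enter".
                if let Some(pos) = open.iter().rposition(|l| l.as_str() == label) {
                    open.remove(pos);
                }
            }
            _ => {}
        }
    }
    open
}
```

In the service you would pass an opened log file (or stderr) as `log` and wrap each suspicious dependency call in `traced`. Running `unfinished_calls` over the logs the client sends back then tells you which calls were in progress when the process died. Keep in mind the corruption may have happened much earlier than the crash, so this narrows things down rather than proving guilt.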
If your service, or significant parts of it, can be run on some fast Linux machine, you could try running it with Address Sanitizer enabled, and fuzz it.
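As a rough sketch of what "fuzz it" could look like without pulling in any tooling: a deterministic random-input driver that hammers one entry point. `handle_message` here is a hypothetical stand-in for whatever part of your service takes untrusted bytes; replace it with your real function. On its own this only finds panics. The value comes from running it in a build with AddressSanitizer enabled, which for Rust currently means a nightly toolchain and the unstable `-Zsanitizer=address` flag, so that bad memory accesses inside unsafe dependency code abort with a report instead of silently corrupting memory.

```rust
/// Small xorshift64 generator. It is deterministic, so a seed that
/// triggers a crash can be replayed exactly.
struct XorShift64(u64);

impl XorShift64 {
    fn new(seed: u64) -> Self {
        // xorshift must not start from zero, or it only ever yields zero.
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Produces a byte buffer of random length (0..=max_len) and random content.
fn random_input(rng: &mut XorShift64, max_len: usize) -> Vec<u8> {
    let len = (rng.next_u64() as usize) % (max_len + 1);
    (0..len).map(|_| rng.next_u64() as u8).collect()
}

/// HYPOTHETICAL stand-in for the service entry point under test.
/// Here it just counts zero bytes so the example compiles on its own.
fn handle_message(input: &[u8]) -> usize {
    input.iter().filter(|b| **b == 0).count()
}

/// Feeds `iterations` random inputs into `handle_message`.
/// Returns the sum of its results so the calls cannot be optimized away.
fn fuzz(seed: u64, iterations: usize, max_len: usize) -> usize {
    let mut rng = XorShift64::new(seed);
    let mut total = 0;
    for _ in 0..iterations {
        let input = random_input(&mut rng, max_len);
        total += handle_message(&input);
    }
    total
}
```

Log the seed before each run so a crashing run can be repeated. A coverage-guided fuzzer such as cargo-fuzz will generally find more than blind random input like this, but a driver this simple is often enough to check whether the sanitizer build works at all.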
Try switching Rust's allocator to another one like jemalloc. If the problem is caused by heap corruption, a different allocator might catch it. Jemalloc has options to overwrite freed memory, and perform extra checks and abort if it detects something wrong.
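If swapping in jemalloc on the target is awkward, the "overwrite freed memory" idea can also be approximated with a thin wrapper around the system allocator. This is only a sketch of the technique, not what jemalloc does internally: the `PoisonOnFree` type and the `0xDE` pattern are my own choices for illustration. Note that it needs a few lines of `unsafe` in your own crate, which you may or may not be happy with.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Byte pattern written over every block just before it is freed.
const POISON: u8 = 0xDE;

/// Wraps the system allocator and poisons memory on free, so that a
/// use-after-free tends to read obvious garbage (0xDEDEDE...) instead of
/// stale but plausible data.
struct PoisonOnFree {
    /// Number of frees seen so far, handy as a sanity check.
    frees: AtomicUsize,
}

unsafe impl GlobalAlloc for PoisonOnFree {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        // Allocation is delegated unchanged.
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe {
            // The block is still owned by us at this point, so overwriting
            // all `layout.size()` bytes is valid.
            std::ptr::write_bytes(ptr, POISON, layout.size());
            self.frees.fetch_add(1, Ordering::Relaxed);
            System.dealloc(ptr, layout);
        }
    }
}

// Install the wrapper for the whole program.
#[global_allocator]
static ALLOC: PoisonOnFree = PoisonOnFree {
    frees: AtomicUsize::new(0),
};
```

This does not detect anything by itself. It just makes dangling reads much more likely to crash quickly and recognizably (pointers like 0xDEDEDEDE in the core dump), closer to the actual bug. The underlying allocator may still overwrite part of a freed block with its own bookkeeping, and the extra write on every free has a cost, so treat it as a debugging build only.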
If you want to review your dependencies to ensure they don't do anything unsafe, see
Assuming you have custom hardware, it may still be a hardware issue if there's a hardware design flaw, e.g. if the crash is caused by software switching some device, which causes a power transient that glitches the processor because there aren't enough power-supply decoupling capacitors. So definitely look for software issues, but it may also be worth checking the hardware. Hardware issues don't necessarily occur in all or none of the devices: some processors, power supplies, etc. may be randomly more sensitive than others, with the hardware just on the verge of causing a problem, which would explain why only some of them have issues.