Ensure program termination?

I have an application running in production, and it ends basically like this:

#[tokio::main]
async fn main() -> Result<()> {
    let _sentry = setup()?;
    match MyApp::new().run().await {
        Err(err) => {
            tracing::warn!(?err, "exiting due to err");
            sentry_anyhow::capture_anyhow(&err);
        }
        _ => tracing::info!("exiting gracefully"),
    };

    Ok(())
}

This has been fine for a long time, but two weeks ago a third-party service was having some issues and there were a lot of errors. The last line logged by some of my workers was "exiting due to error", and then they hung, doing no work, for two weeks. I have no idea if the problem originates in my code or in some crate's code. It didn't cause a problem this time because there were enough other workers that didn't run into this, but it could have been bad if too few of them had survived.

There are a ton of background tasks running in my program (librdkafka, periodic cleanups, concurrent lookups, etc), and it would be incredibly difficult to thread channels throughout my program to ensure they all exit cleanly through main on error, assuming that is even the problem. I really don't care if they exit cleanly, I just need this program to die so that a new one can pick up where it left off.

Any advice?

Add some monitoring for your services?

Bugs happen. It may be a bug in your Rust code or in a third-party crate, it may be a bug in your service configuration, it may even be a hardware bug… The only way to cope is to ensure that something outside of your worker verifies that everything works and either restarts workers or notifies you.

If you observe issues too often, then you start acting, both because it becomes a problem for you and because it becomes possible to debug.


This shouldn't happen. It suggests that sentry_anyhow::capture_anyhow(&err) somehow encountered a deadlock, or an infinite timeout when trying to communicate with an inaccessible service. The way to ensure that the program exits is normally, well, just to exit it.
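To illustrate "just exit it": a minimal sketch, assuming the hang is somewhere in the final reporting step. It runs that step on a helper thread, waits a bounded time, then calls `std::process::exit` whether or not the step finished. The names `run_with_deadline` and `report_then_exit` and the 5-second grace period are made up for this example.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Runs `work` on a helper thread and waits at most `limit` for it.
/// Returns `true` if `work` finished in time, `false` if it timed out
/// (or panicked before signalling completion).
fn run_with_deadline<F>(limit: Duration, work: F) -> bool
where
    F: FnOnce() + Send + 'static,
{
    let (done_tx, done_rx) = mpsc::channel::<()>();
    thread::spawn(move || {
        work();
        // The receiver may already be gone after a timeout; ignore that.
        let _ = done_tx.send(());
    });
    done_rx.recv_timeout(limit).is_ok()
}

/// Hypothetical last step of `main`: try to report, then terminate no matter what.
/// The 5-second grace period is an arbitrary assumption.
fn report_then_exit(report: impl FnOnce() + Send + 'static, code: i32) -> ! {
    if !run_with_deadline(Duration::from_secs(5), report) {
        eprintln!("error reporting did not finish in time; exiting anyway");
    }
    // Ends the process immediately, even if other threads are still stuck.
    std::process::exit(code)
}
```

One caveat: `std::process::exit` does not run destructors, so anything that flushes on drop (the `_sentry` guard probably does) will not get the chance. That is the trade-off for a guaranteed exit.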

As @VorfeedCanal said, you need some external monitoring to guard against such sporadic errors. You should have something like a healthcheck service running, with an external watchdog monitoring it. If the service becomes unresponsive, the program is stuck, so the watchdog should forcefully terminate it via OS measures, e.g. send a SIGKILL.
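A rough sketch of such a watchdog, under the assumption that the worker rewrites a heartbeat file every so often (starting right at startup) and that the watchdog is the process that spawned the worker. The file-based protocol and the names `heartbeat_is_stale` and `supervise` are invented for illustration; a real deployment would more likely lean on systemd, Kubernetes probes, or similar.

```rust
use std::fs;
use std::path::Path;
use std::process::Child;
use std::time::{Duration, SystemTime};

/// True when the heartbeat file is missing or was last modified
/// more than `max_age` ago.
fn heartbeat_is_stale(path: &Path, max_age: Duration) -> bool {
    let modified = match fs::metadata(path).and_then(|m| m.modified()) {
        Ok(time) => time,
        Err(_) => return true, // no file (or no mtime): treat as dead
    };
    match SystemTime::now().duration_since(modified) {
        Ok(age) => age > max_age,
        Err(_) => false, // mtime is in the future (clock skew): treat as fresh
    }
}

/// Polls once per second; kills the worker if its heartbeat goes stale.
/// Returns once the worker has exited, either by itself or because it was killed.
fn supervise(mut worker: Child, heartbeat: &Path, max_age: Duration) -> std::io::Result<()> {
    loop {
        std::thread::sleep(Duration::from_secs(1));
        if worker.try_wait()?.is_some() {
            return Ok(()); // worker exited on its own
        }
        if heartbeat_is_stale(heartbeat, max_age) {
            worker.kill()?; // on Unix this sends SIGKILL
            worker.wait()?; // reap the child so it does not linger as a zombie
            return Ok(());
        }
    }
}
```

The important part is that the check lives outside the worker, so it keeps working no matter where the worker is stuck.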

Add some monitoring for your services?

That's a good point. I can't make sure the program dies from within the program, but I could put in some sort of liveness check that is guaranteed to fail once main exits.

I think that is probably the right approach.
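One way that liveness signal could look, as a sketch only: a guard that keeps rewriting a file from a background thread and stops (removing the file) once it is dropped. The `Heartbeat` type, the file-based signal, and the interval are all assumptions for the example, not something from my actual code.

```rust
use std::fs;
use std::path::PathBuf;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// While alive, rewrites `path` every `interval` from a background thread.
/// Dropping it stops the updates and removes the file.
struct Heartbeat {
    stop: Arc<AtomicBool>,
}

impl Heartbeat {
    fn start(path: PathBuf, interval: Duration) -> Self {
        let stop = Arc::new(AtomicBool::new(false));
        let flag = Arc::clone(&stop);
        thread::spawn(move || {
            while !flag.load(Ordering::Relaxed) {
                // Rewriting the file refreshes its modification time.
                let _ = fs::write(&path, b"alive");
                thread::sleep(interval);
            }
            // Guard was dropped: make the liveness check fail right away.
            let _ = fs::remove_file(&path);
        });
        Heartbeat { stop }
    }
}

impl Drop for Heartbeat {
    fn drop(&mut self) {
        self.stop.store(true, Ordering::Relaxed);
    }
}
```

In main I would create the guard before `MyApp::new().run().await` and call `drop(heartbeat)` as soon as `run()` returns, before any error reporting, so the check starts failing even if everything after that point hangs.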

A reminder of the halting problem (see Wikipedia): there's no general analysis that can tell whether an arbitrary program will end.

So as others have said, you need external heartbeats, active monitoring, etc.
