Ensure program termination?

mindreader · December 1, 2022, 2:25pm

I have an application running in production, and its main function basically looks like this:

#[tokio::main]
async fn main() -> Result<()> {
    let _sentry = setup()?;
    match MyApp::new().run().await {
        Err(err) => {
            tracing::warn!(?err, "exiting due to err");
            sentry_anyhow::capture_anyhow(&err);
        }
        _ => tracing::info!("exiting gracefully"),
    };

    Ok(())
}

This has been fine for a long time, but two weeks ago a third-party service was having issues and produced a lot of errors. The last log line on some of my workers was "exiting due to err", and then they hung, doing no work, for two weeks. I have no idea whether the problem originates in my code or in some crate's code. It didn't cause a problem this time because enough other workers didn't run into it, but it could have been bad if too few of them had survived.

There are a ton of background tasks running in my program (librdkafka, periodic cleanups, concurrent lookups, etc.), and it would be incredibly difficult to thread channels throughout my program to ensure they all exit cleanly through main on error, assuming that is even the problem. I really don't care if they exit cleanly; I just need this program to die so that a new one can pick up where it left off.

Any advice?

VorfeedCanal · December 1, 2022, 2:33pm

Add some monitoring for your services?

Bugs happen. It may be a bug in your Rust code or in a third-party crate, it may be a bug in your service configuration, it may even be a hardware bug… The only way to cope is to ensure that something outside of your worker verifies how everything works and either restarts workers or notifies you.

If you observe issues too often, then you start acting, both because it becomes a problem for you and because it becomes possible to debug.

afetisov · December 1, 2022, 3:42pm

This shouldn't happen. It suggests that sentry_anyhow::capture_anyhow(&err) somehow encountered a deadlock, or an infinite timeout when trying to communicate with an inaccessible service. The way to ensure that the program exits is normally, well, just to exit it.
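
One way to act on "just exit it" is to put a deadline on the final error-reporting step and then call std::process::exit regardless of whether it finished. The following is a minimal std-only sketch, not the original program: finished_within and run are hypothetical names, and the closure stands in for a call such as sentry_anyhow::capture_anyhow(&err).

```rust
use std::process;
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Runs `step` on a helper thread and reports whether it finished within `limit`.
/// Hypothetical helper for illustration only.
fn finished_within<F>(limit: Duration, step: F) -> bool
where
    F: FnOnce() + Send + 'static,
{
    let (done_tx, done_rx) = mpsc::channel();
    thread::spawn(move || {
        step();
        // The receiver may already have given up waiting; ignore the error.
        let _ = done_tx.send(());
    });
    // Returns false on timeout, and also if `step` panicked before signalling.
    done_rx.recv_timeout(limit).is_ok()
}

/// Hypothetical stand-in for the application's real entry point.
fn run() -> Result<(), String> {
    Err("upstream unavailable".to_string())
}

fn main() {
    let code = match run() {
        Ok(()) => 0,
        Err(err) => {
            // Assumption: this closure is where the real error reporting would go.
            let reported = finished_within(Duration::from_secs(5), move || {
                eprintln!("reporting: {err}");
            });
            if !reported {
                eprintln!("error reporting did not finish in time; exiting anyway");
            }
            1
        }
    };
    // Terminates the process even if other threads are still running.
    process::exit(code);
}
```

Note that std::process::exit does not run destructors, so a guard like _sentry would not get a chance to flush on drop. That trade-off may be acceptable when the only goal is for the process to die so a replacement can start.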

As @VorfeedCanal said, you need some external monitoring to guard against such sporadic errors. You should have something like a healthcheck service running, with an external watchdog monitoring it. If the service becomes unresponsive, the program is stuck, so the watchdog should forcefully terminate it via OS measures, e.g. send a SIGKILL.
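
The in-process half of such a healthcheck could be sketched as a heartbeat that the worker updates and that a health endpoint reads. This is only an illustration under assumptions: Heartbeat, beat, and is_healthy are hypothetical names, and the external watchdog that polls the endpoint and sends the SIGKILL is not shown.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

/// Shared heartbeat: the worker calls `beat`, the health endpoint calls `is_healthy`.
/// Hypothetical type for illustration only.
#[derive(Clone)]
struct Heartbeat {
    started: Instant,
    // Milliseconds since `started` at the time of the last beat.
    last_beat_ms: Arc<AtomicU64>,
}

impl Heartbeat {
    fn new() -> Self {
        Self {
            started: Instant::now(),
            last_beat_ms: Arc::new(AtomicU64::new(0)),
        }
    }

    /// Records that the worker made progress just now.
    fn beat(&self) {
        let now_ms = self.started.elapsed().as_millis() as u64;
        self.last_beat_ms.store(now_ms, Ordering::Relaxed);
    }

    /// Returns false once the worker has been silent for longer than `max_silence`.
    fn is_healthy(&self, max_silence: Duration) -> bool {
        let now_ms = self.started.elapsed().as_millis() as u64;
        let last_ms = self.last_beat_ms.load(Ordering::Relaxed);
        Duration::from_millis(now_ms.saturating_sub(last_ms)) <= max_silence
    }
}

fn main() {
    let heartbeat = Heartbeat::new();
    let worker_heartbeat = heartbeat.clone();

    // Simulated worker: beats a few times, then stops making progress.
    let worker = thread::spawn(move || {
        for _ in 0..3 {
            worker_heartbeat.beat();
            thread::sleep(Duration::from_millis(20));
        }
    });
    worker.join().unwrap();

    println!("right after work: {}", heartbeat.is_healthy(Duration::from_millis(500)));
    thread::sleep(Duration::from_millis(100));
    println!("after silence: {}", heartbeat.is_healthy(Duration::from_millis(50)));
}
```

The important property is that a stuck worker stops calling beat, so the check fails without the stuck code having to cooperate.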

mindreader · December 1, 2022, 4:37pm

VorfeedCanal wrote: "Add some monitoring for your services?"

That's a good point. I can't make sure the program dies from within the program, but I could put in some sort of liveness check that is guaranteed to fail once main exits.

I think that is probably the right approach.
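
A liveness check that fails once the main logic returns might be sketched with a drop guard, roughly as follows. LivenessGuard is a hypothetical name, and the liveness endpoint that would read the shared flag is assumed rather than shown.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Marks the process as not alive when dropped, including during a panic unwind.
/// Hypothetical type for illustration only.
struct LivenessGuard {
    alive: Arc<AtomicBool>,
}

impl LivenessGuard {
    /// Returns the guard plus a handle that a liveness endpoint can read.
    fn new() -> (Self, Arc<AtomicBool>) {
        let alive = Arc::new(AtomicBool::new(true));
        let guard = Self {
            alive: Arc::clone(&alive),
        };
        (guard, alive)
    }
}

impl Drop for LivenessGuard {
    fn drop(&mut self) {
        self.alive.store(false, Ordering::SeqCst);
    }
}

fn main() {
    let (guard, alive) = LivenessGuard::new();

    // Assumption: `alive` would be handed to whatever serves the liveness check.
    println!("alive while running: {}", alive.load(Ordering::SeqCst));

    // In the real program the guard would go out of scope when the main
    // logic returns; dropping it explicitly here simulates that.
    drop(guard);
    println!("alive after run: {}", alive.load(Ordering::SeqCst));
}
```

The guard has to be bound to a named variable: `let _ = ...` would drop it immediately. Scoping it around only the main run call means the check starts failing before any final error reporting, which is the step suspected of hanging here.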

scottmcm · December 1, 2022, 11:12pm

Reminder of the halting problem (see Wikipedia): there's no general analysis that can tell whether an arbitrary program will end.

So as others have said, you need external heartbeats, active monitoring, etc.

