Best practices around error handling, diagnostics and alerting in production

Hey community,

I'm classifying app errors into the following three categories:

"Domain Errors" - all the business invariants
"Infrastructure Exceptions" - all the SRE stuff
"Panics" - all the unhandled things, to be fixed or funnelled into one of the above buckets

The requirements are that all errors are made visible (e.g. ELK stack) to allow for better diagnostics, insights and alerting (e.g. PagerDuty).

I searched around but couldn't find much on the topic. Please reply if you know of any guidelines, best practices or concrete examples.

Thanks!

I'm not sure exactly which part of the problem you're asking about.

  • You can turn panics into errors with std::panic::catch_unwind, and most server frameworks do that for you.

  • An uncaught panic in the main thread ends the process with exit status 101 and prints the panic message to stderr, so it's good to have logging/monitoring for this.

  • For libraries, the best practice is to make the error type an enum of all the cases that can go wrong, so that the library user can match on it and handle/classify the cases as they wish.

  • Larger applications that glue libraries together tend to use a "catch-all" type for all errors (Box<dyn Error> or its fancy-crate equivalents). You can still get some machine-readable information out of these by following the chain of errors returned by error.source().
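To make the last two points concrete, here is a minimal sketch using only the standard library. The names `StoreError`, `load` and `classify` are made up for illustration, and the rule "an `std::io::Error` anywhere in the chain means infrastructure" is just an assumption about how one might map errors onto the buckets from the question.

```rust
use std::error::Error;
use std::fmt;

// Hypothetical library error: one enum variant per failure case.
#[derive(Debug)]
enum StoreError {
    NotFound(String),
    Io(std::io::Error),
}

impl fmt::Display for StoreError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            StoreError::NotFound(key) => write!(f, "key not found: {}", key),
            StoreError::Io(_) => write!(f, "storage I/O failed"),
        }
    }
}

impl Error for StoreError {
    // Expose the underlying cause so callers can follow the chain.
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        match self {
            StoreError::NotFound(_) => None,
            StoreError::Io(e) => Some(e),
        }
    }
}

// Hypothetical application code that uses the catch-all type.
fn load(key: &str) -> Result<String, Box<dyn Error>> {
    if key.is_empty() {
        // Simulated infrastructure failure, for the sake of the example.
        let io = std::io::Error::new(std::io::ErrorKind::TimedOut, "disk timed out");
        return Err(Box::new(StoreError::Io(io)));
    }
    Err(Box::new(StoreError::NotFound(key.to_string())))
}

// Walk the source() chain; treat any std::io::Error as infrastructure
// (an assumed rule), everything else as a domain error.
fn classify(err: &(dyn Error + 'static)) -> &'static str {
    let mut current = Some(err);
    while let Some(e) = current {
        if e.downcast_ref::<std::io::Error>().is_some() {
            return "infrastructure";
        }
        current = e.source();
    }
    "domain"
}

fn main() {
    for key in ["", "user-42"] {
        if let Err(err) = load(key) {
            println!("[{}] {}", classify(&*err), err);
        }
    }
}
```

The library user can still `match` on `StoreError` directly, while the application only needs `downcast_ref` and `source()` to recover the category from a `Box<dyn Error>`.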

How you send that out of the app into your infrastructure is up to you. It can be Sentry, it can be logs, or whatever else you use. Rust doesn't have a specific recommendation here.
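As one possible shape for the "panics" bucket, here is a rough sketch that wraps a unit of work in `catch_unwind` and turns the panic into a single log line that a collector could pick up. `run_job`, `run_guarded` and the `log_line` format are hypothetical, not a recommended schema. Note that `catch_unwind` only works when panics unwind (not with `panic = "abort"`), and the default panic hook will still print its own message to stderr.

```rust
use std::panic;

// Hypothetical unit of work that may panic.
fn run_job(input: i32) -> i32 {
    if input < 0 {
        panic!("negative input: {}", input);
    }
    input * 2
}

// Run the job and convert a panic into an Err carrying the panic message.
fn run_guarded(input: i32) -> Result<i32, String> {
    panic::catch_unwind(|| run_job(input)).map_err(|payload| {
        // Panic payloads are usually a &str or a String.
        if let Some(s) = payload.downcast_ref::<&str>() {
            s.to_string()
        } else if let Some(s) = payload.downcast_ref::<String>() {
            s.clone()
        } else {
            "unknown panic payload".to_string()
        }
    })
}

// Hypothetical key=value log format; use whatever your log pipeline expects.
fn log_line(category: &str, message: &str) -> String {
    format!("level=error category={} message={:?}", category, message)
}

fn main() {
    match run_guarded(-1) {
        Ok(value) => println!("ok: {}", value),
        Err(msg) => eprintln!("{}", log_line("panic", &msg)),
    }
}
```

The same `log_line` helper could be called with "domain" or "infrastructure" for ordinary errors, so that alerting rules can key off one field.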

