Excellent question, thanks! Let me share how I've learned about error handling. I'm afraid this isn't the best way to actively learn the topic (it's just a collection of anecdata), but it might still be useful.
Haskell's IO
Java's checked exceptions, Rust's `Result`, and Haskell's `IO` monad all share the property that they are contagious. If `f` returns a `Result`, and `g` calls `f`, then `g` most likely needs to return a `Result` as well. This is a general property of function coloring, common to all effect-like things. What Haskell specifically made me realize is the core question of error management:
Which part of the code cannot return an error?
In Haskell, threading `IO` everywhere is really painful, so it forces you to cleanly separate the code living in `IO` from pure code which can't do `IO`, and you win if the ratio between the two is 1:9 or thereabouts.
This transfers to Rust -- the most important thing is carving out the subset of logic which doesn't have to deal with errors, and then pushing this subset outwards.
Haskell's Monads
The second great influence of Haskell for me was just the grunt skill of transforming monads. It's pretty hard to wrap your head around all of the `Result`/`Option` combinators in a brute-force way, but it's easy to see all of them as just special cases of a few basic combinators. That's pretty powerful: you get to know things like "of course there should be a way to collect an `Iterator<Item = Result<T, E>>` into a `Result<Vec<T>, E>`, because that's `Traversable`" without reading about this specific case in the docs.
Exceptions in Kotlin/Python
In both of these languages, I had to implement the same thing:
In the context of a long-running "server" program, call an external process which can fail
This taught me that exception-based approaches fall down when the basic question ("is this situation truly exceptional?") is hard to answer. On the one hand, an external process failing is kind of exceptional, and language APIs do use exceptions. On the other hand, in my particular context failure was an expected outcome. This led me to plant a whole lot of bugs in the respective subsystems.
Midori Error Model
Joe Duffy - The Error Model -- this is a must-read article about error handling. Its two core insights (not novel, and reflected in many other systems, but rarely as explicitly articulated):
- distinguish between programmer errors (bugs), which should be managed by fixing the code, and environmental errors -- things that can go wrong during normal operation of the program
- establish isolation boundaries -- rather than managing specific error conditions (`E1 || E2 || E3`), make it possible to recover from any error (`∀ E`).
Simple testing can prevent most critical failures
https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf -- another "I read a paper and was enlightened" case. The paper makes the following observations:
- Most critical failures of complex systems are due to bugs in error-handling code (i.e., a benign transient error leads to a whole cluster of machines going belly-up due to a bug in a `catch` clause).
- The bugs in `catch` clauses are really obvious.
- They are not discovered because the code in `catch` is never executed during development, because errors never happen.
This leads to very practical advice: as much as possible, do not have special rare code paths; strive either to use the same code path for all situations, or to artificially inflate the frequency of rare code paths. In terms of code, rather than writing `if let Err(err) = fallible_op() { logic }`, put the `logic` into a `Drop` impl for some type.
Related ideas are:
- crash-only software (do not have a dedicated "orderly shutdown" sequence; make "emergency shutdown" work nicely)
- sysadmin rule of "restore from the backup every night"
- fault injection (chaosmonkey)
IntelliJ's red lightbulb
When some part of IntelliJ throws, the user sees a red lightbulb with logs and the exception stack trace. The IDE continues to work normally. And, if you use IntelliJ long enough, you will see the red lightbulb. This for me was a very practical rendition of the isolation boundaries from the Midori model.
An IDE is a very large and very complex piece of software. It is impossible to just will the bugs out of existence; IDEs are big enough to always be buggy. So instead one should embrace the bugs and make sure that the overall user experience is not compromised too much (bugs don't crash the whole system, are easy to report (for users) and observe (for devs), and fixes are timely).
Adding HTTP interface to a service
A long while ago I worked on a Rust service which was also exposed via HTTP using the `iron` framework. I noticed a curious pattern -- the HTTP layer used a meticulously crafted `enum Error`, which united a dozen dissimilar errors from various components. Ultimately, all the errors resulted in the display of a 4xx or 5xx page. That is, all of the carefully preserved information about the error was erased into a single status code and an error message. I was able to greatly simplify the code by replacing `enum Error` with `struct HttpError { code: u32, msg: String }`. I also learned very concretely about the benefits of type-erased error handling, à la the anyhow crate.
snafu
Reading the docs of the snafu crate taught me about the other approach -- the error-kind pattern. One of the ideas of snafu is (roughly) that `From` impls for error types are public API, so, ideally, you should be able to create errors without resorting to `From` impls. More generally, there's the Error API which you use as a library implementor, and there's the Error API which is used by the consumers of your library, and it's important to not mix the two. As usual, every abstraction has two interfaces -- one for the user, one for the implementor.