FickleServiceError

I have decided to go distributed, and now I have welcomed unto myself the woes of distributed. :confused:

One problem I am having is that my code has no (explicit) concept of a service that can fail because it is temporarily unavailable. It does many things and makes heavy use of the ? operator. When an error bubbles up to its final location, I need something that tells my code it can "attempt this task again" instead of doing the standard "log error and give up :man_shrugging:". But my code interfaces with many other crates in my workspace, so I was wondering...

Could I maybe make some kind of:

use std::error::Error;

pub enum FickleError {
    TryAgain,
    Other(Box<dyn Error>),
}

and have the function in question return a Result<T, impl Into<FickleError>>, so that the upstream crates in my workspace are required to implement Into<FickleError> for their errors and thereby tell me when I can possibly try again?

Two things:

  1. To achieve your specific plan, it would probably be better to add a trait that errors implement:

    trait IsRetriable {
        fn is_retriable(&self) -> bool;
    }

    Then, each error enum can implement this to return an appropriate boolean. That way, you don't have to use a single type for all cases where this property appears, nor type-erase all other errors.

  2. You will probably find that it isn't easy to classify errors as retriable or not. In particular, in a complex system, many errors are consequences of something being overloaded without being inherently about overload.

    A classification I've found more useful to start with is the HTTP 4xx/5xx distinction: errors which are "the request you sent us contains an error and cannot succeed" vs. "we had a problem processing your request". If you get the first kind, never retry because it is almost certainly futile; if you get the second kind, maybe retry. Use more specific information if you have it (e.g. some HTTP responses say whether and when to retry, such as a 503 with a Retry-After header), and if you don't, retry but not too hard.

    And don't forget to think about (and test) the overall dynamics of the system under heavy load. Beware excessive retries that multiply the load and amplify minor glitches into catastrophe.
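To make the two points above concrete, here is a minimal sketch. The names `StorageError`, `JobError`, `load`, and `run_job` are hypothetical stand-ins for types and functions in your workspace, not anything from a real crate. It shows each error type answering `is_retriable` for itself, a wrapping error delegating to the inner one, and the 4xx-like vs. 5xx-like split expressed as enum variants.

```rust
use std::fmt;

/// The trait from point 1: each error type says whether a retry could help.
pub trait IsRetriable {
    fn is_retriable(&self) -> bool;
}

/// Hypothetical error type owned by one crate in the workspace.
#[derive(Debug)]
pub enum StorageError {
    /// 4xx-like: the request itself is wrong, so a retry cannot succeed.
    InvalidKey(String),
    /// 5xx-like: the service is temporarily unavailable.
    Unavailable,
    /// 5xx-like: something failed on the service side, details unknown.
    Internal(String),
}

impl fmt::Display for StorageError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            StorageError::InvalidKey(key) => write!(f, "invalid key: {}", key),
            StorageError::Unavailable => write!(f, "storage temporarily unavailable"),
            StorageError::Internal(msg) => write!(f, "internal storage error: {}", msg),
        }
    }
}

impl std::error::Error for StorageError {}

impl IsRetriable for StorageError {
    fn is_retriable(&self) -> bool {
        match self {
            StorageError::InvalidKey(_) => false,
            StorageError::Unavailable | StorageError::Internal(_) => true,
        }
    }
}

/// Hypothetical error type of a crate that calls the storage crate.
#[derive(Debug)]
pub enum JobError {
    Storage(StorageError),
    BadInput(String),
}

// Lets `?` convert a StorageError into a JobError without erasing its type.
impl From<StorageError> for JobError {
    fn from(e: StorageError) -> Self {
        JobError::Storage(e)
    }
}

impl IsRetriable for JobError {
    fn is_retriable(&self) -> bool {
        match self {
            // Delegate: the wrapped error knows best.
            JobError::Storage(e) => e.is_retriable(),
            JobError::BadInput(_) => false,
        }
    }
}

/// Stub standing in for a real storage call (assumption for illustration).
pub fn load(key: &str) -> Result<String, StorageError> {
    match key {
        "" => Err(StorageError::InvalidKey(key.to_string())),
        "down" => Err(StorageError::Unavailable),
        _ => Ok(format!("value for {}", key)),
    }
}

/// The `?` operator keeps working; the retriable property travels with the error.
pub fn run_job(key: &str) -> Result<String, JobError> {
    let value = load(key)?;
    Ok(value)
}
```

Because the property is a method rather than a shared enum, each crate keeps its own error type and nothing has to be boxed just to carry the "try again" bit.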

I am glad to see that you say "when the error is bubbled up to its final location". Nested retry policies can result in hiding problems and creating excessive latency. Remember that the real “final location” isn't necessarily in the same server.
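As a rough illustration of retrying in exactly one place, the sketch below wraps the outermost operation in a single bounded loop with exponential backoff. The function name `retry_at_top_level`, its parameters, and the `DemoError` type are assumptions made for this example; the attempt limit and delays are placeholders you would tune against your own load tests.

```rust
use std::thread::sleep;
use std::time::Duration;

pub trait IsRetriable {
    fn is_retriable(&self) -> bool;
}

/// Hypothetical error type used only to exercise the loop below.
#[derive(Debug, PartialEq)]
pub enum DemoError {
    Busy,     // "we had a problem": worth another attempt
    Rejected, // "your request is wrong": never retry
}

impl IsRetriable for DemoError {
    fn is_retriable(&self) -> bool {
        matches!(self, DemoError::Busy)
    }
}

/// Runs `op` at least once and at most `max_attempts` times.
/// Sleeps between attempts, doubling the delay each time.
/// Meant to be called once, at the outermost layer, so retries do not nest.
pub fn retry_at_top_level<T, E, F>(
    max_attempts: u32,
    initial_delay: Duration,
    mut op: F,
) -> Result<T, E>
where
    E: IsRetriable,
    F: FnMut() -> Result<T, E>,
{
    let mut delay = initial_delay;
    let mut attempt = 1;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Stop on errors that cannot succeed, or when the budget is spent.
            Err(e) if !e.is_retriable() || attempt >= max_attempts => return Err(e),
            Err(_) => {
                sleep(delay);
                delay = delay.saturating_mul(2);
                attempt += 1;
            }
        }
    }
}
```

A call such as `retry_at_top_level(3, Duration::from_millis(100), || do_the_task())`, where `do_the_task` is whatever your top-level operation is, makes at most three attempts in total. Keeping the cap small is one way to "retry but not too hard"; adding random jitter to the delay is a common further step that this sketch leaves out.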
