How to implement failure boundaries?

Hi! Sorry for the vague question, but it genuinely interests me.

All programs have bugs, but they must do useful work anyway.

Some programs can be simply restarted (rustc, cargo, rustfmt).

Other programs must work for a long time and not crash. The prime example is a threaded web server which continues to serve clients even if one of them sends some highly unusual and unexpected data :slight_smile:

In Rust, the good kind of bug manifests as a panic, so my question is: how do you keep a panic from crashing your long-running process? How do you install an isolation boundary?

The threaded server is an easy case, because a thread is a natural boundary. But I think that sometimes you want finer-grained isolation. What would you do for a single-threaded async server?

Another interesting example is a text editor. Some edit commands will cause index-out-of-bounds panics. How should these panics be prevented from killing the whole editor?

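There's std::panic::catch_unwind, which basically acts like a thread boundary: a panic inside the closure is returned to the caller as an Err instead of unwinding any further. It should be used very sparingly.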
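A minimal sketch of what that looks like for the text editor case. The edit command `delete_char` and its wrapper `try_delete_char` are made up for illustration; only `std::panic::catch_unwind` is the real API.

```rust
use std::panic;

// Hypothetical edit command: deletes the character at byte offset `index`.
// `String::remove` panics when `index` is out of bounds, which stands in
// for a bug in the editor.
fn delete_char(text: &str, index: usize) -> String {
    let mut owned = text.to_string();
    owned.remove(index);
    owned
}

// Runs the command inside a panic boundary, turning a panic into an `Err`.
fn try_delete_char(text: &str, index: usize) -> Result<String, String> {
    panic::catch_unwind(|| delete_char(text, index))
        .map_err(|_| format!("edit command panicked (index {})", index))
}

fn main() {
    // A valid edit goes through unchanged.
    println!("{:?}", try_delete_char("hello", 1));
    // The buggy edit is contained; the process keeps running.
    println!("{:?}", try_delete_char("hello", 42));
}
```

Two caveats: the default panic hook still prints the panic message to stderr, and catch_unwind catches nothing when the program is built with panic = "abort".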

In a long running application – a daemon – I would actually do what I have seen a few times already:

First, I would implement the whole logic and the actual work as a library. Ideally it just exposes a builder that yields a struct implementing Iterator, so that a main function (added as a binary target in the same crate) can iterate over all incoming requests, much as TcpListener lets you do, and pass each yielded item on to another exposed utility function or handler struct (perhaps also created with a builder).
This covers basic configuration for the server instance and the way each request is handled.

The handler struct would have a method called handle or something and would return a Result (can have () as the Ok-type).

This leaves configuration (maybe reading a file) and error handling (printing to stderr or even using syslog) to your main.
Using this design makes it possible to easily replace your frontend, say replacing the simple binary by a library in turn loaded into an application server.

TL;DR: Using Result (and the try! macro, or the ? operator in newer Rust) to let your errors bubble up is highly recommended for this use case, instead of just panicking as soon as you encounter an error.

Edit: Actually this enables you to use error enums and the like to ensure you can easily handle different types of errors too.
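A rough sketch of the handler idea, assuming a toy protocol where every request is a number. The names `Handler`, `HandleError`, and `limit` are hypothetical and only illustrate the shape: handle returns a Result with () as the Ok type, and main decides what to do with each kind of error.

```rust
// Hypothetical error enum; the variants are illustrative only.
#[derive(Debug, PartialEq)]
enum HandleError {
    Empty,
    BadNumber(String),
    TooLarge(u32),
}

// Hypothetical handler struct with a single configuration value.
struct Handler {
    limit: u32,
}

impl Handler {
    // Expected failures are returned as `Err`, never raised as panics.
    fn handle(&self, request: &str) -> Result<(), HandleError> {
        let trimmed = request.trim();
        if trimmed.is_empty() {
            return Err(HandleError::Empty);
        }
        // `?` lets the converted parse error bubble up to the caller.
        let value: u32 = trimmed
            .parse()
            .map_err(|_| HandleError::BadNumber(trimmed.to_string()))?;
        if value > self.limit {
            return Err(HandleError::TooLarge(value));
        }
        Ok(())
    }
}

fn main() {
    let handler = Handler { limit: 100 };
    for request in &["7", "", "abc", "1000"] {
        // Error reporting stays in main, so the frontend is easy to swap.
        match handler.handle(request) {
            Ok(()) => println!("ok: {:?}", request),
            Err(err) => eprintln!("error for {:?}: {:?}", request, err),
        }
    }
}
```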

Yep, of course using Result is the preferred way to deal with expected failures. But I want to know how to deal with bugs, so using try! is not an option because there is no Result.

But this Iterator is actually a good candidate for wrapping catch_unwind around. I think the pattern here is to structure an application in a request/response fashion (even if it is completely synchronous and runs in a single thread), and to use the handling of one request as the isolation boundary (provided that important state updates are transactional).
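Here is one way the "one request as an isolation boundary" idea could look, as a sketch only. `State`, `apply`, and `handle_isolated` are invented for the example, and "transactional" is approximated by working on a clone of the state and committing it only if the handler did not panic.

```rust
use std::panic::{self, AssertUnwindSafe};

// Hypothetical application state.
#[derive(Clone, Debug, PartialEq)]
struct State {
    total: i64,
    history: Vec<i64>,
}

// Hypothetical buggy handler: it updates part of the state and then
// panics with a division by zero when `request` is 0.
fn apply(state: &mut State, request: i64) {
    state.history.push(request);
    state.total += 100 / request;
}

// Handles one request inside a panic boundary. The handler works on a
// draft copy, so a panic cannot leave `state` half-updated.
fn handle_isolated(state: &mut State, request: i64) -> bool {
    let mut draft = state.clone();
    // AssertUnwindSafe is justified here because `draft` is thrown away
    // whenever the closure panics.
    let outcome = panic::catch_unwind(AssertUnwindSafe(|| apply(&mut draft, request)));
    match outcome {
        Ok(()) => {
            *state = draft; // commit
            true
        }
        Err(_) => false, // roll back by dropping the draft
    }
}

fn main() {
    let mut state = State { total: 0, history: Vec::new() };
    for request in vec![5, 0, 20] {
        let ok = handle_isolated(&mut state, request);
        println!("request {} -> ok = {}, state = {:?}", request, ok, state);
    }
}
```

Cloning the whole state per request is only reasonable for small states; it is used here because it makes the rollback trivially correct.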

About the Iterator model:

I already use the request/response model for some stdin/TcpStream-based communication: I get an iterator over the lines of input and map over it, which gives me a usable iterator that I can simply loop over, matching on whatever I need (here it also helps to implement FromStr for an enum so that you can use str::parse).

If you really wanted to, you could then easily extend this to use a thread pool and make it parallel.
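For illustration, the FromStr part might look roughly like this. The `Command` enum and its tiny get/set/quit grammar are assumptions made up for the sketch, and a string literal stands in for stdin or a TcpStream so the example runs on its own.

```rust
use std::str::FromStr;

// Hypothetical request type for a line-based protocol.
#[derive(Debug, PartialEq)]
enum Command {
    Get(String),
    Set(String, String),
    Quit,
}

impl FromStr for Command {
    type Err = String;

    // Parses one input line; unknown input is an `Err`, not a panic.
    fn from_str(line: &str) -> Result<Self, Self::Err> {
        let words: Vec<&str> = line.split_whitespace().collect();
        match words.as_slice() {
            ["get", key] => Ok(Command::Get(key.to_string())),
            ["set", key, value] => Ok(Command::Set(key.to_string(), value.to_string())),
            ["quit"] => Ok(Command::Quit),
            _ => Err(format!("unrecognized command: {:?}", line)),
        }
    }
}

fn main() {
    // Stand-in for the lines read from stdin or a socket.
    let input = "get a\nset a 1\nbogus\nquit";
    // Mapping the line iterator through `str::parse` yields typed requests.
    for request in input.lines().map(|line| line.parse::<Command>()) {
        match request {
            Ok(Command::Quit) => break,
            Ok(command) => println!("handling {:?}", command),
            Err(err) => eprintln!("{}", err),
        }
    }
}
```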

Hm, I was going to write yet another small interpreter, and instead ended up writing a resilient REPL service. I guess I've read too many Erlang blog posts this weekend :slight_smile:

The code is here. What do you think about it?

The main insight for me was that we can use OS threads as Erlang processes, as long as the number of threads is constant ("bounded by the application architecture").
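This is not the linked code, just a small sketch of the general idea under my own assumptions: a supervisor keeps exactly one worker thread alive and respawns it whenever it panics, a bit like an Erlang supervisor. The `worker` and `supervise` functions and the "divide 100 by the job" workload are hypothetical.

```rust
use std::sync::mpsc::{self, Receiver, Sender};
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical worker: answers each job with 100 / n and therefore
// panics (a stand-in for a bug) when the job is 0.
fn worker(jobs: Arc<Mutex<Receiver<i64>>>, results: Sender<i64>) {
    loop {
        // The lock guard is a temporary that is dropped at the end of this
        // statement, so a panic further down cannot poison the mutex.
        let job = jobs.lock().unwrap().recv();
        match job {
            Ok(n) => {
                let _ = results.send(100 / n);
            }
            Err(_) => return, // all senders are gone: clean shutdown
        }
    }
}

// Keeps one worker alive until the job channel closes.
// Returns how many times the worker had to be restarted.
fn supervise(jobs: Receiver<i64>, results: Sender<i64>) -> usize {
    let jobs = Arc::new(Mutex::new(jobs));
    let mut restarts = 0;
    loop {
        let (jobs, results) = (Arc::clone(&jobs), results.clone());
        let handle = thread::spawn(move || worker(jobs, results));
        // `join` returns Err if the thread panicked: the thread boundary
        // is the isolation boundary.
        match handle.join() {
            Ok(()) => return restarts,
            Err(_) => restarts += 1,
        }
    }
}

fn main() {
    let (job_tx, job_rx) = mpsc::channel();
    let (result_tx, result_rx) = mpsc::channel();
    for n in vec![4, 0, 5] {
        job_tx.send(n).unwrap();
    }
    drop(job_tx); // no more jobs: lets the worker shut down cleanly
    let restarts = supervise(job_rx, result_tx);
    let results: Vec<i64> = result_rx.iter().collect();
    println!("results = {:?}, restarts = {}", results, restarts);
}
```

The job that triggered the panic is dropped rather than retried, which avoids a crash loop on a poisonous request.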

In the case of a text editor, because of the possibility of power loss, everything from Vim to Microsoft Word keeps a "swap" or "autosave" file which is periodically updated with the user's latest keystrokes, even if they don't save to the original file. If your editor does that too, crashing the whole process isn't the end of the world. Such a crash should only ever be the result of a bug, though: the design should never call for invalid commands entered by the user to cause a panic.