Long-running servers: detect 'might panic'

Disclaimer: I know about the halting problem. Let's not discuss that. We can avoid the issue because we allow "correct programs" to be labeled as "might panic."

Question: I'm writing long-running servers in Rust. One thing I am increasingly concerned about is that there is something that might panic / crash after a long time. For example: some u16 or u32 overflows. It does not show up in small unit tests, but keep the server running a few days or weeks, and BAM, sudden crash.

Does rustc have a 'paranoid' mode that highlights things that might cause overflows / crashes?

Again, halting problem does NOT apply here because we allow "correct programs" to be labeled as "might cause overflow".

Would a link-time check, like the one the no-panic crate performs, be of any help here?

1 Like

You could set #[deny(clippy::integer_arithmetic)] to deny the implicitly panicking/overflowing + - * / operators, then use only the checked_, saturating_, or wrapping_ arithmetic methods as appropriate for each particular usage.

This doesn't do any analysis of whether a specific numeric variable/field will or won't overflow; it only ensures that no implicit panic-or-wrap overflow (or division by zero) exists in your program (except those that might be contained in libraries you call).
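
A minimal sketch of that style (the function names are made up for illustration, and newer Clippy releases may know the lint under a different name):

// Crate-level: make every implicit arithmetic operator a hard error.
#![deny(clippy::integer_arithmetic)]

fn add_request_bytes(total: u64, chunk: u64) -> Option<u64> {
    // `total + chunk` would trip the lint; checked_add makes overflow explicit.
    total.checked_add(chunk)
}

fn bump_retry_count(count: u32) -> u32 {
    // Saturating is fine when sticking at the ceiling is acceptable.
    count.saturating_add(1)
}

fn main() {
    assert_eq!(add_request_bytes(u64::MAX, 1), None);
    assert_eq!(bump_retry_count(u32::MAX), u32::MAX);
}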

The best answer is the Erlang one: just allow it to crash, because you can't keep it from happening reliably. And have a small, clearly-correct program to restart it if it fails. (A Watchdog timer - Wikipedia, for example.) And have good diagnostics to fix the problem so it doesn't happen again once you know about it.
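
To make "a small, clearly-correct program to restart it" concrete, here is a minimal sketch of such a supervisor loop (./my-server is a placeholder path for your actual binary):

use std::process::Command;
use std::{thread, time::Duration};

fn main() {
    loop {
        // Run the real server and wait for it to exit.
        let status = Command::new("./my-server")
            .status()
            .expect("failed to spawn server");
        if status.success() {
            break; // clean shutdown, stop supervising
        }
        eprintln!("server exited with {status}; restarting in 1s");
        thread::sleep(Duration::from_secs(1));
    }
}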

(Though multi-user services are one of the few acceptable uses of catch_unwind in std::panic - Rust, so that you can exit more gracefully, rather than just hard rebooting right away, for things that are just panics. But that doesn't solve other kinds of problems, like stack overflows, so being able to restart is still the best overall plan.)
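
A minimal sketch of that catch_unwind pattern, with a hypothetical handle_request standing in for real per-request logic:

use std::panic::{catch_unwind, AssertUnwindSafe};

// Hypothetical handler; imagine it panics on some rare inputs.
fn handle_request(input: &str) -> String {
    let n: u32 = input.parse().unwrap();
    format!("parsed {n}")
}

fn main() {
    for input in ["42", "not a number", "7"] {
        // Contain the panic to this one request instead of unwinding the whole server.
        match catch_unwind(AssertUnwindSafe(|| handle_request(input))) {
            Ok(reply) => println!("ok: {reply}"),
            Err(_) => eprintln!("request failed; log it and move on"),
        }
    }
}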

4 Likes

To reframe the problem a bit, you're looking for a classifier that will divide the space of all Rust programs into two sets, A and ¬A, such that all programs that panic on some input are in ¬A. Allowing programs that never panic into ¬A does avoid the halting problem, but introduces a new one: defining a useful lower bound for A.

Labelling all programs as "might panic" trivially meets these conditions, but is an unsatisfying answer to your original inquiry. The program fn main() {}, for instance, feels like it should be successfully identified as non-panicking. The interesting question then becomes what you mean by "might", i.e., can we make the "definitely doesn't panic" set large enough to contain useful programs?

I don't have an answer, by the way. My intention here is to forestall useless semantic discussions about which programs "might panic": reasonable people can disagree about the meaning of that phrase, which is at the heart of your question.

"Purify" u32, and subsequently do not expose ordinary u32 anymore, so that your program does not become tainted as quickly: wrap it in a newtype whose arithmetic propagates overflow as None instead of panicking.

// Wrapper whose arithmetic carries overflow as None instead of panicking.
#[derive(Debug)]
struct U32(Option<u32>);

impl From<u32> for U32 {
    fn from(x: u32) -> Self { Self(Some(x)) }
}

// Lift a fallible binary operation over the wrapper, propagating None.
fn bind_binary(x: U32, y: U32, f: impl Fn(u32, u32) -> Option<u32>) -> U32 {
    U32(x.0.and_then(|x| y.0.and_then(|y| f(x, y))))
}

impl std::ops::Add<Self> for U32 {
    type Output = Self;
    fn add(self, rhs: Self) -> Self {
        bind_binary(self, rhs, u32::checked_add)
    }
}

fn main() {
    let x = U32::from(0xffffffff);
    let y = U32::from(1);
    println!("{:?}", x + y);
}

In theory, I agree with you. In practice, I would love to do this. However, it's not clear to me how to get the best of Erlang's crash-recovery philosophy and Rust's performance. Problems that come to mind:

  1. panic while holding a Mutex (see the sketch after this list)
  2. panic halfway through modifying some state
  3. Erlang calling Rust NIFs => not too useful, as a panic in Rust can take down the Erlang VM
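
To make point 1 concrete, here is a minimal sketch of how std's Mutex poisoning surfaces a panic that happened mid-update:

use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    let data = Arc::new(Mutex::new(vec![1, 2, 3]));

    // A worker panics while holding the lock, after a partial update.
    let worker_data = Arc::clone(&data);
    let _ = thread::spawn(move || {
        let mut guard = worker_data.lock().unwrap();
        guard.push(4);
        panic!("boom while holding the lock");
    })
    .join();

    // The lock is now poisoned; later users must decide whether the
    // half-updated state is still trustworthy.
    match data.lock() {
        Ok(guard) => println!("lock fine: {:?}", *guard),
        Err(poisoned) => println!("poisoned, possibly half-updated: {:?}", poisoned.into_inner()),
    }
}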

Do you know of any real-world systems that have managed to mix the best of Erlang & Rust?

Even this program can panic in practice if, e.g., a memory allocation fails.

This is absolutely correct. The same argument can be made about type checking via:

fn f(h: turing_machine, m: input) {
  h(m);
  2 + "hello world"
}

If h(m) loops forever, there is no type error; if h(m) halts, there is a type error. Thus, one could argue that type checking is as hard as the halting problem.

Yet, despite this, we have very useful type systems (like Rust's) where some "programs without type errors" are labeled as "might have a type error" and banned.

So I'm looking for something that "extends" the type-checker a bit in this direction.

Sorry for all this vagueness / analogies; if I knew precisely what I needed, I'd be using it, rather than asking if it exists. :slight_smile:

I believe in Erlang all state is local to a process. A crash brings down the whole process, so you don't have to worry about state inconsistencies.

Long before Rust, I used daemontools to manage this sort of thing for arbitrary programs.

I should clarify: the three bullet points are not meant as "problems of Erlang" but as "problems of trying to do Erlang-style programming in Rust" -- i.e. these things can happen in Rust, and because Rust has a shared-memory model whereas Erlang processes have their own heaps and communicate via messages, a crashing Rust thread can corrupt the entire state, whereas a crashing Erlang process does much more limited damage.

Also: the overhead of a wasmtime runtime is < 50% slowdown, right? So one crazy possibility is: if I can divide the server into little parts that each require < 4 GB of memory, then run each part in its own wasmtime instance, which limits how much damage a single crashing Rust 'part' can do.

That already exists: GitHub - lunatic-solutions/lunatic: Lunatic is an Erlang-inspired runtime for WebAssembly

Fine-grained, Erlang-style restarts aren't really possible in Rust (without recreating BEAM) for the reason you stated: shared state. To build robust systems it's instead very common to use OS processes as the restart unit. Here's what I do in Rust:

  1. Design your system with the understanding that the OS process can die at any time. This can happen for a variety of reasons: compiler/runtime bugs, the OOM killer, Rust panics, someone tripping over the power cord, etc.
    What this means depends on the type of application. For web APIs, design your system so each API call can be handled by a completely different instance. All of the state shared by separate API calls has to go somewhere else, like a database, Redis, etc.
    For long-running batch programs, design them so that any temporary/intermediate files, DB records, etc. don't interfere with retries. For very long-running processes (days) you can periodically persist your state to retry from.

  2. Only panic in exceptional situations not handled by your program logic. For example, avoid unwrap() or expect() unless it's a situation you think is "impossible". I set overflow-checks = true to enable overflow checks in release builds. I don't typically consider integer overflows when writing my logic, so if an overflow happens it's a situation I haven't considered, and the only reasonable thing to do is panic and restart from a known good state.

  3. Set panic = "abort" in Cargo.toml (see the sketch after this list). In #2 we said panics should only occur when we encounter an error we can't handle. When we encounter one, kill the OS process. This is safe because of the architecture we designed in #1.

  4. Delegate restarts to a higher-level system. I typically use containerized applications, so I would use Docker/Kubernetes. For non-containerized workloads you could use daemontools/systemd/etc. As a result, when we encounter a panic, the entire OS process is killed and then restarted by the external tool.
    You should be able to monitor when your system dies from a panic by looking at logs or metrics provided by your application or supervisory system.
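
For #2 and #3 together, here is a sketch of the relevant Cargo.toml settings (values as I use them; adjust per profile as needed):

[profile.release]
# #2: keep integer overflow checks enabled in release builds
overflow-checks = true
# #3: abort the OS process on panic instead of unwinding
panic = "abort"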

The result of all this is a system that is robust to unexpected restarts such as panics.

4 Likes

It doesn't provide a guarantee, but the judicious application of fuzz testing a la cargo-fuzz — Rust application // Lib.rs might help catch some of these issues before they crop up in production.
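
A minimal sketch of what such a fuzz target looks like, where my_server::parse_request is a hypothetical function standing in for whatever input-handling code you want to hammer:

// fuzz/fuzz_targets/parse.rs (created via `cargo fuzz init` / `cargo fuzz add parse`)
#![no_main]
use libfuzzer_sys::fuzz_target;

fuzz_target!(|data: &[u8]| {
    // Feed arbitrary bytes to the code under test; any panic (including an
    // overflow caught by overflow-checks) is reported as a finding.
    if let Ok(s) = std::str::from_utf8(data) {
        let _ = my_server::parse_request(s); // hypothetical function under test
    }
});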

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.