Long-running servers: detect 'might panic'

Disclaimer: I know about the halting problem. Let's not discuss that. We can avoid the issue because we allow "correct programs" to be labeled as "might panic."

Question: I'm writing long-running servers in Rust. One thing I am increasingly concerned about is that there is something that might panic / crash after a long time. For example: some u16 or u32 overflows. It does not show up in small unit tests, but keep the server running a few days or weeks, and BAM, sudden crash.

Does rustc have a 'paranoid' mode that highlights things that might cause overflows / crashes?

Again, halting problem does NOT apply here because we allow "correct programs" to be labeled as "might cause overflow".

Would a link-time check, like the one the no-panic crate performs, be of any help here?

1 Like

You could set #[deny(clippy::integer_arithmetic)] to deny the implicitly panicking/overflowing + - * / operators, then use only the checked_, saturating_, or wrapping_ arithmetic methods as appropriate for each particular usage.

This doesn't do any analysis of whether a specific numeric variable/field will or won't overflow; it only ensures that no implicit panic-or-wrap overflow (or division by zero) exists in your program (except those that might be contained in libraries you call).
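
A minimal sketch of that style (the function names are made up for illustration, and newer Clippy releases may know the lint under a different name):

// Crate-level: make every implicit arithmetic operator a hard error.
#![deny(clippy::integer_arithmetic)]

fn add_request_bytes(total: u64, chunk: u64) -> Option<u64> {
    // `total + chunk` would trip the lint; checked_add makes overflow explicit.
    total.checked_add(chunk)
}

fn bump_retry_count(count: u32) -> u32 {
    // Saturating is fine when sticking at the ceiling is acceptable.
    count.saturating_add(1)
}

fn main() {
    assert_eq!(add_request_bytes(u64::MAX, 1), None);
    assert_eq!(bump_retry_count(u32::MAX), u32::MAX);
}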

The best answer is the Erlang one: just allow it to crash, because you can't keep it from happening reliably. And have a small, clearly-correct program to restart it if it fails. (A Watchdog timer - Wikipedia, for example.) And have good diagnostics to fix the problem so it doesn't happen again once you know about it.
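
To make "a small, clearly-correct program to restart it" concrete, here is a minimal sketch of such a supervisor loop (./my-server is a placeholder path for your actual binary):

use std::process::Command;
use std::{thread, time::Duration};

fn main() {
    loop {
        // Run the real server and wait for it to exit.
        let status = Command::new("./my-server")
            .status()
            .expect("failed to spawn server");
        if status.success() {
            break; // clean shutdown, stop supervising
        }
        eprintln!("server exited with {status}; restarting in 1s");
        thread::sleep(Duration::from_secs(1));
    }
}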

(Though multi-user services are one of the few acceptable uses of catch_unwind in std::panic - Rust, so that you can exit more gracefully, rather than just hard rebooting right away, for things that are just panics. But that doesn't solve other kinds of problems, like stack overflows, so being able to restart is still the best overall plan.)
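
A minimal sketch of that catch_unwind pattern, with a hypothetical handle_request standing in for real per-request logic:

use std::panic::{catch_unwind, AssertUnwindSafe};

// Hypothetical handler; imagine it panics on some rare inputs.
fn handle_request(input: &str) -> String {
    let n: u32 = input.parse().unwrap();
    format!("parsed {n}")
}

fn main() {
    for input in ["42", "not a number", "7"] {
        // Contain the panic to this one request instead of unwinding the whole server.
        match catch_unwind(AssertUnwindSafe(|| handle_request(input))) {
            Ok(reply) => println!("ok: {reply}"),
            Err(_) => eprintln!("request failed; log it and move on"),
        }
    }
}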

4 Likes

To reframe the problem a bit, you're looking for a classifier that will divide the space of all Rust programs into two sets, A and ¬A, such that all programs that panic on some input are in ¬A. Allowing programs that never panic into ¬A does avoid the halting problem, but introduces a new one: defining a useful lower bound for A.

Labelling all programs as "might panic" trivially meets these conditions, but is an unsatisfying answer to your original inquiry. The program fn main() {}, for instance, feels like it should be successfully identified as non-panicking. The interesting question then becomes what you mean by "might", i.e., can we make the "definitely doesn't panic" set large enough to contain useful programs?

I don't have an answer, by the way. My intention here is to forestall useless semantic discussions about which programs "might panic": reasonable people can disagree about the meaning of that phrase, which is at the heart of your question.

"Purify" u32, and subsequently do not expose ordinary u32 anymore, so that your program does not become tainted as quickly: wrap it in a newtype whose arithmetic propagates overflow as None instead of panicking.

// Wrapper whose arithmetic carries overflow as None instead of panicking.
#[derive(Debug)]
struct U32(Option<u32>);

impl From<u32> for U32 {
    fn from(x: u32) -> Self { Self(Some(x)) }
}

// Lift a fallible binary operation over the wrapper, propagating None.
fn bind_binary(x: U32, y: U32, f: impl Fn(u32, u32) -> Option<u32>) -> U32 {
    U32(x.0.and_then(|x| y.0.and_then(|y| f(x, y))))
}

impl std::ops::Add<Self> for U32 {
    type Output = Self;
    fn add(self, rhs: Self) -> Self {
        bind_binary(self, rhs, u32::checked_add)
    }
}

fn main() {
    let x = U32::from(0xffffffff);
    let y = U32::from(1);
    println!("{:?}", x + y);
}

In theory, I agree with you. In practice, I would love to do this. However, it's not clear to me how to get the best of Erlang's crash-recovery philosophy and Rust's performance. Problems that come to mind:

  1. panic while holding a Mutex (see the sketch after this list)
  2. panic halfway through modifying some state
  3. Erlang calling Rust NIFs => not too useful, as a panic in Rust can take down the Erlang VM
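
To make point 1 concrete, here is a minimal sketch of how std's Mutex poisoning surfaces a panic that happened mid-update:

use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    let data = Arc::new(Mutex::new(vec![1, 2, 3]));

    // A worker panics while holding the lock, after a partial update.
    let worker_data = Arc::clone(&data);
    let _ = thread::spawn(move || {
        let mut guard = worker_data.lock().unwrap();
        guard.push(4);
        panic!("boom while holding the lock");
    })
    .join();

    // The lock is now poisoned; later users must decide whether the
    // half-updated state is still trustworthy.
    match data.lock() {
        Ok(guard) => println!("lock fine: {:?}", *guard),
        Err(poisoned) => println!("poisoned, possibly half-updated: {:?}", poisoned.into_inner()),
    }
}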

Do you know of any real-world systems that have managed to mix the best of Erlang & Rust?

Even this program can panic in practice if, e.g., a memory allocation fails.

This is absolutely correct. The same argument can be made about type checking via:

fn f(h: turing_machine, m: input) {
  h(m);
  2 + "hello world"
}

If h(m) loops forever, there is no type error; if h(m) halts, there is a type error. Thus, one could argue that type checking is as hard as the halting problem.

Yet, despite this, we have very useful type systems (like Rust's) where some "programs without type errors" are labeled as "might have a type error" and banned.

So I'm looking for something that "extends" the type-checker a bit in this direction.

Sorry for all this vagueness / analogies; if I knew precisely what I needed, I'd be using it, rather than asking if it exists. :slight_smile:

I believe in Erlang all state is local to a process. A crash brings down the whole process, so you don't have to worry about state inconsistencies.

Long before Rust, I used daemontools to manage this sort of thing for arbitrary programs.

I should clarify: the three bullet points are not meant as "problems of Erlang" but as "problems of trying to do Erlang-style programming in Rust" -- i.e. these things can happen in Rust, and because Rust has a shared-memory model whereas Erlang processes have their own heaps and communicate via messages, a crashing Rust thread can corrupt the entire state, whereas a crashing Erlang process does much more limited damage.

Also: the overhead of a wasmtime runtime is < 50% slowdown, right? So one crazy possibility is: if I can divide the server into little parts that each require < 4 GB of memory, then run each part in its own wasmtime instance, which limits how much damage a single crashing Rust 'part' can do.

That already exists: GitHub - lunatic-solutions/lunatic: Lunatic is an Erlang-inspired runtime for WebAssembly

Fine-grained, Erlang-style restarts aren't really possible in Rust (without recreating BEAM) for the reason you stated: shared state. To build robust systems it's instead very common to use OS processes as the restart unit. Here's what I do in Rust:

  1. Design your system with the understanding that the OS process can die at any time. This can happen for a variety of reasons: compiler/runtime bugs, the OOM killer, Rust panics, someone tripping over the power cord, etc.
    What this means depends on the type of application. For web APIs, design your system so each API call can be handled by a completely different instance. All of the state shared by separate API calls has to go somewhere else, like a database, Redis, etc.
    For long-running batch programs, design them so that any temporary/intermediate files, DB records, etc. don't interfere with retries. For very long-running processes (days) you can periodically persist your state to retry from.

  2. Only panic in exceptional situations not handled by your program logic. For example, avoid unwrap() or expect() unless it's a situation you think is "impossible". I set overflow-checks = true to enable overflow checks in release builds. I don't typically consider integer overflows when writing my logic, so if an overflow happens it's a situation I haven't considered, and the only reasonable thing to do is panic and restart from a known good state.

  3. Set panic = "abort" in Cargo.toml (see the sketch after this list). In #2 we said panics should only occur when we encounter an error we can't handle. When we encounter one, kill the OS process. This is safe because of the architecture we designed in #1.

  4. Delegate restarts to a higher-level system. I typically use containerized applications, so I would use Docker/Kubernetes. For non-containerized workloads you could use daemontools/systemd/etc. As a result, when we encounter a panic, the entire OS process is killed and then restarted by the external tool.
    You should be able to monitor when your system dies from a panic by looking at logs or metrics provided by your application or supervisory system.
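
For #2 and #3 together, here is a sketch of the relevant Cargo.toml settings (values as I use them; adjust per profile as needed):

[profile.release]
# #2: keep integer overflow checks enabled in release builds
overflow-checks = true
# #3: abort the OS process on panic instead of unwinding
panic = "abort"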

The result of all this is a system that is robust to unexpected restarts such as panics.

4 Likes

It doesn't provide a guarantee, but the judicious application of fuzz testing a la cargo-fuzz — Rust application // Lib.rs might help catch some of these issues before they crop up in production.
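
A minimal sketch of what such a fuzz target looks like, where my_server::parse_request is a hypothetical function standing in for whatever input-handling code you want to hammer:

// fuzz/fuzz_targets/parse.rs (created via `cargo fuzz init` / `cargo fuzz add parse`)
#![no_main]
use libfuzzer_sys::fuzz_target;

fuzz_target!(|data: &[u8]| {
    // Feed arbitrary bytes to the code under test; any panic (including an
    // overflow caught by overflow-checks) is reported as a finding.
    if let Ok(s) = std::str::from_utf8(data) {
        let _ = my_server::parse_request(s); // hypothetical function under test
    }
});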

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.