Rust, safety, performance under pressure

Say you're the CEO of a large corp, and a bunch of VPs report to you and you assign tasks to them. Say there are two types of VPs:

Type 1: At some point, this VP says: "too much work, I can't handle this new task"
Type 2: This VP always says "sure, I can do that", but, at some point, when overloaded, then just quit abruptly without warning.

We would both agree that Type 1 > Type 2. However, with my current knowledge of Rust, servers that I write tend to be of form Type 2 -- if, at any point, there are so many requests that we run out of memory -- the system just quits and dies. In my current code, I am never considering 'what if vector.push_back fails, what is hashmap.insert fails, what if FooBar::new() fails ... '. I do not think this is a problem the type checker / borrow checker can solve, because the programs are perfectly typed, it is just a matter of "$&@#*# out of memory"

Instead of 'sudden death', I would prefer a system that (1) handled as many requests as possible while sending out "hey, temporarily overloaded" to other requests

====

I believe the Erlang solution to this problem is that instead of having a few VPs, we have hundreds of thousands of cheap-to-spawn VVVVVVVVVVVPs assign them precisely one task, and if any crashes/overloads, just respawn a replacement.

====

I know that Rust has efforts like

GitHub - archseer/enigma: An Erlang VM implementation in Rust
GitHub - lunatic-solutions/lunatic: Lunatic is an Erlang inspired runtime for WebAssembly
bastion - Rust

but it seems to me, the fundamental issue is, the only solutions to this problem involves one of:

  1. only using fixed amounts of memory -- some how write program in a way that, regardless of load, always uses some (large) fixed amount of memory

  2. really really damn good memory monitoring / prediction -- i.e. somehow able to upper bound "I can guarantee, in the worst, case, handling these n additional requests will use at most M memory"

  3. localize damage / crash oriented computing / cheap respawn -- i.e. some type of VM / virtual threads where virtual threads can independently crash without damaging the rest of the system, and really really fast/cheap respawning

  4. some cool technique I am completely unaware of

====

If you have personal experience (or know of good books/blog posts) on building Rust systems that, under pressure, 'perform at peak throughout, then drop connections', instead of 'suddenly dying', I would love to hear more. Thanks!

1 Like

Generally, running out of memory is a pretty harsh error. Rust doesn't even panic when it happens — it just aborts.

In mini-redis, we have a Semaphore to put an upper limit on the number of concurrent connections.

2 Likes

Interesting. Theoretically, could someone rewrite core, std, all existing libraries to handle this, or is there something embedded so deeply in the Rust 'runtime' (whatever that might be) that makes this impossible ?

Out-of-memory can be very difficult to recover from, but I mean, you could certainly have Vec and friends have fallible variants of their methods. In fact, this is a whole big discussion in the "use Rust in Linux" thing.

1 Like

As far as I can tell any incoming request to your web server or whatever you are doing can be boiled down to the following:

  1. You have some function.

  2. That function excepts some input.

  3. That function produces some output.

It also seems to me that there are two unsolved problems in computer science about proving:

a) How long will that function take to produce its output.

b) How much memory will it use in trying to produce that output.

Never mind that in trivial cases we can estimate both of those, usually in some hand waving way.

Having done a lot of work with embedded systems I have been forced to think about this a lot. In embedded systems memory was/is very limited. In any real-time system the time to result is critical.

Typical programming languages from ALGOL through Rust with many others long the way do not address these issues. By that I mean they will happily compile your code if it is logically correct, by whatever criteria they have, but they have nothing to say about memory consumption or execution time. They don't raise excessive use of memory or time as an error.

I have only ever worked with one language that did that, Lucol, as used by Lucas Aerospace to build avionics systems. The Lucol compiler would finish, if your program was correct, by reporting exactly the maximum execution time of your program and the memory it used.

As far as I can tell Erlang does not tackle this issue either. It take the approach that, sure we know things will fail, out of memory, out of time or whatever. Let's just spin them up again to hope it works out better next time.

2 Likes

These aren't unsolved; they've been proven to be uncomputable (at least in the general case). No solution can possibly exist within formal mathematics as we currently know it. Heuristics that come close in common situations looks like the only viable approach to these problems at the moment.

The alternative is to redefine what "computable" means, but any solution based on such a redefinition would be inaccessible to current computers. It would take some extremely exotic architecture to reach outside the bounds of modern computability theory.

8 Likes

Please excuse my sloppy use of terminology.

But I stand by "unsolved".

When you point out they are proven to be "uncomputable" I take that as "proven to be forever unsolved"

The practical result is the same.

I'm not sure it's even worth trying to deal with actual out of memory situations by anything other than aborting. Linux itself will start killing off applications until the situation improves. It typically starts with short-lived apps using a lot of memory.

I generally think it is important to use terminology that can tell the difference between "we know there is no algorithm" and "we don't know whether or not there is an algorithm", and unsolved is traditionally used for the latter.

10 Likes

One issue with user-space Linux is that the virtual memory allocator over-commits. A process (regardless of what language it is written in) can request more memory than is available, and the system call will complete successfully. The “allocated” pages are only physically allocated by the kernel when they are touched (first page fault), and if they cannot be honoured, it is at that point that the infamous OOM killer starts shooting. There is no standard mechanism to report the failure back to the process at this point.

That said, over-commit can be disabled, and stupidly large allocations can still fail (for a configurable level of “stupid”), and there are other OSes, so making it prohibitively difficult to handle allocation failure is not the best solution. A program that needs to allocate a large chunk of memory for a particular discrete task, from which failure is recoverable (log the failure/alert the user, abort the specific task being attempted, keep running) might reasonably want to handle it. And within kernel code itself, allocation failure causing a kernel panic is unacceptable.

1 Like

I don't disagree. Correct use of terminology is important when discussing the minutia of theory.

But we have everyday practical matters to deal with.

In a post above I mentioned the Lucol language and how its compiler would report the exact maximal execution time and memory use of ones program. That language was used with success in many safety critical avionics systems produced by Lucas Aerospace.

For a couple of years I was employed by that company to test such systems. I though that capability of Lucol was great.

On a later project I was involved with testing code written in Ada. Because of the US military mandate to use Ada or whatever.

In testing I found that code used a totally random amount of its allowed time slot. From almost nothing to 90%. The target was 50% and exceeding 100% was system failure.

When I showed these results to the software dev manager and asked if they could prove the allocated time was not exceeded the room when very quiet.

Luckily nobody had to fly in a plane with that control system as it was cancelled.

Where am I going with this ramble? I'm not sure. I love that Rust does such a brilliant job of ensuring memory use correctness. I have no idea how it could go on to say something about memory and time usage, for where it matters, like Lucol did.

  1. I agree that Linux killing off applications is the right approach.

  2. Note: Linux doesn't kill the entire OS / restart the kernel. It kills off individual applications.

  3. Back to the Rust server example, I would view the equivalence as:

"killing off applications" == "aborting some of the requests"
"killing OS / kernel" == "server dying"

In a way, when a Rust server is being overloaded, we want it to either reject requests or somehow "kill off in progress requests" -- but have the server stay alive. Whereas right now, the server quits, which is sorta similar to the kernel quitting.

1 Like

LKML is a bit too high volume for me to process. Are there any particularly insightful parts of that debate you would suggest I look into. I completely agree that these two problems look very similar.

Theoretically, memory-usage and time-usage are both "equally difficult" in that they are both undecidable.

Practically, my intuition is that timing is even harder because to get any type of tight (say within 10x) bound, you probably need a model of the CPU -- of L3, L2, L1 cache size / delayson misses, of how many registers, of how the branch predictor works etc ...; how multicore / process f---'s things up; context switching (if there are other apps running), etc ...

1 Like

You can always use the Fortran 77 strategy, static allocation at load time. Either the program failed to start or you would never run out of memory. That suggestion is only half in jest. A custom allocator should be able to grab a big chunk of memory and produce an error that can be handled by the application should you ask for more. For example, return Result if vec.push_failable() asks for more memory than is available.

I was think more about this, and even if we modified all of core/std/... to be Result<T, MemoryError>, I don't think it solves the problem for the following reason:

Drop functions are always succeed right? So if a drop handler aborts half way due to out of memory, all types of invariants may be violated (imagine: if the drop handler of something like Rc f-ed up).

The problem I fear here is -- even if core/std/... returned Result<T, MemoryError>, we have this situation that Drop handlers may, as part of cleanup, require temporary memory allocations. Once those fails, the Drop handlers fail, and all types of invariant are broken.

So it seems we would need a world where:
(1) drop handlers never allocate
(2) core/std/... return Result<T, MemoryError>

Yes.

A big problem that programmers have today is that the systems we are programming lie to us all the time.

Ask for a megabyte of memory with malloc and the OS says "sure OK". Use that memory as Boom! You are dead.

Write some code that says:

a = 1
a = 2

And no, most likely that a=1 is optimised away. Quite likely the a gets optimised away altogether as we never read it later.

Write some code that says:

if x == y {
    do_this();
} else {
   do_that();
}

And likely, speculative execution does do_this and do_that while waiting to see if x == y is true. Hello Spectre vulnerability.

And so on.

End result is that it is totally impossible to know what our programs are doing.

1 Like

In the olden days, the OS on our mainframe would suddenly get really, really slow every once in a while. The systems guy told me to be patient; the problem would resolve itself. He explained that the OS reserved one 4 KB page so it could clean up from certain problems. It had to use that one page for everything until it could free up some more. The system was paging 99% of the time. Sure enough, 5, 10, 15 minutes later things started moving again.

In those days of 1 MB mainframes, reserving more than one page would have impacted the common path. Reserving 1 GB might not have that big an effect today. If never crashing is so critical, never allocate more than half your available memory. That should leave enough for drop handlers.

1 Like

I think there's still the problem that Vec.push_back needs to know "is this normal code" vs "is this in a drop handler"? Because when we hit whatever the 'virtual boundary' is, we need the former to fail while the latter to succeed.

That's why I had vec.failable_push() in the application code. The drop handler can use vec.push().

1 Like