Rust, safety, performance under pressure

Hmm, but then any function that could be called from both regular code and a drop handler would need two versions? :slight_smile:

Only you can decide if that's worse than a crash.

Yes, there are different acceptable levels of failure between userspace and the kernel.

When making my post I was referring strictly to a typical user application.

Services, daemons and the like are another level, and of course the kernel and drivers another.

I've disabled overcommit when working on memory constrained ARM devices before.

"Writing Rust the Elixir way" (Lunatic Blog) might actually be the solution.

If I am understanding this correctly, in this model each 'lightweight process' runs in its own Wasm sandbox, so they can crash independently without taking the rest of the system down.

Anyone with real-world experience with Lunatic?

That assumes that no other processes are running on your computer, right? I suppose if not crashing is this critical then maybe buying a few extra computers is reasonable?

Consumer web traffic can be spiky, especially when running on "elastic cloud".

Handling this properly == some dropped connections until more machines are spun up.

Handling this poorly == some servers crash; then if everything is load balanced, other machines get more traffic, and they crash too.

I dunno, philosophically, this idea of "sudden traffic spike => memory alloc spike => crash" seems very bad and the "correct" solution should be: machine operates at peak throughput, drops other connections.

I presume the key is to not handle the general case, right? And that moves the idea from unsolvable to inherently solvable. It's easy to identify functions that don't allocate, so putting an upper bound of zero isn't unsolvable, it's trivially solvable.

Just as the borrow checker doesn't need to solve the hard problem of identifying whether a program violates the borrow rules. Instead it just determines whether that program is in a particular easy-to-identify subset of the programs that don't violate the borrow rules.

One could add a feature to Rust (presumably after const generics land) to annotate functions with a maximum memory use, and then the compiler could refuse any implementation that it can't prove obeys said maximum. I don't know that this would be worthwhile, but I can't see how you could claim that it's impossible either.

Most halting-problem-style impossibility results require that the 'decider' (the compiler, in this case) eventually output {yes, no} accurately. If the compiler is allowed to output {yes, no, maybe}, the problem is definitely solvable. (Trivial case: output 'maybe' for everything, then refine to output 'yes' for the trivially provable.)

I don't have any practical experience in this area, but what I'd try to do for web servers specifically is to avoid the global allocator altogether, together with the std collections, and use a bump-allocator/scratch-space pattern instead.

When a new request comes in, I'd create a Bump and pass it down to the request-handling function. The handler would only ever allocate from this space, using bumpalo-specific collections (which are fundamentally different from std's, as they don't drop their contents). That's quite a big ask, because a large chunk of std (and crates.io) becomes useless. At the same time, it seems that, abstractly, it is worth it, because batching up all deallocations together feels like the right thing to do architecturally.

If you do go this way, you'll be able to handle memory pressure gracefully:

  • at the start of each request, you'd probably want to mmap/freelist a couple of pages for a bump allocator, and that would be an appropriate place to reject the request early on
  • you don't need to bound a request's memory requirements up front, though -- you can do additional allocations inside, and, given that you are re-implementing alloc-related APIs anyway, you'll be able to return Results / unwind (see the sketch below)
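
A minimal sketch of that per-request arena pattern, assuming the bumpalo crate with its "collections" feature enabled. The Request type, the handle_request function, the initial capacity, and the 1 MiB budget are made up for illustration, not something the poster specified:

```rust
use bumpalo::{collections::Vec as BumpVec, Bump};

// Hypothetical request type, just enough to illustrate the pattern.
struct Request<'a> {
    body: &'a [u8],
}

fn handle_request(req: &Request<'_>) -> Result<usize, &'static str> {
    // One arena per request: everything the handler allocates lives here and
    // is freed in one shot when `bump` is dropped at the end of the function.
    let bump = Bump::with_capacity(64 * 1024);

    // bumpalo's collections allocate from the arena, not the global allocator.
    let mut tokens: BumpVec<'_, &[u8]> = BumpVec::new_in(&bump);
    for chunk in req.body.split(|&b| b == b' ') {
        tokens.push(chunk);
    }

    // If the handler's working set grows past some (made-up) budget, shed the
    // request instead of pushing the whole process toward the OOM path.
    if bump.allocated_bytes() > 1024 * 1024 {
        return Err("request exceeded its memory budget");
    }

    Ok(tokens.len())
}

fn main() {
    let req = Request { body: &b"GET /index.html HTTP/1.1"[..] };
    match handle_request(&req) {
        Ok(n) => println!("handled request with {n} tokens"),
        Err(e) => eprintln!("rejected: {e}"),
    }
}
```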

The Linux kernel team is considering allowing Rust in drivers, and in the kernel/drivers, aborting on memory allocation failure would not be allowed.

Rust is adding support for fallible allocation in core and std.
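
For instance, the try_reserve methods on std's collections surface allocation failure as a Result instead of aborting. A minimal sketch (the copy_payload helper and the payload are made up for illustration):

```rust
use std::collections::TryReserveError;

// Made-up helper: copy a payload into a heap buffer, reporting allocation
// failure to the caller instead of aborting the process.
fn copy_payload(payload: &[u8]) -> Result<Vec<u8>, TryReserveError> {
    let mut buf = Vec::new();
    // Ask for the space up front; on failure we get an Err back rather than
    // the global OOM abort.
    buf.try_reserve(payload.len())?;
    buf.extend_from_slice(payload);
    Ok(buf)
}

fn main() {
    match copy_payload(&[0u8; 1024]) {
        Ok(buf) => println!("copied {} bytes", buf.len()),
        Err(e) => eprintln!("allocation failed: {e}"),
    }
}
```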

OOM handling in Rust is currently pretty bad. I work around it with:

  • fallible_collections crate, which adds try_push()?, etc. It needs to be adopted manually. Use of aborting Vec by dependencies is still a problem.

  • the cap allocator, which can track how much memory Rust has allocated. This lets me check in strategic places whether the server's memory use is currently too high, and reject new work or abort big tasks.

  • tokio's Semaphore, which controls the concurrency of async code. I can limit the number of tasks performed by the server to a conservative level that should usually fit in memory (a sketch combining these pieces is below).

But none of this is easy or reliable enough, so I can't wait for oom=panic to be stabilized.
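
A rough sketch of how two of those workarounds can fit together: the cap allocator (assuming its Cap wrapper as its documentation shows) to watch total memory use, plus a tokio Semaphore to bound concurrency. The thresholds, the do_work task, and the load-shedding policy are illustrative assumptions, not the poster's actual code; it also assumes tokio with its runtime and macro features enabled:

```rust
use std::alloc;
use std::sync::Arc;

use cap::Cap;
use tokio::sync::Semaphore;

// Wrap the system allocator so total allocation can be observed at runtime.
#[global_allocator]
static ALLOCATOR: Cap<alloc::System> = Cap::new(alloc::System, usize::MAX);

const MEMORY_HIGH_WATER: usize = 512 * 1024 * 1024; // made-up 512 MiB threshold
const MAX_CONCURRENT_TASKS: usize = 64;              // made-up concurrency cap

// Stand-in for the server's real per-task work.
async fn do_work(job: u64) {
    let _ = job;
}

#[tokio::main]
async fn main() {
    let permits = Arc::new(Semaphore::new(MAX_CONCURRENT_TASKS));

    for job in 0..1_000u64 {
        // Shed load instead of taking on more work when the process is
        // already close to its memory budget.
        if ALLOCATOR.allocated() > MEMORY_HIGH_WATER {
            eprintln!("memory high, rejecting job {job}");
            continue;
        }

        // Bound the number of in-flight tasks so their combined working set
        // usually fits in memory. The permit is released when the task ends.
        let permit = permits.clone().acquire_owned().await.unwrap();
        tokio::spawn(async move {
            do_work(job).await;
            drop(permit);
        });
    }
    // A real server would keep running and await its tasks; this sketch just
    // exits once the loop has dispatched everything.
}
```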

#[alloc_error_handler] is only allowed to abort() the process (panicking in it is forbidden), so it's not solving any problems here :frowning:

It's been a while since I looked at OS internals, but paging should take care of that problem, as I recall. All you need is to ensure enough swap space for all the processes. Performance gets bad once you start paging a lot, but you can prevent one process from causing an out-of-memory error in another one.

Sorry, but I feel the need to comment on this, since it's so obvious that you're trying to use the two terms synonymously here: technically, paging and swapping are different things, and each can be done without the other.

Of course, in our OSes nowadays, where we have both, they are usually combined into a unified mechanism, so synonymous use is probably not too wrong either.

It still sounds off to me, especially since "paging" alone / virtual memory is actually part of the problem: the OS can, similar to banks or internet service providers, promise you more memory (or cash withdrawals, or bandwidth) than would actually be physically available if all processes (or customers) were to try to access all of what they (think they) have available at the same time. Swapping to disk can partially solve this problem by turning some of the non-existing memory into existing-but-really-slow memory.

Indeed. It's also good to keep in mind that not all systems are configured to use swap.

Anyone familiar with how https://www.redox-os.org/ handles this issue?

It now occurs to me that on Linux you could avoid this with memory control groups. Properly configured, that could prevent processes outside your control group from consuming memory dedicated to your process.

At least in the past, the kernel crashed outright with a panic in case of OOM. In addition, a crashed process would leave at least part of its allocated memory leaked. As far as I know, the latter problem has since been solved, but the former hasn't yet.

I thought most Linux systems nowadays incorporated an Out Of Memory Killer that terminates processes when memory is getting in short supply, using some kind of heuristic(s) to determine which runaway process to kill while avoiding killing essential services. I always wondered how well that works out.

My typical experience is that when memory use goes crazy, a lot of swap space starts to get used. At that point the machine becomes so slow as to be unusable, it's no longer doing what it is supposed to do, and a reboot is called for.

Likewise when running out of file system space.