When will Global.allocate fail?

The sample code is on
https://play.rust-lang.org/?version=nightly&mode=release&edition=2018&gist=4b58a8672cd975f292e2e2be61ce7a7d
This sample code requests an allocation of 320 GB of memory. If the allocation fails, the program prints "hello world" and aborts. If the allocation succeeds, the program prints "world hello" and tries to write to the allocated memory.
However, when I run this code on the Rust playground, the program prints neither "hello world" nor "world hello"; it just aborts. Why does this happen?
When I run this code on my PC (a MacBook Pro with 16 GB of RAM and a 512 GB SSD), the output is "world hello", and memory quickly runs out (swap usage rises rapidly once the program starts writing to the memory). My PC clearly does not have enough memory. I assume memory overcommit is happening, but that means I cannot handle the out-of-memory situation. Since Global.allocate returns a Result to report allocation failure, why doesn't this allocation fail?

It prints hello world on the playground before aborting. Note that returning an error is not enforced for every allocator implementation.

Implementations are encouraged to return Err on memory exhaustion rather than panicking or aborting, but this is not a strict requirement. (Specifically: it is legal to implement this trait atop an underlying native allocation library that aborts on memory exhaustion.)

https://doc.rust-lang.org/stable/std/alloc/trait.Allocator.html#errors
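To make the "returns Err or aborts" distinction concrete, here is a minimal sketch of checking an allocation result by hand. It is not the code from the question: Global.allocate needs a nightly compiler, so this uses the stable std::alloc::alloc, which signals failure with a null pointer instead. The names try_alloc_bytes and AllocFailure are made up for illustration, and the 8-byte alignment is an arbitrary assumption.

```rust
use std::alloc::{alloc, dealloc, Layout};
use std::ptr::NonNull;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum AllocFailure {
    /// The size/alignment pair cannot be described by a `Layout`.
    BadLayout,
    /// The underlying allocator returned a null pointer.
    OutOfMemory,
}

/// Ask the global allocator for `size` bytes and report failure as a value.
pub fn try_alloc_bytes(size: usize) -> Result<(NonNull<u8>, Layout), AllocFailure> {
    // Calling `alloc` with a zero-sized layout is undefined behaviour, so reject it.
    if size == 0 {
        return Err(AllocFailure::BadLayout);
    }
    let layout = Layout::from_size_align(size, 8).map_err(|_| AllocFailure::BadLayout)?;
    // SAFETY: the layout has a non-zero size.
    let ptr = unsafe { alloc(layout) };
    NonNull::new(ptr)
        .map(|p| (p, layout))
        .ok_or(AllocFailure::OutOfMemory)
}

fn main() {
    match try_alloc_bytes(1 << 20) {
        Ok((ptr, layout)) => {
            println!("world hello");
            // SAFETY: `ptr` came from `alloc` with this exact layout.
            unsafe { dealloc(ptr.as_ptr(), layout) };
        }
        Err(e) => println!("hello world ({:?})", e),
    }
}
```

Whether a huge request actually reaches the OutOfMemory branch depends on the allocator and the OS, which is the point of the quoted documentation.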

On Linux the kernel overcommits memory by default. This means that it pretends that there is more memory available than there actually is. Only when you actually try to read or write the memory will the kernel allocate a physical page. If there is not enough memory, the OOM killer will kill one of your programs, normally the one with the highest memory usage; in this case, your program.


nitpick: it actually has to be a write. Linux will COW anonymously allocated virtual memory.
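A rough sketch of the two phases described here: the allocation call only reserves address space, and physical pages are needed only once each page is written. The function name reserve_then_touch and the 4096-byte page size are assumptions for illustration; the real page size depends on the platform.

```rust
use std::alloc::{alloc, dealloc, Layout};

/// Assumed page size; the real value is platform-dependent.
const PAGE: usize = 4096;

/// Allocate `size` bytes, then write one byte per page.
/// Returns the number of pages written, or `None` if the allocation failed.
pub fn reserve_then_touch(size: usize) -> Option<usize> {
    if size == 0 {
        return None;
    }
    let layout = Layout::from_size_align(size, PAGE).ok()?;
    // Phase 1: under overcommit this typically succeeds without using physical memory.
    // SAFETY: the layout has a non-zero size.
    let base = unsafe { alloc(layout) };
    if base.is_null() {
        return None;
    }
    // Phase 2: each first write to a page forces the kernel to back it.
    let mut touched = 0;
    let mut offset = 0;
    while offset < size {
        // SAFETY: `offset < size`, so the write stays inside the allocation.
        unsafe { base.add(offset).write_volatile(0xAA) };
        touched += 1;
        offset += PAGE;
    }
    // SAFETY: `base` came from `alloc` with this exact layout.
    unsafe { dealloc(base, layout) };
    Some(touched)
}

fn main() {
    // A deliberately small request, so the sketch is safe to run anywhere.
    println!("{:?}", reserve_then_touch(1 << 20));
}
```

With a request far larger than physical memory, it is the loop in phase 2, not the alloc call, where the process would be expected to die under overcommit.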

As for Rust (or any other program in Linux running under memory overcommit), there's no way to recover or handle memory exhaustion meaningfully. It will occur randomly far from where the memory was allocated and lead to the process's termination with SIGKILL.

Depending on your use case, you could perhaps try allocate_zeroed, which appears to write to the memory. This would still not cause the alloc call to fail, but at least the program would crash before returning from the alloc call.

PLEASE stop bringing up OOM killer in every discussion about OOM handling, because this persistent myth is the reason why Rust is so awful at handling OOM.

  1. Linux is not the only operating system in the world, and even in Linux this can be tuned or disabled.
  2. Significant overcommit happens only when the application uses forking and other copy-on-write pages. Even Linux with overcommit enabled can still fail allocations that it knows it can't satisfy.
  3. rlimit and cgroups can limit memory available to the process before the OS runs out of memory, so OOM killer won't be involved at all.
  4. Rust allocators (like cap) can still choose to arbitrarily limit memory, independent of the OS. This is very useful to reduce likelihood of overloaded/buggy services destabilizing the whole machine. If Rust wasn't so nihilistic about OOM handling, this would also work great to keep services from hitting their cgroup memory limit and getting killed.
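As an illustration of point 4, here is a minimal sketch of an allocator wrapper that refuses requests beyond a fixed budget, independent of the OS. This is not the API of the cap crate; the type Capped and its methods are hypothetical, and the accounting is deliberately simplistic (it tracks requested sizes only).

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical wrapper around the system allocator with a byte budget.
pub struct Capped {
    limit: usize,
    used: AtomicUsize,
}

impl Capped {
    pub const fn new(limit: usize) -> Self {
        Capped { limit, used: AtomicUsize::new(0) }
    }

    /// Bytes currently accounted as live.
    pub fn used(&self) -> usize {
        self.used.load(Ordering::Relaxed)
    }
}

unsafe impl GlobalAlloc for Capped {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let size = layout.size();
        // Reserve the bytes first, then back out if the budget is exceeded.
        let prev = self.used.fetch_add(size, Ordering::Relaxed);
        if prev.checked_add(size).map_or(true, |total| total > self.limit) {
            self.used.fetch_sub(size, Ordering::Relaxed);
            return std::ptr::null_mut(); // report failure instead of aborting
        }
        let ptr = unsafe { System.alloc(layout) };
        if ptr.is_null() {
            self.used.fetch_sub(size, Ordering::Relaxed);
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) };
        self.used.fetch_sub(layout.size(), Ordering::Relaxed);
    }
}

fn main() {
    let capped = Capped::new(1024);
    let small = Layout::from_size_align(512, 8).unwrap();
    let big = Layout::from_size_align(4096, 8).unwrap();
    unsafe {
        let p = capped.alloc(small);
        let q = capped.alloc(big);
        println!("small ok: {}, big refused: {}", !p.is_null(), q.is_null());
        if !p.is_null() {
            capped.dealloc(p, small);
        }
    }
}
```

The sketch calls the wrapper directly. Registering it with #[global_allocator] would apply the cap to the whole program, at which point a refused allocation inside std collections would go through std's abort path rather than being returned to the caller.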

OOM can happen, and could be handled properly by Rust, if only Rust didn't give up on it because of the speculations about OOM killer.


I am not doubting the technical merit of your argument; but I don't think it makes sense to direct this as a reply to @bjorn3 's answer. I found @bjorn3 's response informative / insightful, and I don't see how it is possible to talk about a program killed by the OS for triggering OOM without mentioning "OOM Killer"

Not all cases of a program being killed are related to the OOM killer. OOM killer is a very specific mechanism for recovery from a system-wide critical situation, and not a thing that just happens whenever a program runs out of memory.

Programs can be killed by the OS directly, not via the OOM killer, when they exceed their memory limits assigned by things like containers.

And Rust programs using libstd will also abort themselves voluntarily (not at the OS's hand) when libstd encounters an out-of-memory situation reported by Rust's own allocator. This usually looks like the program being killed, even though it isn't.

I'm not familiar with the background or prior discussions here, but I don't see anything specific to Rust. Could you explain?

By definition, virtual memory is outside an application's control. An application cannot tell if an allocation of virtual addresses given to it by the OS is now (or in the future) backed by physical memory, or can be backed by physical memory at some time.

If the underlying OS (like Linux in many configurations) adopts an overcommit policy (like many current distributions do) and needs to eject a passenger it will do so without giving the process any chance to react - whether it's a Rust program or, for instance, a node.js program.

I don't find anything Rust specific here, could you elaborate on how you think Rust could handle this situation?

Did you mean to write "or"? Forking is not required for significant overcommit; a simple malloc loop will do.

Are you referring to how memory limit exhaustions are handled in cgroups?
My understanding was that both v1 and v2 basically use the (or an) OOM killer, for instance oomd. But the effect will be the same, won't it? The process will have no chance to react.

Perhaps I don't understand what distinction you're drawing between "OS directly" vs "the OOM killer"

Interesting. I think I misunderstood the role of the OOM killer. At a high level, are you saying the following?

  1. somewhere in the Linux kernel, there is a module/function called invoke_oom_killer(), which gets invoked when the kernel needs to reclaim memory

  2. a Rust program that tries to allocate too much memory may get killed by something else even before invoke_oom_killer() is called

  3. therefore, "my rust program allocated too much memory and died" should not automatically imply "invoke_oom_killer() did it"

The kernel function is out_of_memory.

rlimit (which is behind ulimit) is one non-OOM example already given where you may get killed, or allocation may "just" fail (and panic in the case of infallible allocators). Virtualization/containerization is another. Having too many allocations, as opposed to too much allocated memory, is another (max_map_count). These other mechanisms can be in play even when overcommit is disabled (which is itself a situation where allocation may fail).

So yes, there are many allocation-related ways to get killed or have your infallible allocator panic.

If your process does get OOM killed, it doesn't necessarily mean it allocated too much either, just that it was the chosen victim (the ejected passenger as per the analogy linked to earlier). The victim is usually a memory hog, but need not be. Maybe some other process with a lower OOM score just went crazy.


So where does the feeling that Rust is "awful at handling OOM" come from? Are there places where the allocator doesn't relay actual allocation failures that occur at the time virtual memory addresses are being allocated (rather than asynchronously later as in OOM-killer situations)?

Doing a bit of research, I find

https://rust-lang.github.io/rfcs/2116-alloc-me-maybe.html

and

So the issue is that Rust currently aborts rather than panics even in situations where the underlying system signals a failure at allocation time.

I linked to my own prior work in a different thread. I'll briefly repeat it here. Handling OOM (the kind where an allocator detects a failure at allocation time) is very similar to handling asynchronous task termination in that it's difficult to predict where in the code the OOM situation occurs. A potential approach is to adopt a user/kernel process model. Inside the kernel, allocation failures will not occur (this is guaranteed via reserves or preallocations); outside the kernel, allocation failures will lead to the termination of all related threads inside that "process" without unwinding but with a complete reclamation of all memory in use by this process (=group of threads). This is tricky to implement in general, but Rust (due to its strict control of interobject relationships in safe code) may be the language with the best potential to pull this off.

Yes, the awfulness is from hardcoded abort() in libstd, and code around allocators and oom handlers marked as nounwind, so they're not allowed to panic either. This means that even with care it's not feasible to write a Rust program that handles OOM gracefully and reliably (unless you give up on all of std and std-dependent crates, which is a big loss).

I'm very annoyed about this, because IMHO OOM should have been just a regular panic. It probably would have been designed to be a panic if it wasn't for the "but OOM killer makes it all pointless anyway" talk.

People also assume that OOM error handling is very difficult and is going to be broken anyway. I think this view is based on experiences from C and C++, where error handling/exception safety is indeed difficult and brittle. But they're not Rust. In Rust, drops on panic are automatic. Almost all of them just release memory without allocating more. Panic safety is a real problem in Rust, but only for certain patterns in unsafe code. Panic during panic is already an abort, so it would be fair to abort if OOM happens during unwinding (with the exception that libstd should pre-allocate the memory it needs to start unwinding).
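A small sketch of the "drops on panic are automatic" claim, assuming the default panic=unwind setting. The panic here only simulates an allocation failure (real OOM in std currently aborts, as discussed above); Buffer, RELEASED and run_and_recover are illustrative names.

```rust
use std::panic;
use std::sync::atomic::{AtomicBool, Ordering};

/// Set by `Buffer::drop`, so the caller can observe that cleanup ran.
static RELEASED: AtomicBool = AtomicBool::new(false);

/// Stand-in for any value that owns heap memory.
struct Buffer(Vec<u8>);

impl Drop for Buffer {
    fn drop(&mut self) {
        // Releasing memory needs no new allocation; here we just record it.
        let _ = self.0.len();
        RELEASED.store(true, Ordering::SeqCst);
    }
}

/// Runs a task that panics midway and reports whether its buffer was dropped.
pub fn run_and_recover() -> bool {
    let result: Result<(), _> = panic::catch_unwind(|| {
        let _buf = Buffer(vec![0u8; 1024]);
        // Simulated failure: unwinding starts here and drops `_buf`.
        panic!("simulated allocation failure");
    });
    result.is_err() && RELEASED.load(Ordering::SeqCst)
}

fn main() {
    println!("recovered with cleanup: {}", run_and_recover());
}
```

No cleanup code was written at the failure site; the Drop implementation ran as part of unwinding.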

I partially agree, perhaps mostly. I don't think it's just C and C++, though. It's an experience with languages in general. For instance, in Java, theoretically, unwinding and handling OutOfMemoryError should work. In practice, it tends to quickly leave any reasonably complex system in an inconsistent state, both at the JVM level and often at the middleware level (think a servlet engine and the like). And this in a language that unlocks its mutexes and doesn't have the poisoning idea Rust has.

Now Rust may indeed be special in that it doesn't have general exceptions (catch + finally) and drop implementations are the only code that's run on unwind, and it's often structured and limited rather than a general-purpose finally. I share your optimism here as I indicated previously. However, it's not clear without experience. Think about poisoned mutexes that can occur, for instance, which general code is often not prepared to handle. Think about the possible interactions with unsafe/ffi code.

Because of this, while necessary, it's insufficient to just have a fallible allocator and the ability to unwind. You need to reason about which sections of the code you're trusting to handle this condition (either by unwinding correctly, or by not triggering OOM in the first place like in the try_reserve proposals). And then, a process-like abstraction could be useful (think C#'s application domains) so that you can reliably discard all state that should be reclaimed in an OOM event and then the remainder of the system can move on (the way an OS kernel moves on after a process is killed). Such an abstraction would also make it easier to impose memory limits in the first place. (Imposing memory limits for a specific domain of your application to guard against unexpected memory consumption seems to me a more important use case than trying to defend against the OOM killer; I think I agree with you here. This would be true even if the OOM killer didn't exist and you were trying to defend against process-wide OOM conditions.)
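For reference, the "not triggering OOM in the first place" style can be sketched with Vec::try_reserve_exact, which is available in stable std. The function name make_buffer is made up for illustration.

```rust
use std::collections::TryReserveError;

/// Build a zeroed buffer of `len` bytes, reporting allocation failure
/// as an error value instead of aborting.
pub fn make_buffer(len: usize) -> Result<Vec<u8>, TryReserveError> {
    let mut buf = Vec::new();
    // The only allocating step; it fails with Err rather than aborting.
    buf.try_reserve_exact(len)?;
    // Capacity is already reserved, so this cannot reallocate.
    buf.resize(len, 0);
    Ok(buf)
}

fn main() {
    match make_buffer(usize::MAX) {
        Ok(buf) => println!("got {} bytes", buf.len()),
        Err(e) => println!("allocation refused: {}", e),
    }
}
```

This only covers failures reported at allocation time. Under overcommit the reservation can succeed and the process can still be killed later, when the pages are written.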

I kind of like the simplicity of 'let it crash.' Having an in-program OOM handler implies having blocks of functions that are only executed during OOM situations, which sounds like a nightmare to debug.

In particular, I am really drawn to the idea of Erlang-style Rust, where lightweight Rust processes are compiled to wasm and run in wasm VMs. During an OOM, that one Rust VM crashes, freeing up its memory; its supervisor gets notified and then decides how to restart things.
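A very rough sketch of the supervisor idea, using plain threads instead of wasm VMs and a panic as a stand-in for the crash; both substitutions are assumptions made to keep the example self-contained. A thread does not have its own memory budget, so this only shows the restart logic, not the isolation. The name supervise is hypothetical.

```rust
use std::thread;

/// Run `worker` in its own thread, restarting it when it crashes.
/// Returns the worker's result and the attempt number that succeeded,
/// or `None` once `max_restarts` is exhausted.
pub fn supervise<F>(max_restarts: usize, worker: F) -> Option<(u32, usize)>
where
    F: Fn(usize) -> u32 + Clone + Send + 'static,
{
    for attempt in 0..=max_restarts {
        let w = worker.clone();
        // `join` returns Err if the worker thread panicked.
        match thread::spawn(move || w(attempt)).join() {
            Ok(value) => return Some((value, attempt)),
            Err(_) => eprintln!("worker crashed on attempt {}, restarting", attempt),
        }
    }
    None
}

fn main() {
    // Hypothetical worker that crashes on its first two attempts.
    let outcome = supervise(3, |attempt| {
        if attempt < 2 {
            panic!("simulated out-of-memory crash");
        }
        42
    });
    println!("{:?}", outcome);
}
```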
