Current Solana Out of Memory panic issue

It looks like Solana has been flooded with spam bot transactions, and validators are dropping out due to out-of-memory panics, causing several hard forks. Is it possible that Rust is not the best choice for building public-facing services that people can exploit with spam requests? What I am seeing live is what led me to ask here.

Unrelated to crypto, but I asked a similar technical question earlier: Rust, safety, performance under pressure

TLDR: the only real solutions I can think of are (1) languages like Zig, where the allocator is passed as an argument, or (2) the no_std/embedded route, where you pre-allocate up front and avoid dynamic memory allocation as much as possible.

Would you abandon Java because a program has a bug and throws an uncaught exception?

While it's a massive pain and is really hurting Solana, these out-of-memory panics should be treated like any other bug. You can resolve them by writing code that is more aggressive about releasing unused resources, doesn't allow unbounded memory growth, or has fixed memory usage (e.g. embedded applications, where you may not even have an allocator). There may also be systemic problems which enabled this situation (e.g. the Solana protocol might have scaling issues).
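
To make "doesn't allow unbounded memory growth" concrete, here is a minimal sketch of a buffer whose memory is reserved once and never grows. `BoundedBuffer` is a hypothetical name for illustration, not anything from Solana's code; when the buffer is full, the oldest item is evicted to make room.

```rust
use std::collections::VecDeque;

/// Hypothetical fixed-capacity buffer: storage is reserved once in `new`
/// and the number of stored items never exceeds `capacity`.
pub struct BoundedBuffer<T> {
    items: VecDeque<T>,
    capacity: usize,
}

impl<T> BoundedBuffer<T> {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        Self {
            items: VecDeque::with_capacity(capacity),
            capacity,
        }
    }

    /// Stores `item`. If the buffer is already full, the oldest item is
    /// evicted first and returned, so the buffer never needs to reallocate.
    pub fn push(&mut self, item: T) -> Option<T> {
        let evicted = if self.items.len() == self.capacity {
            self.items.pop_front()
        } else {
            None
        };
        self.items.push_back(item);
        evicted
    }

    /// Number of items currently stored.
    pub fn len(&self) -> usize {
        self.items.len()
    }
}
```

Whether to evict the oldest entry or reject the newest one is a policy decision that depends on the workload; the point is only that the limit is enforced in code rather than by the OOM killer.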

So long as there is a way to patch the code, yes. It was specifically Linus Torvalds questioning whether OOM process termination was a fundamental issue, and what could or could not bring it about with Rust. Combine that take with me seeing it live in action, and here I am asking :smiley: . It's specifically the link below that has me concerned, since validators can't be allowed to terminate mid-run for something as important as this, especially if competing chains, institutions, or just trolls in general can exhaust memory by flooding the network with requests. Internal closed services wouldn't have this issue, as access can be restricted. Restricting access to requests defeats the entire purpose of a public chain, so that's a non-solution in this realm. Hence why I am asking here for clarification.

https://lkml.org/lkml/2021/4/14/1099

We have a machine with finite memory and finite CPU.

When the number of requests increases without bound, what do you expect to happen? Dropping requests or restricting access seems inevitable (the other route being to crash, which would serve zero requests).

It is not clear to me at all how this is a Rust weakness, as regardless of the choice of language, one is faced with this issue.

The OOM = abort discussion has already been done to death on the internet, but if you step past the FUD you'll see progress has been made towards handling allocation errors gracefully. The Allocator API methods all return Results, and you've got (not yet stable) methods like Vec::try_reserve() or Box::try_new() which let you do normal Vec and Box operations while handling the possibility of allocation failures.
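
As a rough illustration of the fallible-allocation style, here is a sketch using `Vec::try_reserve`, which has since been stabilized (Rust 1.57); `Box::try_new` still requires nightly. The function name `reserve_bytes` is made up for this example.

```rust
use std::collections::TryReserveError;

/// Tries to allocate a buffer with room for `n` bytes. On failure the
/// caller gets an `Err` to handle instead of the process aborting.
pub fn reserve_bytes(n: usize) -> Result<Vec<u8>, TryReserveError> {
    let mut buf: Vec<u8> = Vec::new();
    // Unlike `Vec::with_capacity` or `reserve`, this returns an error
    // if the capacity overflows or the allocator reports failure.
    buf.try_reserve(n)?;
    Ok(buf)
}
```

A server could call `reserve_bytes(len)` and answer with an error response when it returns `Err`. Note that on systems with memory overcommit (the Linux default), an allocation can appear to succeed and the process can still be killed later when the pages are touched, so this complements explicit limits rather than replacing them.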

Dropping requests, yes. By restricting access, I meant more from an internal organization/intranet standpoint. I was actually reading up a lot on your topic linked above. This in particular is what is concerning me.

When you have spam bots cranking out 400k tps, the "sudden traffic spike => memory alloc spike => crash" scenario seems extremely likely.

The potential weakness in choosing Rust would be that during this spike, the validator node would terminate and have a domino effect on the rest of the network.

Ok, so the overflow issue should be fixable. Good to know. Glad I haven't wasted my time.

I think the issue described above is an architecture choice, not a language choice.

If the architecture is: always allocate on every request, then sooner or later the server gets killed by OOM.

If the architecture is: pre-allocate a block up front, and drop connections at the limit, the server continues running at some capacity.

This choice seems orthogonal to the choice of language.
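
A minimal sketch of the second architecture, using the standard library's bounded channel. `Request`, `Admission`, `bounded_queue`, and `admit` are hypothetical names for illustration; the bound caps how many requests can be queued, and anything beyond it is dropped instead of buffered.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Hypothetical request type, standing in for a real payload.
pub struct Request(pub u64);

#[derive(Debug, PartialEq)]
pub enum Admission {
    Accepted,
    Dropped,
}

/// Creates a queue that holds at most `limit` pending requests.
pub fn bounded_queue(limit: usize) -> (SyncSender<Request>, Receiver<Request>) {
    sync_channel(limit)
}

/// Enqueues `req` without blocking. If the queue is full (or the worker
/// side has gone away), the request is dropped rather than buffered.
pub fn admit(queue: &SyncSender<Request>, req: Request) -> Admission {
    match queue.try_send(req) {
        Ok(()) => Admission::Accepted,
        Err(TrySendError::Full(_)) | Err(TrySendError::Disconnected(_)) => Admission::Dropped,
    }
}
```

With `bounded_queue(2)`, the third call to `admit` returns `Admission::Dropped` until a worker takes something off the receiving end, so a flood of requests costs rejected work rather than unbounded memory.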
