Here is an example of something that is difficult to do in Rust: exception safety in a work-stealing thread pool.
If you use Result, an error is just a value, which gives you total control to determine what your program will do in that failure case. There are definitely libraries to be written to make it easy to do the ‘right thing’ with error cases in the server space (and tokio is a part of that), but I don’t think there’s anything actionable at the language level.
Is there anything specific you have in mind to improve error handling for servers that need high reliability?
Panics and double panics (== aborts) happen. A reliable system should have an isolation boundary that allows it to recover from such cases.
You’re right, which is why we have
How would you propose to recover from a double panic?
Don’t know. Another thorny issue is resource exhaustion: what if a task is stuck in an infinite loop, or eats all the memory? Perhaps we can just spawn a process?
One interesting observation is that you don’t need green threads, if the number of actors is bounded by the number of architectural components, and not by the number of clients.
So I imagine that it should be possible to build systems roughly as follows:
There is one supervising OS process, which monitors everything.
There is one frontend process, which uses async IO and catch_unwind to accept connections from an unbounded number of clients.
There are N backend threads/processes, which do the actual processing and which simply crash and are restarted by the supervisor. N is some architecture-dependent constant multiplied by the number of CPUs.
Basically, OS threads/processes as an isolation boundary.
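The layout above treats crashing as normal and puts recovery in a supervisor. Here is a minimal sketch of the restart loop, using OS threads for the backend and `JoinHandle::join` to detect a crash. `supervise` and its `max_restarts` parameter are names made up for illustration. It only covers panics that unwind: a double panic or OOM aborts the whole process, which is where a process-level supervisor would still be needed.

```rust
use std::thread;

/// Runs `worker` on its own OS thread and restarts it whenever it panics,
/// up to `max_restarts` times. Returns the number of restarts that were
/// needed, or `Err` if the worker kept crashing.
/// (Hypothetical helper, not part of std.)
fn supervise<F>(worker: F, max_restarts: u32) -> Result<u32, String>
where
    F: Fn(u32) + Send + Clone + 'static,
{
    for attempt in 0..=max_restarts {
        let w = worker.clone();
        // `join` returns `Err` if the spawned thread panicked.
        match thread::spawn(move || w(attempt)).join() {
            Ok(()) => return Ok(attempt),
            Err(_) => eprintln!("worker crashed on attempt {}", attempt),
        }
    }
    Err(format!("worker still failing after {} restarts", max_restarts))
}
```

The supervisor itself does almost nothing, which fits the idea that the higher a task sits in the hierarchy, the less there is in it that can fail.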
There was once a plan to have a secondary unwinding mode that just aborts the thread and frees its memory without running destructors. I didn’t particularly like it though. I think Niko has some fanciful ideas as well.
As @matklad suggests, though, I think double panics, OOM and other unexpected catastrophes are most reliably dealt with through process isolation. It would be great if there was an easy-to-use framework to do this. There are already some good cross-platform tools available. Something built on ipc-channel and gaol would be pretty solid.
Yeah, my intuition was that double panic would be better handled by OS processes than by a language feature. It seems like what would be useful here would be a framework for running your tokio services inside this sort of wrapper.
There is one language-level feature I think would certainly help.
- Supervision trees, a.k.a. a task hierarchy, a.k.a. error encapsulation.
The first language-level feature to enable this would be recovery from panics.
A second building block would be a mechanism for communicating failure to other processes, a.k.a. fault detection and fault identification. Standard Error/Result types will do for identification, of course.
Things that Rust already does that help:
- Fast startup times thanks to no runtime
- No GC pauses
- The Option and Result type (Shout out to ErrorChain!)
- Avoiding shared state (or at least doing it safely)
Features that may be best delivered as libraries:
- Process identification/location transparency
Unique unforgeable process ids. If I know the process id I should be able to communicate with the process.
The id could be stable across process restart boundaries. Compare this to the current channel implementation.
- Hot code upgrade - this is not a big one for me. I think it's an abused feature of Erlang.
- Protocols, protocols, protocols, contracts etc. - business as usual
- IPC - channels are OK. My opinion is that the idea of an addressable mailbox is a better metaphor (with backpressure, of course).
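To make the mailbox idea concrete, here is a rough sketch using a bounded `std::sync::mpsc::sync_channel` behind a registry keyed by a stable id. `Pid`, `Registry` and `DeliveryError` are invented for illustration. The ids here are plain integers and so not unforgeable, and everything lives in one OS process.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Hypothetical stable address; it outlives any single incarnation of a process.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct Pid(u64);

#[derive(Debug, PartialEq)]
enum DeliveryError {
    NoSuchProcess,
    /// Backpressure: the caller decides whether to retry or shed load.
    MailboxFull,
    ProcessDead,
}

/// Maps stable ids to the current bounded mailbox of each process.
#[derive(Default)]
struct Registry {
    mailboxes: HashMap<Pid, SyncSender<String>>,
}

impl Registry {
    /// Registers `pid` (or re-registers it after a restart) and returns its inbox.
    fn register(&mut self, pid: Pid, capacity: usize) -> Receiver<String> {
        let (tx, rx) = sync_channel(capacity);
        self.mailboxes.insert(pid, tx);
        rx
    }

    /// Non-blocking send addressed by id rather than by channel handle.
    fn send(&self, pid: Pid, msg: &str) -> Result<(), DeliveryError> {
        let tx = self.mailboxes.get(&pid).ok_or(DeliveryError::NoSuchProcess)?;
        tx.try_send(msg.to_string()).map_err(|e| match e {
            TrySendError::Full(_) => DeliveryError::MailboxFull,
            TrySendError::Disconnected(_) => DeliveryError::ProcessDead,
        })
    }
}
```

Senders only ever hold a `Pid`, so when a process is restarted and calls `register` again, existing senders keep working without being handed a new channel.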
A quote on OS level process isolation that sums up my feeling about it:
The only safe way to execute multiple applications, written in the Java programming language, on the same computer is to use a separate JVM for each of them, and to execute each JVM in a separate OS process. This introduces various inefficiencies in resource utilization, which downgrades performance, scalability, and application startup time.
Czajkowski and Daynès, from Sun Microsystems
Of course, Rust's lack of a runtime helps here, but there is an operational overhead to running more than one application. Even in 2016, apparently, this is an issue for some organisations.
My thoughts on double panic are this:
- The higher up the task hierarchy you go, the simpler the task is (and the less chance it will panic).
- If it double panics, let it fail. If we are OOM, let's fail as quickly as possible. But if some stupid bug in my logging component causes 0.01% of my responses to panic, we should recover with as much state as possible.
If you’re not aware, we do have the ability to recover from a panic. I think between channels & Thread::join we have the necessary support for communicating failure between threads also.
Double panics are upgraded to abort, which is all that we don’t have the ability to recover from, but you posted that you’re fine letting a double panic fail.
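For reference, here is a small sketch of the two mechanisms mentioned above: `std::panic::catch_unwind` for recovering in place, and `JoinHandle::join` for learning that another thread panicked. The helper names (`run_caught`, `run_in_thread`, `panic_message`) are illustrative, not std APIs, and this assumes the default `panic = "unwind"` setting.

```rust
use std::any::Any;
use std::panic;
use std::thread;

/// Runs `f` in place and converts a panic into an `Err` carrying the message.
fn run_caught<F>(f: F) -> Result<i32, String>
where
    F: FnOnce() -> i32 + panic::UnwindSafe,
{
    panic::catch_unwind(f).map_err(|payload| panic_message(&*payload))
}

/// Runs `f` on a new thread; `join` reports a panic in that thread as `Err`.
fn run_in_thread<F>(f: F) -> Result<i32, String>
where
    F: FnOnce() -> i32 + Send + 'static,
{
    thread::spawn(f)
        .join()
        .map_err(|payload| panic_message(&*payload))
}

/// Extracts the message from a panic payload when it is a string.
fn panic_message(payload: &(dyn Any + Send)) -> String {
    if let Some(s) = payload.downcast_ref::<&str>() {
        s.to_string()
    } else if let Some(s) = payload.downcast_ref::<String>() {
        s.clone()
    } else {
        "non-string panic payload".to_string()
    }
}
```

Neither helper can observe an abort, so a double panic still takes the whole process down.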
These mechanisms are close but no cigar. I am aware of them and have tried implementing supervision trees with them. I get the feeling that catch_unwind is for FFI use cases, i.e. a way to catch errors and then shuttle them across FFI boundaries. Thread.join is blocking, so how do I build a many-to-one supervisor out of it? I need a channel and another thread… it gets ugly quickly.
I had looked into making Thread.join a non-blocking operation because this would resolve a lot of the issues. Alas, my Unix systems programming is not up to scratch.
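One way to get a many-to-one supervisor without blocking in `join` on each thread is to have every worker carry a guard whose `Drop` impl reports over a shared channel, using `std::thread::panicking()` to tell a crash from a normal exit. This is only a sketch of that idea. `Exit`, `ExitGuard` and `run_supervised` are hypothetical names, and it does not catch aborts.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread;

/// How a worker thread ended.
#[derive(Debug, PartialEq)]
enum Exit {
    Normal,
    Panicked,
}

/// Reports to the supervisor when dropped, including during unwinding.
struct ExitGuard {
    id: usize,
    tx: Sender<(usize, Exit)>,
}

impl Drop for ExitGuard {
    fn drop(&mut self) {
        let exit = if thread::panicking() { Exit::Panicked } else { Exit::Normal };
        // Ignore send errors: the supervisor may already be gone.
        let _ = self.tx.send((self.id, exit));
    }
}

/// Spawns one thread per job and collects every exit on a single channel,
/// so the supervisor waits on one receiver instead of joining each thread.
fn run_supervised(jobs: Vec<Box<dyn FnOnce() + Send>>) -> Vec<(usize, Exit)> {
    let (tx, rx) = mpsc::channel();
    for (id, job) in jobs.into_iter().enumerate() {
        let guard = ExitGuard { id, tx: tx.clone() };
        thread::spawn(move || {
            let _guard = guard; // dropped on normal return or on unwind
            job();
        });
    }
    drop(tx); // the channel closes once every guard has been dropped
    let mut exits: Vec<(usize, Exit)> = rx.iter().collect();
    exits.sort_by_key(|e| e.0);
    exits
}
```

A real supervisor would react to each `(id, Exit::Panicked)` message as it arrives, for example by respawning that worker, rather than collecting them all at the end.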
Anyone who does use Erlang also knows that there is a range of things it's entirely unsuited for, and that includes anything computationally intensive or close to hardware. To be as reliable as Erlang you need to design the language and libraries for it, and that means making many potentially risky things impossible, and I think that would be too high a price for Rust to pay. There might be room for improvements, and I’d like to see some reliability-focused frameworks, but ultimately Rust must be practical.
Hey @Fiedzia, I agree with your point that reliability is either a language-level concern or it's not, i.e. it's relegated to the OS or to a higher-order architecture concern. I disagree that Rust will lose something by incorporating building blocks into the language that make building server systems easier. They don’t have to look or feel like Erlang but they are needed and they are practical.
I disagree that Rust will lose something by incorporating building blocks into the language
They don’t have to look or feel like Erlang but they are needed and they are practical
It doesn’t matter if they look and feel similar API-wise; what matters is whether they provide the same guarantees.
To do that you must not just add things to the language, but you have to remove many.
Making the language simpler is indeed one of the mechanisms of improving reliability. One of the reasons Rust is awesome is because it removed green threads and simplified the whole thing. I am not asking for a return of green threads or the introduction of complicated runtimes or anything like that. So I think that means we agree.
What I am asking for, though, are more basic building blocks that any reliable system is going to need, and my hypothesis is that this is possible. I refuse to believe that I have to introduce complexity at the OS operations level in order to build a safe, reliable server system in Rust.
I really should have avoided even mentioning Erlang in this thread - it's only relevant in that it is an example of what's possible if suitable building blocks exist.
I’m just not sure if these things are language concerns or library concerns for Rust. Debating that heavily influences where Rust will end up.
Is there currently something missing in Rust that would block implementing - for example - an actor system like akka as a library, safely and in a way that integrates well with the rest of the ecosystem? Sync and Send markers are there, as are Clone, Copy, mut and non-mut, and to my knowledge that gets you a long way.
Panics are a problem, but that’s the reason why panics should be avoided in library code.
I think you got this wrong. Erlang is not reliable because those things exist, but because many do not exist.
Erlang processes communicate via message passing only. Rust allows an unlimited number of ways to do it (threads, locks, low-level OS primitives, hardware features to name a few). Erlang guarantees that a process cannot be stuck in a loop, because it doesn’t have loops. It cannot be deadlocked because there are no locks in the platform.
And so on. So if you had a large system created in Rust that crashes or locks up, say, 5 times a month and requires manual intervention, you would not make it crash or lock up any less by adding anything to it. The only way would be to remove the causes of those problems. All the things you mentioned rely heavily on guarantees provided by the language to be reliable.
This is false. Erlang expresses infinite loops through tail recursion (which is equivalent in expressiveness) and even bases one of its core programming models (receive/work/tail recursion) entirely on that.
Erlang comes with a global lock module in the stdlib and a lock profiler in the VM.
Erlang gains its resilience through an (almost)* shared-nothing design, a crash-early approach, and the idea that failing components should not recover themselves.
These three concepts are very well implementable in Rust as it is. Rust being a lower-level language than Erlang kind of prohibits baking the runtime platform support into the language, but I think the current set of primitives lends itself very well to implementing these approaches.
* Binary blobs are shared in Erlang, a common source of surprise.
Erlang's reliability is a result of very precise and careful consideration, in Joe Armstrong's thesis, of what things should be present. It does not mention removing things. It makes explicit exactly which abstract things are necessary to create reliability. He then goes on to create Erlang as a concrete example of one type of system that meets the goals of the thesis. You do not need an actor model for process isolation. You do not need immutable data to prevent data sharing. You don't need message passing to implement process isolation. You don't need supervisors to implement fault recovery. You do need ways to implement the stuff I have been talking about. Currently some of those things are being delegated to the operating system, and I think this introduces complexity and therefore degrades reliability.
P.S. You can very easily put an Erlang process into a loop. You can very easily lock up an entire scheduler thread, create data races, deadlocks, etc. All these things are possible in Erlang despite the design choices it made.
Something of interest here is that most Erlang systems forget about backpressure and so often fall over under load because of bad design. It's awesome to see Tokio/futures build backpressure in from the very beginning.
This is the second major part of his thesis. Let it fail and recover gracefully… because it is impossible to remove all causes of problems.
Same as above. Code will panic. Networks will crash. Hard drives will fail. Bugs will happen.
Yip. Most things are library concerns. But things like fault recovery are very much linked to the language, or at least to the std lib.