Concerns with Rust on the Server


#1

Hi Rustaceans,

I want to talk about Rust’s failure modes. This discussion isn’t technical enough for GitHub and not interesting enough for Reddit, so I hope I can raise it here.

There is a drive within the Rust community to push for Rust on the server, which is awesome for me and my business. However, I feel one of the most important things about building server systems is being ignored or at least de-prioritised: the story around building reliable systems with Rust. (My belief that it is being de-prioritised is based on the lack of any mention of the concept at RustConf 2016, in current GitHub issues, Reddit links, user groups, libraries, the setting-our-vision-for-the-2017-cycle thread, etc. I am not saying the community doesn’t think it’s important - I am coming from the position that we are all competent.)

Even with Rust’s amazing compiler, bugs and faults are going to happen, so as competent developers we factor that into the design of the system. It’s very possible to deploy a five-nines uptime system in Brainfuck, JavaScript, Erlang or Rust, so I’m not worried about Rust in this space. It will perform as well as any other language. My concern is that it is missing an opportunity to be an amazing language for building systems, in the same way that Erlang is an amazing language for building systems. Anyone who uses Erlang seriously (think telecoms and banks) doesn’t use it because it has green threads. They use it because of how laughably simple it is to build reliable systems.

As an example, it is not uncommon to stumble over an Erlang system with a year’s worth of uptime. I say stumble over because people forgot it existed, even though it’s a really important component (and organisations suck). The error log directory of this node might be 2GB in size from all the error reports caused by bugs in some component - bugs that would have brought down a Rust, Java, etc. system. Regardless, the node kept servicing requests within its designed SLA. As far as the business is concerned, that is the perfect server.

I want to be able to do the same thing in Rust. There are ways, of course, but it’s not simple and it’s not awesome. I am looking for other people who are in this space and thinking about these issues, and I would like to help make it a reality. I am willing and able to devote a lot of my company’s time to this, because ultimately I want to use Rust’s compiler so that I can have systems with years of uptime and 0-byte error log directories. So if anyone has links, or wants to chat, please feel free to get in touch :slight_smile:

In the meantime I would like to give a shout out to hansihe and the Rustler crate, which makes putting Rust into Erlang production systems a real joy.

EDIT
Things that Rust gets right in the server space:

  • Compiler that keeps bugs low in the first place.
  • Types that encourage good error handling, i.e. Option and Result.
  • No runtime and the removal of green threads - this is simpler, and simplicity contributes to reliability because less can go wrong. Fast startup time also helps recovery.
  • Saf(er) shared data
  • No GC pauses
  • Libraries with good design like Tokio (i.e. backpressure)
  • plenty more…

Things Rust doesn’t have a good answer for:

  • Fault Recovery - current accepted practice is to delegate to the OS
  • Fault Detection - my options are channels, JoinHandle::join or catch_unwind - these are not great primitives to work with.
  • Process/Fault Isolation - Channels create explicit links between processes. This makes isolating faults difficult and makes recovery harder (if the sender dies the system must restart, if the receiver dies it can recover).

#2

Error handling was a huge part of Tokio, which was given a full talk at RustConf. The talk included examples covering things like backpressure, etc.


#3

Yes, sorry for overlooking Tokio - it is something I am evaluating as part of my own projects, so I am aware of it.


#4

Though if you are referring to Alex Crichton’s talk “Back to the Futures”, there was only one slide where he talked about backpressure - so perhaps that’s why it didn’t stick with me. Nevertheless, my omission is indefensible :slight_smile:


#5

Here is an example of something that is difficult to do in Rust: Exception safety in work-stealing thread pool


#6

If you use Result, an error is just a value, which gives you total control to determine what your program will do in that failure case. There are definitely libraries to be written to make it easy to do the ‘right thing’ with error cases in the server space (and tokio is a part of that), but I don’t think there’s anything actionable at the language level.
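To make the "an error is just a value" point concrete, here is a minimal sketch. The `divide` request format, the `RequestError` variants and the status strings are all hypothetical and only serve the illustration; nothing here comes from Tokio or any other library.

```rust
// Hypothetical error type for a toy "a/b" request format.
#[derive(Debug, PartialEq)]
enum RequestError {
    Malformed,
    Arithmetic,
}

// Parses a request such as "10/2" and returns the quotient.
// Every failure is returned as a value instead of panicking.
fn divide(request: &str) -> Result<i64, RequestError> {
    let (lhs, rhs) = request.split_once('/').ok_or(RequestError::Malformed)?;
    let lhs: i64 = lhs.trim().parse().map_err(|_| RequestError::Malformed)?;
    let rhs: i64 = rhs.trim().parse().map_err(|_| RequestError::Malformed)?;
    // checked_div returns None on division by zero and on overflow.
    lhs.checked_div(rhs).ok_or(RequestError::Arithmetic)
}

// The caller decides what each failure means for the client.
fn respond(request: &str) -> String {
    match divide(request) {
        Ok(value) => format!("200 {}", value),
        Err(RequestError::Malformed) => "400 malformed request".to_string(),
        Err(RequestError::Arithmetic) => "422 arithmetic error".to_string(),
    }
}
```

A bad request produces an error response for that one client and the server carries on; this covers expected failures only, not panics.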

Is there anything specific you have in mind to improve error handling for servers that need high reliability?


#7

Panics and double panics (== aborts) happen. A reliable system should have an isolation boundary that allows it to recover from such cases.


#8

You’re right, which is why we have catch_unwind.
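For readers who have not used it, a minimal sketch of a `catch_unwind` boundary follows. The `handle` function and its panic condition are hypothetical; the sketch also assumes the default `panic = "unwind"` setting, since nothing can be caught when panics are compiled to aborts.

```rust
use std::panic;

// Hypothetical request handler that panics on one specific input.
fn handle(request: &str) -> usize {
    if request.is_empty() {
        panic!("empty request");
    }
    request.len()
}

// Runs the handler behind a panic boundary and turns a panic into an Err.
fn handle_isolated(request: &str) -> Result<usize, String> {
    panic::catch_unwind(|| handle(request)).map_err(|payload| {
        // The payload is a Box<dyn Any + Send>; it usually holds a &str or a String.
        payload
            .downcast_ref::<&str>()
            .map(|s| s.to_string())
            .or_else(|| payload.downcast_ref::<String>().cloned())
            .unwrap_or_else(|| "unknown panic".to_string())
    })
}
```

Calling `handle_isolated("")` yields an `Err` carrying the panic message while the calling thread keeps running. A panic raised while another panic is already unwinding still aborts the process.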

How would you propose to recover from a double panic?


#9

Don’t know :slight_smile: Another thorny issue is resource exhaustion: what if a task is stuck in an infinite loop, or eats all the memory? Perhaps we can just spawn a process?

One interesting observation is that you don’t need green threads, if the number of actors is bounded by the number of architectural components, and not by the number of clients.

So I imagine that it should be possible to build systems roughly as follows:

There is one supervising OS process, which monitors everything.

There is one frontend process, which uses async IO and catch_unwind to receive connections from an unbounded number of clients.

There are N backend threads/processes which do the actual processing and which just crash and are restarted by the supervisor. N is some architecture-dependent constant multiplied by the number of CPUs.

Basically, OS threads/processes as an isolation boundary.
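The supervising process in this layout could be sketched roughly as below, using only `std::process`. The `supervise` function, its restart budget and the idea of passing a command factory are assumptions made for illustration; a real supervisor would also need backoff, logging and concurrent monitoring of several children.

```rust
use std::io;
use std::process::Command;

// Runs the command built by `make_command` and starts it again whenever it
// exits unsuccessfully, allowing up to `max_restarts` restarts.
// Returns Some(restarts_used) after a clean exit, or None if the budget ran out.
fn supervise(make_command: impl Fn() -> Command, max_restarts: u32) -> io::Result<Option<u32>> {
    for restarts in 0..=max_restarts {
        // Blocks until the child process has exited.
        let status = make_command().spawn()?.wait()?;
        if status.success() {
            return Ok(Some(restarts));
        }
        // On Unix, code() is None when the child was killed by a signal,
        // which is how an abort from a double panic shows up.
        eprintln!("backend exited with code {:?}", status.code());
    }
    Ok(None)
}
```

Because the child is a separate OS process, a double panic, an out-of-memory kill or a runaway loop in the backend cannot corrupt the supervisor's own state.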


#10

There was once a plan to have a secondary unwinding mode that just aborts the thread and frees its memory without running destructors. I didn’t particularly like it though. I think Niko has some fanciful ideas as well.

As @matklad suggests, though, I think double panics, OOM and other unexpected catastrophes are most reliably dealt with through process isolation. It would be great if there was an easy-to-use framework to do this. There are already some good cross-platform tools available. Something built on ipc-channel and gaol would be pretty solid.


#11

Yeah, my intuition was that double panic would be better handled by OS processes than by a language feature. It seems like what would be useful here would be a framework for running your Tokio services inside this sort of wrapper.


#12

There is one language-level feature that I think would certainly help.

  • Supervision trees a.k.a a task hierarchy a.k.a error encapsulation.
    The first language level feature to enable this would be recovery from panics.
    A second building block would be a mechanism for communicating failure to other processes a.k.a Fault detection and Fault identification - Standard Error/Result types will do for identification of course

Things that Rust already does that help:

  • Fast startup times thanks to no runtime
  • No GC pauses
  • The Option and Result type (Shout out to ErrorChain!)
  • Avoiding shared state (or at least doing it safely :heart:)

Features that may be best delivered as libraries:

  1. Process Identification/Location transparency
    Unique unforgeable process ids. If I know the process Id I should be able to communicate with the process.
    Id could be stable across process restart boundaries. Compare this to current channel implementation.
  2. Hot code upgrade - this is not a big one for me. I think it’s an abused feature of Erlang
  3. Protocols, Protocols, Protocols, contracts etc - business as usual
  4. IPC - channels are OK. My opinion is that the idea of an addressable mailbox is a better metaphor (with backpressure of course)
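Items 1 and 4 could be prototyped as a library roughly as follows. This is only a sketch under stated assumptions: `Pid`, `Registry` and `DeliveryError` are hypothetical names, the ids here are plain integers (so not unforgeable), and everything lives inside one OS process. It shows an id that stays valid across a restart and a bounded mailbox that pushes back when full.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};
use std::sync::Mutex;

// Hypothetical process id; callers only ever hold this, never a channel.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct Pid(u64);

#[derive(Debug, PartialEq)]
enum DeliveryError {
    NoSuchProcess,
    MailboxFull,
    ProcessDead,
}

// Maps ids to bounded mailboxes.
#[derive(Default)]
struct Registry {
    mailboxes: Mutex<HashMap<Pid, SyncSender<String>>>,
}

impl Registry {
    // Binds (or rebinds, after a restart) `pid` to a fresh bounded mailbox
    // and returns the receiving end to the process that owns the id.
    fn register(&self, pid: Pid, capacity: usize) -> Receiver<String> {
        let (tx, rx) = sync_channel(capacity);
        self.mailboxes.lock().unwrap().insert(pid, tx);
        rx
    }

    // Delivers without blocking; a full mailbox is reported to the sender,
    // which is the backpressure signal.
    fn send(&self, pid: Pid, message: &str) -> Result<(), DeliveryError> {
        let mailboxes = self.mailboxes.lock().unwrap();
        let tx = mailboxes.get(&pid).ok_or(DeliveryError::NoSuchProcess)?;
        tx.try_send(message.to_string()).map_err(|e| match e {
            TrySendError::Full(_) => DeliveryError::MailboxFull,
            TrySendError::Disconnected(_) => DeliveryError::ProcessDead,
        })
    }
}
```

A sender that only knows `Pid(1)` keeps working after the owning task dies and registers again, which is the difference from holding a raw channel endpoint.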

A quote on OS level process isolation that sums up my feeling about it:

The only safe way to execute multiple applications, written in the Java programming language, on the same computer is to use a separate JVM for each of them, and to execute each JVM in a separate OS process. This introduces various inefficiencies in resource utilization, which downgrades performance, scalability, and application startup time.
Czajkowski and Daynès, from Sun Microsystems

Of course Rust’s lack of a runtime helps here, but there is an operational overhead to running more than one application. Even in 2016 this is apparently an issue for some organisations.


#13

My thoughts on double panic are this:

  1. the higher up the task hierarchy you go the simpler the task is (the less chance it will panic)
  2. If it double panics, let it fail. If we are OOM, let’s fail as quickly as possible. But if some stupid bug in my logging component causes 0.01% of my responses to panic, we should recover with as much state as possible.

#14

If you’re not aware, we do have the ability to recover from a panic. I think between channels and JoinHandle::join we also have the necessary support for communicating failure between threads.
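As a minimal sketch of the join-based half of that: a panic inside a spawned thread surfaces as an `Err` from `join`, and the parent can react by starting the work again. `flaky_worker` and `run_with_restarts` are hypothetical names used only for this illustration.

```rust
use std::thread;

// Hypothetical worker: panics on the first two attempts, then succeeds.
fn flaky_worker(attempt: u32) -> u32 {
    if attempt < 2 {
        panic!("attempt {} failed", attempt);
    }
    attempt * 10
}

// Runs `job` on a fresh thread per attempt and restarts it after a panic,
// giving up after `max_attempts` attempts.
fn run_with_restarts(job: fn(u32) -> u32, max_attempts: u32) -> Option<u32> {
    for attempt in 0..max_attempts {
        // join() returns Err if the thread panicked instead of returning.
        match thread::spawn(move || job(attempt)).join() {
            Ok(value) => return Some(value),
            Err(_) => eprintln!("worker panicked on attempt {}", attempt),
        }
    }
    None
}
```

Note that `join` blocks the calling thread until the worker has finished, so this shape supervises exactly one worker at a time.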

Double panics are upgraded to abort, which is all that we don’t have the ability to recover from, but you posted that you’re fine letting a double panic fail.


#15

These mechanisms are close but no cigar. I am aware of them and have tried implementing supervision trees with them. I get the feeling that catch_unwind is for FFI use cases, i.e. a way to catch errors and then shuttle them across FFI boundaries. JoinHandle::join is blocking, so how do I build a many-to-one supervisor out of it? I need a channel and another thread… it gets ugly quickly.
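One possible workaround for the many-to-one case, sketched here under the assumption that panics unwind: give every worker a guard whose `Drop` reports the exit over a shared channel, so the supervisor waits on a single receiver instead of joining each thread. `Exit`, `ExitGuard`, `spawn_worker` and `job` are hypothetical names, and this is a different technique from the channel-plus-extra-thread approach described above.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread;

// How a worker ended, as seen by the supervisor.
#[derive(Debug, PartialEq)]
enum Exit {
    Normal(usize),
    Panicked(usize),
}

// Reports the worker's exit when dropped, including during unwinding.
struct ExitGuard {
    id: usize,
    tx: Sender<Exit>,
}

impl Drop for ExitGuard {
    fn drop(&mut self) {
        let exit = if thread::panicking() {
            Exit::Panicked(self.id)
        } else {
            Exit::Normal(self.id)
        };
        // Ignore the error: the supervisor may already be gone.
        let _ = self.tx.send(exit);
    }
}

fn spawn_worker(id: usize, tx: Sender<Exit>, job: fn(usize)) {
    thread::spawn(move || {
        let _guard = ExitGuard { id, tx };
        job(id);
    });
}

// Hypothetical job: odd-numbered workers panic.
fn job(id: usize) {
    if id % 2 == 1 {
        panic!("worker {} failed", id);
    }
}

// Starts `workers` threads and collects one exit report per worker.
fn supervise(workers: usize) -> Vec<Exit> {
    let (tx, rx) = mpsc::channel();
    for id in 0..workers {
        spawn_worker(id, tx.clone(), job);
    }
    drop(tx); // rx.iter() ends once every guard has been dropped
    let mut exits: Vec<Exit> = rx.iter().collect();
    exits.sort_by_key(|e| match e {
        Exit::Normal(id) | Exit::Panicked(id) => *id,
    });
    exits
}
```

A real supervisor would restart a worker on `Exit::Panicked` rather than only collecting the reports. The guard cannot report an abort, so double panics still need an outer process boundary.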


#16

I had looked into making JoinHandle::join a non-blocking operation because this would unlock a lot of these issues. Alas, my Unix systems programming is not up to scratch.


#17

Anyone who does use Erlang also knows that there is a range of things it’s entirely unsuited for, and that includes anything computationally intensive or close to hardware. To be as reliable as Erlang you need to design the language and libraries for it, and that means making many potentially risky things impossible, and I think that would be too high a price for Rust to pay. There might be room for improvement, and I’d like to see some reliability-focused frameworks, but ultimately Rust must be practical.


#18

Hey @Fiedzia, I agree with your point that reliability is either a language-level concern or it’s not, i.e. it’s relegated to the OS or higher-order architecture. I disagree that Rust will lose something by incorporating building blocks into the language that make building server systems easier. They don’t have to look or feel like Erlang but they are needed and they are practical.


#19

I disagree that Rust will lose something by incorporating building blocks into the language
They don’t have to look or feel like Erlang but they are needed and they are practical

It doesn’t matter if they look and feel similar API wise, what matters is if they provide the same guarantees.
To do that you must not just add things to the language, but you have to remove many.


#20

Making the language simpler is indeed one of the mechanisms for improving reliability. One of the reasons Rust is awesome is because it removed green threads and simplified the whole thing. I am not asking for a return of green threads or the introduction of complicated runtimes or anything like that. So I think that means we agree.

What I am asking for, though, are more basic building blocks that any reliable system is going to need, and my hypothesis is that this is possible. I refuse to believe that I have to introduce complexity at the OS operations level in order to build safe, reliable server systems in Rust.

Process Isolation
Fault Detection
Fault Identification
Fault Recovery
etc.

I really should have avoided even mentioning Erlang in this thread - it’s only relevant in that it is an example of what’s possible if suitable building blocks exist.