Concerns with Rust on the Server

I'm just not sure whether these things are language concerns or library concerns for Rust. How that debate resolves heavily influences where Rust will end up.

Is there currently something missing in Rust that would block implementing - for example - an actor system like Akka as a library, safely and in a way that integrates well into the rest of the ecosystem? The Sync and Send markers are there, along with Clone, Copy, and mutability control, and to my knowledge that brings you a long way.
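To make that concrete, here is a minimal sketch using nothing but std (spawn_actor is a hypothetical helper of mine, not from any crate): the M: Send bound is the compiler-checked guarantee that makes a typed mailbox safe to hand across threads.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical helper: a bare-bones "actor" that owns its state and is
// reachable only through a typed mailbox.
fn spawn_actor<M, F>(mut handle: F) -> (mpsc::Sender<M>, thread::JoinHandle<()>)
where
    M: Send + 'static,            // messages must be transferable across threads
    F: FnMut(M) + Send + 'static, // the behaviour (and its state) moves to the actor
{
    let (tx, rx) = mpsc::channel();
    let worker = thread::spawn(move || {
        // State lives inside `handle`; the only way in is the channel.
        for msg in rx {
            handle(msg);
        }
    });
    (tx, worker)
}

fn main() {
    let (tx, worker) = spawn_actor(|n: u32| println!("got {}", n));
    tx.send(1).unwrap();
    tx.send(2).unwrap();
    drop(tx);               // closing the mailbox ends the actor's loop...
    worker.join().unwrap(); // ...so we can join it cleanly
}
```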

Panics are a problem, but that's exactly why panics should be avoided in library code.


I think you got this wrong. Erlang is not reliable because those things exist, but because many things do not exist.
Erlang processes communicate via message passing only; Rust allows an endless number of ways to do it (threads, locks, low-level OS primitives, and hardware features, to name a few). Erlang guarantees that a process cannot be stuck in a loop,
because it doesn't have loops. It cannot be deadlocked because there are no locks in the platform.
And so on. So if you had a large system created in Rust that crashes or locks up, say, five times a month and requires manual intervention, you will not make it crash or lock up any less by adding anything to it. The only way would be to remove the causes of those problems. All the things you mentioned rely heavily on guarantees provided by the language to be reliable.


This is false. Erlang expresses infinite loops through tail recursion (which is equivalent in expressiveness) and even bases one of its core programming models (receive/work/tail recursion) entirely on that.

Erlang comes with a global lock module in the stdlib and a lock profiler in the VM.

http://erlang.org/doc/apps/tools/lcnt_chapter.html

http://erlang.org/doc/man/global.html

Erlang gains its resilience through an (almost)* shared-nothing architecture, a crash-early approach, and the idea that failing components should not recover themselves.

These three concepts are very well implementable in Rust as it is (a minimal sketch follows the footnote below). Rust being a lower-level language than Erlang kind of prohibits baking the runtime platform support into the language, but I think the current set of primitives lends itself very well to implementing these approaches.

  • Binary blobs are shared in Erlang, a common source of surprise.
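As a rough illustration of those three concepts with plain std (my own sketch, not an established library pattern): the worker owns its data outright (shared nothing), panics the moment an invariant breaks (crash early), and leaves recovery to its parent, which observes the failure through JoinHandle::join (no self-recovery).

```rust
use std::thread;

fn main() {
    let worker = thread::spawn(|| {
        // Shared nothing: this Vec is owned by the worker alone; no Arc, no locks.
        let data = vec![1, 2, 3];
        // Crash early: a broken invariant is a panic, not a limp-along.
        if data.len() < 5 {
            panic!("invariant violated");
        }
    });

    // The failing component does not recover itself; the parent decides.
    match worker.join() {
        Ok(()) => println!("worker finished"),
        Err(payload) => {
            println!("worker crashed: {:?}", payload.downcast_ref::<&str>());
            // ...restart it, escalate, or give up.
        }
    }
}
```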

Erlang's reliability is the result of very precise and careful consideration in Joe Armstrong's thesis about which things should be present. It does not mention removing things; it makes explicit exactly which abstract things are necessary to create reliability. He then goes on to create Erlang as one concrete example of a system that meets the thesis's goals. You do not need an actor model for process isolation. You do not need immutable data to prevent data sharing. You don't need message passing to implement process isolation. You don't need supervisors to implement fault recovery. You do need ways to implement the things I have been talking about. Currently some of those things are delegated to the operating system, and I think this introduces complexity and therefore degrades reliability.

P.S. You can very easily put an Erlang process into a loop. You can very easily lock up an entire scheduler thread, create data races, deadlocks, etc. All these things are possible in Erlang despite the design choices it made.
Something of interest here is that most Erlang systems forget about backpressure and so often fall over under load because of bad design. It's awesome to see Tokio/futures building backpressure in from the very beginning.

This is the second major part of his thesis: let it fail and recover gracefully, because it is impossible to remove all causes of problems.

Same as above. Code will panic. Networks will crash. Hard drives will fail. Bugs will happen.

Yip. Most things are library concerns. But things like fault recovery are very much linked to the language, or at least to the standard library.

This is important. Simplicity is not only beautiful, but it also contributes to making the system as a whole more reliable.

Sure, but you can isolate against such cases and reduce them. A network crash, in Rust lingo, will also not be a panic but an error. Panics are reserved for "this is fundamentally broken; I cannot go on or recover in any way".

Back to the question. Which primitives in Rust are missing to build such things?

On fault recovery: if you read the threads about the panic and abort features, you may find that there are quite a few failure modes very specific to particular platforms; I'm not convinced I see that as a stdlib concern.


The biggest thing is a way to recover from faults inside the process rather than delegating to the OS/platform.
This allows me to create hierarchies of tasks and lets faults propagate up the tree until they reach a place where graceful recovery can occur. This is better than letting the OS handle the recovery (i.e. via restart), because I may be able to maintain state. I also isolate the fault to one process/supervision tree rather than bringing down all my processes and restarting. A basic primitive for this would look similar to catch_unwind. This, for me, is something that should exist in std.

Other primitives I would like to see are ways to communicate that obey the idea of process isolation. Currently, channels are tied to the Receiver, so if it fails, the channel becomes useless and you need some mechanism to renegotiate a channel. I imagine most Rust applications just panic at this point and let the OS restart the whole process. Which is fine; it's just not awesome. This, for me, is a library concern.
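For illustration, here is a sketch of that renegotiation dance with std channels (spawn_worker is a hypothetical helper; a real design would monitor the JoinHandle rather than sleeping):

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread;
use std::time::Duration;

// The "supervisor" is the only party that knows how to (re)spawn the
// worker, so clients renegotiate channels with it alone.
fn spawn_worker() -> Sender<&'static str> {
    let (tx, rx) = channel();
    thread::spawn(move || {
        for msg in rx {
            if msg == "boom" {
                panic!("worker hit a bug"); // the Receiver is dropped as we unwind
            }
            println!("worker got: {}", msg);
        }
    });
    tx
}

fn main() {
    let mut tx = spawn_worker();
    for msg in ["hello", "boom", "hello again"] {
        if tx.send(msg).is_err() {
            // The Receiver died with the worker, so the old Sender is
            // useless; this is the renegotiation point.
            println!("channel dead, respawning worker");
            tx = spawn_worker();
            tx.send(msg).unwrap();
        }
        // Crude pacing so a panic lands before the next send; real code
        // would use acks instead.
        thread::sleep(Duration::from_millis(20));
    }
}
```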

Take Servo as an example. In particular, it has a component called the Constellation. This is a good example of the complexity you have to introduce into a Rust program if you want a multi-threaded app that is resilient to individual components crashing. It is a central place, "Servo's Grand Central Station", where most channels are anchored; therefore, if some component crashes, it can restart and request a fresh tx/rx channel without much fuss. It's also got logic for spinning off threads and "supervising" them. (I am prepared to stand corrected here, as I am no Servo expert.)

Servo would benefit a lot if there were better building blocks for these types of concerns.

So you want supervisor trees in std, or to be allowed to use catch_unwind outside of the FFI boundary?

The documentation for catch_unwind in std::panic is not fully clear about this, but FFI is one example where exception safety must be established; there can be others. Unsurprisingly, futures-rs has a panic-safe polling method: http://alexcrichton.com/futures-rs/futures/trait.Future.html#method.catch_unwind

So, if you base your server on futures-rs, you can base your server on futures that catch panics.

catch_unwind is just not meant as a general error-communication mechanism.

This is one of the reasons I had a bad feeling about adopting the current channel implementation: people might complain that their specific flavour of channels isn't there. Yes, this is a library concern, and Rust provides your primitive: Send, which is a much stronger guarantee than many languages provide.

Servo is generally a bad example. It predates Rust 1.0 and is a huge application. Servo code might be a driver for future development of the Rust language, but it can also often be a remnant of times when approaches were still being researched.

I'm sure that everyone at Servo would be happy about a rearchitecting, if this is a pain point.

Maybe @Manishearth has opinions.

I would like a catch_unwind-like thing that I can use to build a supervision tree out of. So if Rust provided a few things, I could then build a task supervision library. Just stopping the unwind is step one. Next, I need to identify what happened (catch_unwind helps here because its Err case carries the panic payload, but I would like to better understand what the guarantees around destructors are). The next step is to propagate the failure to whichever part of the system needs to know. And then I need to recover. I have tried to build such a library based on Rust as it is today, but it is not worth the effort, and I also can't build many_to_one supervisors; I can only build simple_one_to_one supervisors. If you are not familiar with those terms, it's not important. The point is that it isn't really possible at the moment to build a supervision tree library.
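To illustrate the first three steps (stop the unwind, identify the failure, escalate when local recovery is exhausted), here is a rough sketch; supervise and its shape are my invention, not an existing API, and it glosses over the destructor question entirely:

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

// Restart `child` whenever it panics, up to `max_restarts` times, then
// escalate by panicking ourselves so an enclosing supervise() call can
// take over. This is roughly a single-child restart strategy.
fn supervise<F: Fn()>(child: F, max_restarts: usize) {
    for attempt in 1..=max_restarts {
        // AssertUnwindSafe is a promise that `child` leaves no broken
        // state behind; a real supervisor would restart from a clean
        // initial state instead.
        match catch_unwind(AssertUnwindSafe(&child)) {
            Ok(()) => return, // child finished normally
            Err(payload) => {
                // Identify what happened: panic payloads are usually
                // &str or String.
                let msg = payload
                    .downcast_ref::<&str>()
                    .map(|s| s.to_string())
                    .or_else(|| payload.downcast_ref::<String>().cloned())
                    .unwrap_or_else(|| "unknown panic".into());
                eprintln!("child failed (attempt {}): {}", attempt, msg);
            }
        }
    }
    // Local recovery failed: propagate the fault up the tree.
    panic!("child exceeded {} restarts", max_restarts);
}

fn main() {
    supervise(|| println!("doing work"), 3);
}
```

A one_for_all or rest_for_one version additionally needs each child's identity attached to its failure so the supervisor knows what else to restart, which is exactly where the bookkeeping starts to pile up.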

Picking on Servo and the Constellation is definitely unfair :slight_smile: It's just an aside.

The point is that it isn't really possible at the moment to build a supervision tree library

I'm curious. What keeps you from building one?


I do understand those terms. But you still haven't given us a concrete thing that's missing for it. (TBH, I don't understand what a many_to_one supervisor is; I assume you mean one of one_for_all or rest_for_one.)

What I don't understand is what holds you back from writing a supervisor tree. Rust is capable of building message-passing systems, localising panics, and avoiding crashes in these cases.

Can you show any of your current attempts as illustrations?

I'm sorry to be so frank, but you are giving a lot of comparisons you later backtrack on and not a lot of concrete missing things.

Let me acknowledge and take responsibility for the tone I have set in this thread. I made a mistake and will learn from it.

There is a lot of Erlang and other server-related business out there that values ideas aligned with Rust's core concepts around safety, performance, productivity, etc. Rust wants that business (the community's focus for 2017 appears to include "The Server"), and so it has to prove to companies that it can extend the concepts of safety, reliability, productivity, and performance past the compiler boundary and into the operations space.

I believe this starts with the core language team (or a team very close to core) setting the mantra by creating a story for how this is to be achieved in a way that works not only for small hacker-like companies but for big, huge, gross ones (and all their bad dev practices) too. Rust is more than capable of doing this. The attempts I have made with channels, threads, and catch_panic work, but they introduce more complexity than the complexity they are designed to remove, so I am not happy with the current story in Rust.

I have no doubt people are working on this (I acknowledge the amazing design decision of futures to be poll-based rather than push-based and how this is going to contribute to making Rust servers amazing), so perhaps I am either too early to the party, or I have been spoilt by the Erlang mantra to such a degree that I should stop trying to recreate it, or marketing/community is just focusing on other things right now... or I am just Wrong<T>


Thanks for that post :). I do also believe that this is a fundamental topic for 2017 and that the discussion just went off the rails - although my opinion is that it sits squarely outside of std. Let's hit reset.

So, maybe it would be helpful if we started asking questions about your concerns. You are certainly not Wrong<T>; you obviously tried something and ended up frustrated, and we are having a hard time following. That's not what we want.

My first questions are:

  • Would it help you if this issue were moved forward into a proper, usable ecosystem (i.e. having a properly implemented actor system around)? https://github.com/rust-lang/rfcs/issues/613
  • Were you held up when implementing process-like patterns, or when implementing the connections between them? Or were both an issue?
  • Is there any practical project you would like to use these things in? Maybe that could be a venue to work on these problems together.
  • Do you think we have a documentation/tutorial problem around the topic of building servers in Rust? Do you have the impression that you had to do a lot of research instead of just being able to read about existing solutions?

For the purposes of this discussion I am staying away from actor systems. I see actor systems as opt-in: hypothetically speaking, if you thought a Mio-based solution had too much latency jitter, you could switch to an actor model where you have more control over the scheduler. But systems concepts (system in the cybernetics/server/architectural sense, not low-level systems) are generic enough to apply to anything with multiple components.

So no. I would like a way to build any Rust binary and have at my disposal enough building blocks to build a system that is reliable in the face of faults. Rust already goes a huge way in this direction by a) removing multiple classes of bugs, b) valuing simplicity, c) providing types like Option and Result, and d) offering saf(er) shared state if you want it.
But the story stops at a panic. It's up to the dev to build a system around it to recover, or, as most projects appear to do, to favour the hard reset (failing fast is not a bad choice, but currently it feels like the only one).

I think the issue was all the bookkeeping around the connections between threads. I used a worker thread that wrapped a function in catch_unwind and then sent a message up the supervision tree via a channel. This is quite crude. It also very quickly becomes a distributed-systems problem, and I have to start worrying about consistency. Such worries are inevitable, but I felt dirty solving these problems with channels. I see a way of improving this for users of the Tokio project: its poll semantics give me a point in the flow where I can inject a control plane, i.e. the top-level event loop first checks for control messages from the supervisor before polling the application stream (sketched below).
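In today's std::future vocabulary (this discussion predates async/await, so read it as a back-ported sketch rather than what was available at the time), that control-plane injection might look like this:

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::mpsc::{Receiver, TryRecvError};
use std::task::{Context, Poll};

enum Control {
    Shutdown,
}

// Every poll services the supervisor's control channel before driving
// the application future.
struct Supervised<F> {
    control: Receiver<Control>,
    inner: Pin<Box<F>>,
}

impl<F: Future> Future for Supervised<F> {
    // None means the supervisor stopped us before completion.
    type Output = Option<F::Output>;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        match self.control.try_recv() {
            Ok(Control::Shutdown) => return Poll::Ready(None),
            Err(TryRecvError::Disconnected) => return Poll::Ready(None), // supervisor gone
            Err(TryRecvError::Empty) => {}
        }
        // Caveat: a std mpsc channel cannot wake this task when a control
        // message arrives; a real implementation needs a waker-aware
        // channel from an async runtime.
        self.inner.as_mut().poll(cx).map(Some)
    }
}
```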

I would be wary of any attempt at process-like patterns outside an actor model. It should be about patterns of threads and a hierarchy of tasks of work. (Some might define this as a process pattern, and I am happy to do so, but I don't want to be accused of trying to reintroduce actors.)

I will see if I can open-source parts of the work I am doing and get back to you on this. In the meantime, Tokio seems like the right practical project to focus on. My problem here is that any solution will be specific to a futures-based system.

Rust's language documentation gets top marks from me :slight_smile: I enjoy the community's insistence on good docs. Now, with Tokio, anyone has a template for doing server systems. If anyone needs something different, then I am going to assume they know what they are doing and don't need any documentation from the Rust team above and beyond the language docs and the Rust Book.

A question for you:

In your opinion, what are the pain points for building distributed systems (or just servers, to stay on track) in Rust?
In your opinion, what are the pain points for operating Rust in production?


When I started developing a web service with nickel, I was very impressed that if I caused a panic in a request handler, the server would actually send a 500 Internal Server Error reply on that request and carry on handling other requests.

That's pretty good resilience as far as I'm concerned, even if it isn't the guarantees you are asking for.

For me, these are the disconnects I'm experiencing in this thread:

  • It's very unclear what features you feel are lacking. I can't really find a specific proposal in your posts.
  • You do seem to be suggesting these features need to be language features rather than libraries. But I am very skeptical of that. Rust the language is extremely small; everything is implemented through libraries.

The important thing is that the components you want to see are reusable, rather than having to be re-implemented for each project. I don't see any reason that requires language support, rather than some kind of library. I think something that is a part of tokio or built on top of it is the right place to solve this problem.


I think abort-on-OOM is a concrete example of where Rust struggles on the server. We don't have any way for developers to know which functions may potentially allocate. Tagging (perhaps via a compiler plugin) the methods that involve a potential allocation would go a very long way.

I also haven't found a comprehensive list of 'reasons Rust will abort()'. A complete list might help me sleep better at night, if someone knows of such a thing.

I'm planning to add alloc-tagging functionality to metacollect, but I have a very different project keeping me busy these days.
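Not the compile-time tagging asked for above, but as a stopgap you can at least observe allocations at runtime by wrapping the system allocator (a sketch using GlobalAlloc, which landed in std well after this discussion):

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

static ALLOC_COUNT: AtomicUsize = AtomicUsize::new(0);

// Delegates to the system allocator, counting every allocation.
struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOC_COUNT.fetch_add(1, Ordering::Relaxed);
        System.alloc(layout)
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        System.dealloc(ptr, layout)
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

fn main() {
    let before = ALLOC_COUNT.load(Ordering::Relaxed);
    let v: Vec<u8> = Vec::with_capacity(64); // one heap allocation
    let after = ALLOC_COUNT.load(Ordering::Relaxed);
    println!("allocations on this path: {} (capacity {})", after - before, v.capacity());
}
```

This tells you whether a code path allocated, not which function will; the static tagging idea would still need compiler support.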


The constellation is more about coordination than about resilience. The fact that it helps us be resilient came later, after we made Servo multiprocess and subsequently introduced panic catching.

I don't think we would benefit from building blocks, the constellation is pretty custom. There is some rearchitecting of the constellation going on, but I don't think it's in the direction specified here.