Moving a running server from one machine to another

Followup to X86_64 moving running Rust prog from machine A to machine B - #5 by Michael-F-Bryan

Preconditions:

  • both machines are x86_64 linux with identical hardware/software
  • open tcp connection to server
  • no open files

Goal: moving a running server from one machine to another

===================================

In the original question, the concern was "how to move an arbitrary Rust server." Suppose we allow additional constraints (perhaps even using a different language): are there easier/cleaner solutions?

One example: wasm32 constraint.

If we impose some additional conditions, e.g. the server compiles to wasm32 and we run it in a wasmtime/wasmer container, then moving it is a bit easier, as we can start with just "freezing" the wasm32 runtime, copying over the bytes, and "resuming" the wasm32 runtime.

Would other runtimes, like the JVM, Go, or Erlang, make it easier to move a server from one machine to another?


I mean, Erlang has first-class built-in support for hot code loading (replacing code modules at runtime), location transparency (sending messages to a local process looks the same as sending to a remote process), a shared-nothing architecture (no shared or global state, with a few caveats), and other features which would make "moving a running server" without downtime using just the language itself fairly standard and boring, depending on how the application was architected. I think the trickiest part would be replicating any state and orchestrating the switchover once replication had sufficiently caught up.

I'm not sure exactly what you mean by "moving a running server" though. Many servers, especially front ends, these days are designed to be stateless so they can just be replaced behind a load balancer instead of being moved / migrated. For service state, usually in a database, existing replication solutions (which are already necessary for data durability) usually solve the server replacement problem as well: add a new server to replicate to, wait until fully replicated, fail over to the new server, and remove the old server.
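To make that sequence concrete, here is a minimal in-memory sketch in Rust. Everything in it is hypothetical: `Node`, `Cluster`, and the append-only log are stand-ins for a real database and its replication protocol, and the "copy" is just a slice copy rather than anything over a network. It only illustrates the ordering: add a replica, replicate until caught up, and only then fail over.

```rust
/// Hypothetical server state: an append-only log of writes.
#[derive(Default, Debug)]
struct Node {
    log: Vec<String>,
}

/// Hypothetical cluster: one primary and at most one replica being brought up.
#[derive(Default, Debug)]
struct Cluster {
    primary: Node,
    replica: Option<Node>,
}

impl Cluster {
    fn new() -> Self {
        Cluster::default()
    }

    /// Writes always go to the current primary.
    fn write(&mut self, entry: &str) {
        self.primary.log.push(entry.to_string());
    }

    /// Step 1: add a new, empty server to replicate to.
    fn add_replica(&mut self) {
        self.replica = Some(Node::default());
    }

    /// Step 2: copy up to `batch` missing entries to the replica.
    /// Returns true once the replica has every entry the primary has.
    fn replicate(&mut self, batch: usize) -> bool {
        let Some(replica) = self.replica.as_mut() else {
            return false;
        };
        let start = replica.log.len();
        let end = (start + batch).min(self.primary.log.len());
        replica.log.extend_from_slice(&self.primary.log[start..end]);
        replica.log.len() == self.primary.log.len()
    }

    /// Steps 3 and 4: promote the replica and drop the old primary,
    /// but only if the replica is fully caught up.
    fn fail_over(&mut self) -> bool {
        match self.replica.take() {
            Some(r) if r.log.len() == self.primary.log.len() => {
                self.primary = r;
                true
            }
            other => {
                // Not caught up (or no replica): leave everything as it was.
                self.replica = other;
                false
            }
        }
    }
}
```

In a real system, writes keep arriving during replication, so the "caught up" check and the switchover would have to be coordinated (for example by briefly pausing writes); the sketch ignores that.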

So I think that to reason about this hypothetical question, it has to be more specific and fully elaborated.


Valid point. Imagine something like an IRC server, but

  1. much higher bandwidth
  2. people join / leave channels much more frequently

Each channel is mapped to a physical EC2 machine.

Say #foo and #bar are both mapped to EC2-machine-1.

If #foo and #bar suddenly both blow up, I want to move one of the channels to EC2-machine-2.

This indeed sounds like a typical distributed systems problem to me, in which case there's no need to pull tricks like freezing a server process.

One of the great things about Erlang is you can be lazy and not write your server migration code until you need it. I'd probably just modify the module responsible for routing messages for a channel to also be able to forward messages between servers, hot load that code, and use the remote Erlang shell to tell server A to start forwarding messages for #foo to server B.
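For comparison, the forwarding idea itself is small in any language. Below is a rough Rust sketch of a per-server routing table that can be told to forward a channel elsewhere. The names (`Route`, `Router`, `forward_to`) and the address string are made up for illustration, and the sketch only decides where a message should go; it does not do any networking or hot loading.

```rust
use std::collections::HashMap;

/// Where messages for a channel should go (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
enum Route {
    /// The channel is handled by this server.
    Local,
    /// The channel has moved; forward to the server at this address.
    Forward(String),
}

/// Per-server routing table: channel name -> route.
#[derive(Default)]
struct Router {
    routes: HashMap<String, Route>,
}

impl Router {
    /// Start handling a channel on this server.
    fn serve_locally(&mut self, channel: &str) {
        self.routes.insert(channel.to_string(), Route::Local);
    }

    /// Tell this server to forward a channel to `target` from now on.
    /// Returns the previous route, if the channel was known.
    fn forward_to(&mut self, channel: &str, target: &str) -> Option<Route> {
        self.routes
            .insert(channel.to_string(), Route::Forward(target.to_string()))
    }

    /// Look up what to do with an incoming message for `channel`.
    fn route(&self, channel: &str) -> Option<&Route> {
        self.routes.get(channel)
    }
}
```

Telling server A to forward #foo to server B then amounts to something like `router.forward_to("#foo", "server-b:6667")` on server A, while #bar keeps being served locally.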

If this were Rust or most other languages, you'd have to think of that situation and write the migration code ahead of time, and definitely run tests or drills to gain confidence that the migration will work when the time comes.

That said, I'd probably actually use a different system architecture for the non-Erlang servers. I'd try to make the servers as "disposable" as possible, by e.g. allowing clients to connect to any of N servers for a channel (either through DNS round robin or a load balancer), and have servers discover each other (e.g. by registering the channels they handle in a shared Redis instance on a separate server or something) and forward messages to all other servers handling that channel. That way, we can add and remove servers for a channel fairly routinely, and also add redundancy for the case where a server goes down.
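A rough Rust sketch of the discovery part might look like this. The `Registry` type is a hypothetical in-memory stand-in for the shared store (Redis in the description above); a real version would make network calls and handle servers that die without deregistering. It shows registering servers per channel and computing which peers a server should forward a message to.

```rust
use std::collections::{BTreeSet, HashMap};

/// In-memory stand-in for the shared registry (hypothetical).
/// Maps each channel to the set of servers currently handling it.
#[derive(Default)]
struct Registry {
    servers_by_channel: HashMap<String, BTreeSet<String>>,
}

impl Registry {
    /// A server announces that it handles `channel`.
    fn register(&mut self, channel: &str, server: &str) {
        self.servers_by_channel
            .entry(channel.to_string())
            .or_default()
            .insert(server.to_string());
    }

    /// A server stops handling `channel`, e.g. before being removed.
    fn deregister(&mut self, channel: &str, server: &str) {
        let now_empty = match self.servers_by_channel.get_mut(channel) {
            Some(servers) => {
                servers.remove(server);
                servers.is_empty()
            }
            None => false,
        };
        if now_empty {
            // Drop channels that no server handles any more.
            self.servers_by_channel.remove(channel);
        }
    }

    /// All other servers that `from` should forward a `channel` message to,
    /// in sorted order.
    fn peers(&self, channel: &str, from: &str) -> Vec<String> {
        self.servers_by_channel
            .get(channel)
            .map(|servers| {
                servers
                    .iter()
                    .filter(|s| s.as_str() != from)
                    .cloned()
                    .collect()
            })
            .unwrap_or_default()
    }
}
```

Adding capacity for a hot channel is then just another `register` call, and draining a server is a `deregister` followed by disconnecting its clients so they reconnect elsewhere.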

Can you recommend an Erlang/Elixir book chapter for showing these techniques?

The officially endorsed way to do this is through what's called OTP releases. You create a new "release" of your app, define the state transitions between releases, and deploy the new release to a running server. The book Learn You Some Erlang describes this. OTP releases are, however, pretty complex, as they were designed to be deterministic and extensively tested for critical telecom infrastructure in a large organization (Ericsson).

I personally prefer, and have more experience with, hot loading code modules manually, which simply means calling l(Module). in a remote shell. It's vastly simpler and gives you more control, but I haven't seen a book describing that technique.
