Both Rust crates & Algorithms / Theory papers welcome.
Imagine we have a cluster computers; where computers are regularly being added and removed (crash) from the cluster.
Suppose we care about two ops:
Set(u64, V) => ()
Get(u64) -> Option
When we have n machines, we want the space of u64 to be split over the n machines.
When a new machine is added, we want it to "steal portion of the keyspace" from existing machines.
When a machine crashes, we want to "redistributed key space (of crashed machine)" to the running machines. During the crash, we lose the V's, so we end up temporarily returning None's until a Set happens.
Question: what algorithms solve this problem? what Rust crates exist for this ?
If I understand your requirements correctly you want to "shard" data over many machines. Because of its shear size or for performance. You also want it to be fault tolerant.
For the fault tolerant part you probably want to look into Lamport's "paxos" algorithm or new "raft" algorithms:
As you say, as far as I can tell the nodes in the cluster need to achieve consensus.
Having kept half an eye on this kind of thing since discovering the Byzantine Generals paper many years ago I'm not aware of any better solutions to arriving at consensus than Raft/Paxos.
If they are not the right approach I'd love to know what might be.
Still, I'm sure there are many smart people who have been thinking about the consensus problem over the years and I'm not aware of their work.
This is the same boat I am in. This sounds like a sufficiently general problem that I am confident the 'right solution' exists out there. I am just not sure what it is called.
No, I have not read any of those papers. I have read Distributed hash table - Wikipedia a few times and tried to find a textbook on Amazon (to no avail).
Is there a consensus yet on which algorithm is best, or is there a number of different algorithms, depending on specific scenarios ?
I have no idea, my knowledge of distributed systems ends about where @ZiCog's does :) But you can almost certainly read and understand many of the academic papers about this problem without reading a whole textbook first, CS papers tend to be pretty friendly that way (much more so than e.g. pure math papers).
Two papers I would start with are Chord, which introduced the use of consistent hashing for distributed key-value storage, and Dynamo, a practical implementation of similar architectural ideas, used for some of Amazon's early distributed systems.