Distributed, "pairwise transactional", HashMap<u128, u128> in Rust?

We define 'pairwise transactional' for a HashMap<u128, u128> as follows:


pub struct Delta {
  key: u128,
  old_value: u128,
  new_value: u128
}

impl HashMap<u128, u128> {
  pub fn pairwise_transaction(&mut self, d1: Delta, d2: Delta ) -> Result<...> {
    if self.get(d1.key) != Some(d1.old_value) || self.get(d2.key) != Some(d2.old_value) {
    fail;
    }

    self.insert(d1.key, d1.new_value);
    self.insert(d2.key, d2.new_value);
  }

  pub fn read_keys(keys: Vec<u128>) -> Vec<u128> {
    input arg u128 is key; output u128 is value
  }

}

Here, "transactional" refers to: for any read of keys, it returns a consistent view, i.e. not halfway through a update. In the above case, the Rust borrow checker ensures that when a write occurs, we have a &mut self, which means there can not be another thread doing a read (needs a &self).

Question: Is there a easy way to make this work in a 'distributed' manner (i.e. the HashMap maintanied by multiple Rust programs running on multiple machines), or is this something that requires using databases for ?

I am not sure what to make of this statement. Rust is a programming language. You can implement anything with it, if you know how to. You can make a distributed, transactional database in Rust.

The general answer to this question is "No, because of the CAP theorem - Wikipedia".

That hasn't stopped people from trying, but I don't know a single one that has worked well. See, for example, .NET Remoting, which is on the list of .NET Framework technologies unavailable on .NET 6+ | Microsoft Learn (also Breaking change: Remoting APIs are obsolete - .NET | Microsoft Learn ).

Hiding the distributed-ness of something is also mistake #3 in this article from way back in 2000: Three Wrong Ideas From Computer Science – Joel on Software

3 Likes

@RedDocMD @scottmcm : There is an unintended confusion here. Let me rephrase this as:

I know this problem can be solved, i.e. just run distributed Postgres / Cockroachdb on a cluster.

My question is: is there a light weight way do to this (perhaps as a rust crate) or does solving this require a distributed transactional database.

Your problem is underspecified to be able to answer this question. What do you expect to happen, for instance, if node A attempts to insert (1,2) concurrently with node B inserting (1,3)? What behavior do you need in the face of a network partition? How much communication latency will exist between the nodes (typical and worst-case)? What read and write loads do you expect? Etc...

1 Like

If by 'insert', you mean

  • A does pairwise_transaction with keys 1, 2
  • B does pairwise_transaction with keys 1, 3

then I expect one to succeed, and one to fail. Which is fine, the client has the responsibility to retry.

Assume similar setup / expectations as a Postgres cluster running within a single AWS datacenter. (But hopefully higher throughput / possibly lower latency, as we are dealing with a 'more restricted' problem.)

The simplest solution here is to have a master process somewhere that serializes all updates, and either responds to queries or broadcasts accepted changes; that can be either Postgres or something bespoke. I doubt that this is what you have in mind, as most people wouldn't describe this as "distributed."

Dealing with distributed algorithms is generally a pain, so I try to avoid them whenever I can. What benefit do you hope to gain by using something "distributed" instead of having a single point of truth?

1 Like

100% agree.

Throughput, latency. A bunch of XY-problems / reductions ended up converging on the above primitive -- which is fairly well known, it is basically Double compare-and-swap - Wikipedia .

So the question can also be rephrased as: is there a rust crate for doing high performance double-compare-and-swap over multiple machines.

No, there is not.

1 Like

Is this so easy that it is not worth putting in a crate, or so messy that no one has figured out a nice way to do it ?

The latter. There are a lot of questions that will have answers specific to the particular application, like:

  • How do you identify and find the cluster of machines?
  • How many connections should each machine maintain, and to which others?
  • How do processes join and leave the cluster gracefully?
  • What happens when a process stops responding? (This could indicate either a node failure or a communication failure, with no way to tell between them)

Because of this, solving the problem in a general-purpose library is significantly more complicated than for a single application.

2 Likes

Well, I don't actually think anyone has tried, but even then, I don't think there's any known good way to do it.

2 Likes

(post deleted by author)

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.