x86_64 Linux: cas4 of AtomicU64

Assume:

  1. we are on x86_64 Linux
  2. single machine, multiple CPUs, multiple cores per CPU
  3. we have a big shared Vec<AtomicU64> of data

use std::sync::atomic::{AtomicU64, Ordering};

pub struct EntryRequest {
  index: usize,
  old: u64,
  new: u64,
}

impl EntryRequest {
  // Check whether the entry at `index` still holds the expected old value.
  pub fn can_apply(&self, data: &[AtomicU64]) -> bool {
    data[self.index].load(Ordering::Relaxed) == self.old
  }

  // Overwrite the entry at `index` with the new value.
  pub fn apply(&self, data: &[AtomicU64]) {
    data[self.index].store(self.new, Ordering::Relaxed);
  }
}

pub fn single_thread_process_request(
  data: &[AtomicU64],
  r: (EntryRequest, EntryRequest, EntryRequest, EntryRequest),
) -> bool {
  if r.0.can_apply(data) && r.1.can_apply(data) && r.2.can_apply(data) && r.3.can_apply(data) {
    r.0.apply(data);
    r.1.apply(data);
    r.2.apply(data);
    r.3.apply(data);
    return true;
  }
  false
}

Now, the above function does not work in a multi-threaded situation due to potential race conditions.

Question: how do we modify this code to work in a multi-threaded setting, with performance in mind?

I.e. for each t in {10 milliseconds, 1 millisecond, 100 microseconds}, conditioned on 99% of requests returning in < t, how do we maximize throughput?

Atomics do not support this. You would need to synchronize their accesses via some other mechanism.

For example, you could use a mutex. There are solutions such as having a separate AtomicU64 with one bit per element (if the length is at most 64) and locking each index separately via that atomic. It may also be possible to use a single bit of each u64 to store a lock on that field, if you never actually need all 64 bits. There are probably other variants too.
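A minimal sketch of that bit-per-element variant, assuming the array has at most 64 entries and a single extra AtomicU64 (here called locks) serves as the lock word; the helper names and the Acquire/Release choices are my own guesses, not an established or benchmarked design:

use std::sync::atomic::{AtomicU64, Ordering};

// Build a mask with one bit per requested index (indices assumed < 64).
fn mask_of(idx: [usize; 4]) -> u64 {
  idx.iter().fold(0u64, |m, &i| m | (1u64 << i))
}

// Try to take the lock bits for all four indices at once.
// The Acquire here pairs with the Release in unlock4, so writes made while
// holding the bits are visible to the next thread that takes them.
fn try_lock4(locks: &AtomicU64, idx: [usize; 4]) -> bool {
  let mask = mask_of(idx);
  let prev = locks.fetch_or(mask, Ordering::Acquire);
  if prev & mask == 0 {
    true // every bit was previously clear: we now own all four
  } else {
    // Someone else holds at least one bit: clear only the bits we newly
    // set and report failure so the caller can retry or back off.
    locks.fetch_and(!(mask & !prev), Ordering::Relaxed);
    false
  }
}

fn unlock4(locks: &AtomicU64, idx: [usize; 4]) {
  locks.fetch_and(!mask_of(idx), Ordering::Release);
}

Between try_lock4 and unlock4 the caller would run the same check-then-apply as the single-threaded version; plain Relaxed loads and stores on data are enough there, since the lock word provides the synchronization.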

I agree.

Perhaps many mutexes; but definitely not a single mutex for the entire array, as that would be unlikely to yield high throughput.
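To make the many-mutexes direction concrete, here is one hedged sketch: a fixed pool of lock stripes where stripe i % STRIPES guards index i, taken in sorted and deduplicated order so two overlapping requests can never deadlock. The stripe count, the Striped type, and the Relaxed orderings under the lock are arbitrary choices for illustration, not a tuned design:

use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

const STRIPES: usize = 1024; // arbitrary; trades memory against contention

pub struct Striped {
  data: Vec<AtomicU64>,
  locks: Vec<Mutex<()>>, // locks[i % STRIPES] guards data[i]
}

impl Striped {
  pub fn new(len: usize) -> Self {
    Striped {
      data: (0..len).map(|_| AtomicU64::new(0)).collect(),
      locks: (0..STRIPES).map(|_| Mutex::new(())).collect(),
    }
  }

  pub fn process(&self, r: &[EntryRequest; 4]) -> bool {
    // Collect the stripes covering all four indices; sort and dedup so any
    // two threads acquire overlapping stripes in the same global order.
    let mut stripes: Vec<usize> = r.iter().map(|e| e.index % STRIPES).collect();
    stripes.sort_unstable();
    stripes.dedup();
    let _guards: Vec<_> = stripes.iter().map(|&s| self.locks[s].lock().unwrap()).collect();

    // With the stripes held, the naive check-then-apply is race-free.
    let ok = r.iter().all(|e| self.data[e.index].load(Ordering::Relaxed) == e.old);
    if ok {
      for e in r {
        self.data[e.index].store(e.new, Ordering::Relaxed);
      }
    }
    ok // the guards drop here, releasing the stripes
  }
}

Whether std::sync::Mutex, a spinlock, or the bit-per-element scheme above wins for this workload would need benchmarking.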

I agree with the gist of this; but the devil is in the details (for example: what type of Ordering to use?). I am hoping someone else has solved this problem.

If you pick a specific implementation strategy, we could discuss how to implement it and what orderings to use. I do not know which one would maximize throughput; that would require some benchmarking to determine. I'm not familiar with any crates for this particular problem.

I don't have any that I am confident in. When I first wrote the question, the reasoning was:

cas-N is a common problem
cas-4 atomicu64 is a very special case of cas-N

therefore there should probably already be a well-established 'known solution' to this problem

It seems like you're looking for Software Transactional Memory (STM).

Yes, in that STM definitely solves this problem; but STM is also overkill here. In particular, the problem described above can be used as a building block to implement STM.

There may very well be some established known solution, but I don't know of it.

To maximize throughput, you need to shrink the distance between the can_apply and the apply. Even if you could do this atomically, you would be taking a pretty big hit in performance if these process requests had too much contention (overlapping indices), as every core keeps getting its cache pulled out from under it.

I would be looking for ways to partition your incoming process requests so you can apply the dependent ones on the same core if possible, but it seems like you're trying to parallelise a fundamentally serial algorithm, so pretty much any approach is going to suck without being able to make assumptions on the shape of the requests coming in.
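One hedged reading of that suggestion in code: split the index space into contiguous shards, give each shard its own worker thread and queue, and route a request to a shard only when all four indices fall inside it, so the worker can run the naive check-then-apply with no locking at all. Cross-shard requests need some separate slow path (not shown), and SHARDS, SHARD_LEN, and the use of mpsc channels are assumptions made purely for illustration:

use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::mpsc::{channel, Sender};
use std::sync::Arc;
use std::thread;

const SHARDS: usize = 8;          // e.g. one worker per core
const SHARD_LEN: usize = 1 << 20; // shard s owns indices [s * SHARD_LEN, (s + 1) * SHARD_LEN)

// Spawn one worker per shard. Each worker is the only thread that ever
// touches its shard's indices, so its check-then-apply is race-free.
fn spawn_workers(data: Arc<Vec<AtomicU64>>) -> Vec<Sender<[EntryRequest; 4]>> {
  (0..SHARDS)
    .map(|_| {
      let (tx, rx) = channel::<[EntryRequest; 4]>();
      let data = Arc::clone(&data);
      thread::spawn(move || {
        for r in rx {
          if r.iter().all(|e| data[e.index].load(Ordering::Relaxed) == e.old) {
            for e in &r {
              data[e.index].store(e.new, Ordering::Relaxed);
            }
          }
        }
      });
      tx
    })
    .collect()
}

// Send a request to its shard if all four indices share one; otherwise hand
// it back to the caller for some slower cross-shard path.
fn route(txs: &[Sender<[EntryRequest; 4]>], r: [EntryRequest; 4]) -> Result<(), [EntryRequest; 4]> {
  let shard = r[0].index / SHARD_LEN;
  if r.iter().all(|e| e.index / SHARD_LEN == shard) {
    txs[shard].send(r).expect("worker thread alive");
    Ok(())
  } else {
    Err(r)
  }
}

A real version would also need a way to report success or failure back to the submitter (e.g. a reply channel per request), and this sketch assumes data.len() is at most SHARDS * SHARD_LEN.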

Imagine we are doing a dumb implementation of STM. It probably goes something like this:

let X = addrs this transaction reads from
let Y = addrs this transaction writes to
on commit:
  atomically {
    if nothing in X changed
    then write all values in Y
  }

This is a problem where we have no control over the sets X/Y -- the programmer controls them -- and it is at the very heart of STM. 'cas4' described above can be viewed as (1) a very specialized case of this problem and (2) a fundamental-ish building block: most STM impls probably use something similar in principle to cas4.
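To make that connection concrete, here is a hedged sketch of the commit protocol specialized to cas4, using the "steal one bit of each u64 as a lock" variant mentioned earlier. The name cas4, the LOCK_BIT constant, and the orderings are my own assumptions rather than an established solution; it also assumes stored values never use the top bit and that the four indices are distinct:

use std::sync::atomic::{AtomicU64, Ordering};

const LOCK_BIT: u64 = 1 << 63; // assumes real values never use the top bit

pub fn cas4(data: &[AtomicU64], r: &[EntryRequest; 4]) -> bool {
  // Phase 1: for each entry, validate the old value and take its lock in a
  // single compare_exchange. Acquire pairs with the Release stores below,
  // so a committed transaction is fully visible to the next one.
  for (k, e) in r.iter().enumerate() {
    let locked = data[e.index].compare_exchange(
      e.old,
      e.old | LOCK_BIT,
      Ordering::Acquire,
      Ordering::Relaxed,
    );
    if locked.is_err() {
      // The value changed, or another thread holds this entry's lock bit:
      // roll back the entries we already locked by restoring their olds.
      for prev in &r[..k] {
        data[prev.index].store(prev.old, Ordering::Release);
      }
      return false; // caller retries; a transiently locked entry also lands here
    }
  }
  // Phase 2: everything in X (the four olds) matched and is locked, so
  // write everything in Y (the four news), which also clears the lock bits.
  for e in r {
    data[e.index].store(e.new, Ordering::Release);
  }
  true
}

This mirrors the pseudocode above: phase 1 is "if nothing in X changed" plus taking ownership, and phase 2 is "write all values in Y". Whether it beats the mutex or sharding variants for the latency targets in the question is exactly the kind of thing that would need benchmarking.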