Do you think that atomic operations are not 100% lock-free?

I do not have a formal CS education. My approach to learning CS is to start with the fundamentals and understand a topic well enough to make some assumptions and discuss them with other enthusiastic people. So the title of this post is my assumption.

Everyone will agree that an atomic operation is either fully completed or has not yet happened. These are the two states of an atomic operation from the perspective of outside observers while the program is running. These states are made possible by special instructions and hardware design.

Now imagine we have an atomic variable called DATA, and hundreds of threads are writing to DATA concurrently. Because of the atomicity, or indivisibility, of atomic operations, no thread will ever find DATA half-written by another thread.

If DATA is being written by one thread and, at the same time, another thread wants to write to it too, what happens? Does the CPU block the other thread's attempt, or does that other thread find the old value of DATA?

I know the atomic data types in Rust are not generic. They are primitive data types, sized and aligned to match the CPU architecture, so operations on them are very fast. As a result, a thread rarely catches another thread in the middle of writing to DATA. Theoretically, though, it is possible that while one atomic operation is executing, another thread wants to access that atomic variable, and it seems interesting to know how the CPU handles this situation.
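To make the scenario concrete, here is a minimal sketch of what I mean, assuming an AtomicU64 for DATA (the name and the thread count are just illustrative):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

// The DATA from the question, as a primitive, hardware-supported atomic width.
static DATA: AtomicU64 = AtomicU64::new(0);

fn main() {
    let handles: Vec<_> = (0..100u64)
        .map(|i| {
            thread::spawn(move || {
                // Each store is indivisible: a reader sees either the old value
                // or the new one, never a half-written mix of bytes.
                DATA.store(i, Ordering::Relaxed);
                DATA.load(Ordering::Relaxed)
            })
        })
        .collect();

    for handle in handles {
        // Every observed value is some value that was actually stored.
        let seen = handle.join().unwrap();
        assert!(seen < 100);
    }
}
```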

More assumptions on CPU blocking: blocking closely resembles locking and unlocking, in a loose sense. Hence my intuition (whether it is correct or not) is that atomics are not 100% lock-free.

More assumptions on finding the old value: if another thread finds the old value of DATA, that is problematic for the program's correctness. So we use stronger memory orderings so that the result of an atomic operation is observed by other threads immediately, which costs a little performance.

1 Like

This will be highly CPU-specific, but the way this is implemented at the hardware level is indeed with something resembling locks, often a lock on a memory location or on a cache line.

However, I suspect people don't use the word "lock" when referring to this, because locking is often associated with deadlocks, and the overhead can be negligible if you don't have a lot of contention.

7 Likes

No, "fast" is not the reason. "Possible" is. You can't have an atomic for which there's no hardware support. A 64-bit architecture that gives you 64-bit atomic writes won't make it possible to implement 128-bit atomic writes; you'll need additional synchronization in software (i.e., a mutex) for that.
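As a rough illustration on a typical 64-bit target (a sketch, assuming a recent toolchain where `Mutex::new` is `const`; the names are made up):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

// 64-bit writes: the hardware can do these atomically, so std exposes AtomicU64.
static COUNTER: AtomicU64 = AtomicU64::new(0);

// 128-bit value: no stable 128-bit atomic type in std, so we fall back to
// software synchronization with a mutex.
static WIDE: Mutex<u128> = Mutex::new(0);

fn update() {
    COUNTER.store(1, Ordering::Relaxed); // a single atomic store
    *WIDE.lock().unwrap() = 1;           // lock, write, unlock
}
```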

You are confusing definitions and domains. The point of "lock-free" code and data structures is that they use a few atomics directly, orchestrating reads and writes in a clever way, so they can therefore be faster than a full-blown mutex protecting an entire data structure. (E.g., a concurrent hash map can update and read any number of keys at once, truly in parallel, as long as they are disjoint. A regular hash map protected by a mutex can't.)
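Here is a minimal sketch of the "disjoint updates" idea, using a fixed array of atomics as a toy stand-in for a concurrent map (SLOTS is just an illustrative name, not a real library type):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

// One atomic slot per "key". Threads touching different slots never wait on
// each other, whereas a Mutex<HashMap<_, _>> would serialize all of them
// behind one lock.
static SLOTS: [AtomicU64; 4] = [
    AtomicU64::new(0),
    AtomicU64::new(0),
    AtomicU64::new(0),
    AtomicU64::new(0),
];

fn main() {
    let handles: Vec<_> = (0..4usize)
        .map(|key| {
            thread::spawn(move || {
                for _ in 0..1_000 {
                    // Disjoint keys: truly parallel, no shared lock.
                    SLOTS[key].fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
    assert!(SLOTS.iter().all(|slot| slot.load(Ordering::Relaxed) == 1_000));
}
```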

This is a loosely-defined term, and is not quantifiable. It makes no sense to ask "how many percent" an operation is "lock-free".

You also appear to be confusing "lock-free" with "free". Are atomics faster than using a full mutex? If used correctly, they can be. Are they free? No, they aren't free. But then, nothing is free.

In any sort of synchronization primitive, you have to have some synchronization (i.e., blocking), somewhere. That's the whole point. Otherwise, it wouldn't be useful/correct.

It looks to me like you are expecting magic. There isn't any magic, sadly.

9 Likes

What you're describing is one form of memory pressure. Typically you hear references to cycles taken, and excessive atomic access can cause the efficiency of threads to decrease substantially. Intel VTune is good at highlighting such stalls.

1 Like

Remember that all of this is done in terms of a model of a system, not real hardware, and we use the facilities provided by real hardware to implement the model.

In the Rust model, if multiple threads try to atomically access the same location at the same time, they take turns doing so; the ordering of the threads accessing that location is random, but they will each get a turn controlling it.
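As a hedged sketch of what "taking turns" means in practice (the thread and iteration counts are arbitrary): the interleaving is unspecified, but every access gets applied.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

static LOCATION: AtomicU64 = AtomicU64::new(0);

fn main() {
    let handles: Vec<_> = (0..8)
        .map(|_| {
            thread::spawn(|| {
                for _ in 0..10_000 {
                    // Each thread gets its turn at the location; no increment
                    // is ever lost, whatever order the turns happen in.
                    LOCATION.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
    // The ordering was arbitrary, but all 80_000 increments landed.
    assert_eq!(LOCATION.load(Ordering::Relaxed), 80_000);
}
```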

Real hardware (CPUs and the like) will provide either load-linked/store-conditional (ARM, RISC-V) or compare-and-swap (x86), and may also provide a set of atomic operations of their own on memory (e.g. fetch-add). Details of how these are implemented would be best found on a hardware-specific forum, not a software forum (there's a decent stack of complexity around details of hardware that simply don't exist in the software view of the world).
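For illustration only, here is roughly how a Rust-level compare-exchange loop maps onto those primitives (`double_saturating` is a made-up example function): on ARM or RISC-V this typically lowers to an LL/SC loop, on x86 to a `lock cmpxchg` loop.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

static VALUE: AtomicU64 = AtomicU64::new(1);

fn double_saturating() -> u64 {
    let mut current = VALUE.load(Ordering::Relaxed);
    loop {
        let new = current.saturating_mul(2);
        // compare_exchange_weak may fail spuriously (matching SC behaviour on
        // LL/SC machines), which is why it lives in a retry loop.
        match VALUE.compare_exchange_weak(current, new, Ordering::Relaxed, Ordering::Relaxed) {
            Ok(_) => return new,
            Err(actual) => current = actual,
        }
    }
}
```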

Rust compilers will use whatever the CPU provides to implement the Rust model on real hardware.

8 Likes

Even cache coherency protocols can look like locking if you examine them deeply, e.g. for non-atomic writes on the same cache line (false sharing).

3 Likes

I think this is the useful context for addressing the OP's query. A 64-bit machine can, generally speaking, write to main RAM in chunks of 64 bits (sometimes more) in one step using a parallel data bus (as long as things are aligned); at no point is the data only partially written (like it would be with a serial bus).

The problem is that reading/writing main RAM is SLOW, so the CPU has multiple layers of cache that work to speed up memory access. The catch is that this means that reads/writes are variable latency depending on cache status.

The result is that if multiple threads are looking at the same region of memory (such as an atomic), latency for memory access gets higher than it would've been in the absence of sharing. Chip designers have done many clever optimizations to reduce the impact of maintaining coherency, but ultimately added latency when shared writes occur is necessary to get the orders of magnitude improvement from CPU cores able to work on disjoint cache lines in true parallel.

Weak memory models in hardware complicate this, but the underlying concept is the same, just extended with eventual consistency concepts, reducing total overhead when utilizing weaker memory orderings.
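As a sketch of "weaker orderings that are still correct" (PAYLOAD and READY are illustrative names), the classic Release/Acquire handoff avoids paying for full sequential consistency:

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::thread;

static PAYLOAD: AtomicU64 = AtomicU64::new(0);
static READY: AtomicBool = AtomicBool::new(false);

fn main() {
    let producer = thread::spawn(|| {
        PAYLOAD.store(42, Ordering::Relaxed);
        // Release pairs with the Acquire load below; SeqCst is not needed.
        READY.store(true, Ordering::Release);
    });

    let consumer = thread::spawn(|| {
        while !READY.load(Ordering::Acquire) {
            std::hint::spin_loop();
        }
        // Once READY is observed as true, the payload write is visible too.
        assert_eq!(PAYLOAD.load(Ordering::Relaxed), 42);
    });

    producer.join().unwrap();
    consumer.join().unwrap();
}
```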


Theoretical: if two cores try to write the same address at the exact same instant, whichever write “wins” is arbitrary. In fact, so long as no other thread has read either written value, which write “wins” remains arbitrary. But as soon as the cache manager acknowledges a write, other cores are going to be told that they need to reload any invalidated cache to see the change, adding latency to the next read.

How this coherency is accomplished doesn't matter unless you're designing/debugging a chip or dealing with false sharing (writes pessimize loads from the entire cache line, not just the bytes written).
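A common mitigation, sketched under the assumption of 64-byte cache lines (the real line size is CPU-specific, and Padded/Counters are made-up names):

```rust
use std::sync::atomic::AtomicU64;

// Force each counter onto its own cache line. Without this, `a` and `b` can
// share a line, so hot writes to one pessimize loads of the other.
#[repr(align(64))] // assumed line size; check your target's documentation
struct Padded(AtomicU64);

struct Counters {
    a: Padded, // written by one thread
    b: Padded, // written by another thread, now on a separate line
}
```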

2 Likes

The term "lock free" means that if the thread scheduler chooses not to execute one thread, the other threads will still keep making progress. It doesn't mean "no delays at all", it just means some progress will be made.

CPUs synchronizing access to memory is below that level. Yes, one CPU writing will delay another CPU reading, but that doesn't make the accesses non-lock-free. CPUs don't go to sleep; they rely on constant communication, so progress will always be made.

Now there is a stronger concept of "wait-free": every thread will make progress in some bounded amount of time, so any delays are limited. It makes sense to talk about this at the hardware level. However, hardware is normally designed to avoid such starvation of an individual CPU: every operation has a guaranteed bounded time, so the operations can be said to be wait-free as well.
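To make the definitions concrete, here is a rough sketch contrasting a lock-free CAS retry loop with a spinlock, which is not lock-free (the function names are made up):

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};

static N: AtomicU64 = AtomicU64::new(0);
static LOCKED: AtomicBool = AtomicBool::new(false);

// Lock-free: if this thread is suspended mid-loop, other threads' CAS
// operations still succeed, so the system as a whole keeps making progress.
fn lock_free_increment() {
    let mut current = N.load(Ordering::Relaxed);
    while let Err(actual) =
        N.compare_exchange_weak(current, current + 1, Ordering::Relaxed, Ordering::Relaxed)
    {
        current = actual;
    }
}

// Not lock-free: if the thread holding LOCKED is suspended by the scheduler,
// every other thread spins without making any progress at all.
fn spinlock_increment() {
    while LOCKED.swap(true, Ordering::Acquire) {
        std::hint::spin_loop();
    }
    N.store(N.load(Ordering::Relaxed) + 1, Ordering::Relaxed);
    LOCKED.store(false, Ordering::Release);
}
```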

7 Likes

Off topic but: Really? Where are these bounds specified?

Any memory access can raise a page fault. As far as I know, on Windows, Linux, etc. it could potentially take forever for the kernel to handle that.

They're specified in your CPU core documentation (sometimes only in the stuff you get under NDA), with the bounds including a term for the OS page fault time; while you're guaranteed progress in the case of no page faults, the guarantee if a page fault is involved is typically only if your OS provides a forward progress guarantee on page faults.

3 Likes

That's irrelevant. An operating system handling a page fault is still executing CPU instructions; the CPU isn't being starved.

2 Likes

And just to give you a taste of what the specifications can give you, the following is paraphrased from a CPU document:

  1. The CPU fetches one cache line of instructions at a time; as long as the load-linked/store-conditional (LL/SC) loop, including the terminating branch, is in a single cache line, you cannot take an instruction fetch fault during the loop.
  2. LL takes 2 cycles if the data is in this core's L1 cache, going up to page fault time plus a full memory access cycle if the data is not cached.
  3. As long as the processor is not interrupted, the processor will execute at least 3 register-only instructions from the same cache line as the LL before the SC while retaining exclusive ownership of the memory location.
  4. SC is guaranteed to succeed if it's in the same cache line as the LL, the instruction after the SC is a branch to the matching LL, and no interrupts have been taken between LL and SC.
  5. If the branch after the SC is in the same cache line, and branches backwards to any LL and is taken four times in a row, then the CPU will defer interrupt handling for 8 cycles after taking the branch.

Between these guarantees, you get the ability to build a forward progress guarantee; you can't have your page table shot down underneath you (forcing a page fault) without taking an interrupt first, and as long as you ensure that your LL/SC loop and your atomic location remain paged in after the LL for at least 5 iterations of the loop (which is a guarantee the OS can provide by guaranteeing that a thread gets to keep the most recently accessed data page and most recently accessed code page in memory after an interrupt), you've got an upper bound on loop time.

9 Likes

Great. Last time I seriously studied instruction set details, they just told you how many clocks they would take.

Glorious old times.

3 Likes

You're talking about atomic writes, not RMW operations. In those cases the ops will "just" bounce the cache lines around, but they won't necessarily block execution of the next instructions on a modern out-of-order CPU, because writes go into a store buffer that is then drained in parallel with them.

It's only the RMW ops that have to wait to make progress.
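A minimal illustration of that distinction, assuming a plain AtomicU64 (the names are made up):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

static DATA: AtomicU64 = AtomicU64::new(0);

fn plain_write() {
    // A plain atomic store can retire into the store buffer; the core keeps
    // executing the following instructions while the cache line is acquired.
    DATA.store(42, Ordering::Relaxed);
}

fn read_modify_write() -> u64 {
    // An RMW must hold the cache line exclusively for the whole
    // read + add + write, so it cannot be hidden the same way.
    DATA.fetch_add(1, Ordering::Relaxed)
}
```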

Non-blocking algorithm - Wikipedia is worth a read regarding those terms.

3 Likes

Google MESI protocol

2 Likes