There is an interesting remark here, which I would like to get to the bottom of:
"Also note that the Acquire fence here could probably be replaced with an Acquire load, which could improve performance in highly-contended situations. See 2."
I don't think the comment is correct. I think it shows a misunderstanding of what fence does. Specifically, although using a load with Acquire would be correct, it would not improve performance, because fence simply prevents the compiler from reordering operations that affect memory, and both the load and the fence will do the same thing (though a load might actually be less efficient).
But maybe I am wrong...
I think the problem here may be that almost everyone (certainly myself included) is not very expert on atomics and Acquire/Release, as it is very rare that you need to know about them or use them. I have been studying the topic today, though.
Anyway, do you think the comment is correct or not?
Incidentally, is there some simplifying way to think about Acquire and Release? My own tentative mental shortcut is that Release means "I am done writing, flush everything to shared memory", whereas Acquire means, very roughly, "I am going to be reading stuff, load everything from shared memory". I am not saying this is accurate, but it is perhaps a start to knowing what to do.
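One way to make that shortcut concrete is the usual "publish a value, then raise a flag" pattern. The sketch below is only an illustration: the function name `publish_and_read` and the variables `data` and `ready` are made up for this example. The writer stores the data and then sets the flag with Release; the reader spins on the flag with Acquire and only then reads the data.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::thread;

/// Hypothetical helper: one thread publishes `value`, another waits for it.
pub fn publish_and_read(value: u64) -> u64 {
    let data = AtomicU64::new(0);
    let ready = AtomicBool::new(false);

    thread::scope(|s| {
        // Writer: "I am done writing".
        s.spawn(|| {
            data.store(value, Ordering::Relaxed);
            // Release: everything written above becomes visible to whoever
            // observes this store with an Acquire load.
            ready.store(true, Ordering::Release);
        });

        // Reader: "I am going to be reading".
        let reader = s.spawn(|| {
            // Acquire: once we see `true`, the writer's earlier store to
            // `data` is guaranteed to be visible here.
            while !ready.load(Ordering::Acquire) {
                std::hint::spin_loop();
            }
            data.load(Ordering::Relaxed)
        });

        reader.join().unwrap()
    })
}
```

Calling `publish_and_read(42)` returns 42: the Release store and the Acquire load that observes it form the happens-before edge that makes the Relaxed access to `data` safe to rely on.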
That’s compiler_fence. But Arc uses fence, which also places the same constraint on the CPU, which may require some communication between the CPU cores to make sure pending stores have propagated to all cores, or whatever. I really don’t know the details on how that’s implemented.
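To show what fence buys you over compiler_fence, here is a rough sketch of the same kind of hand-off done with Relaxed operations plus fences. The name `publish_with_fences` and the variables are invented for the example. With `fence` this is a valid cross-thread synchronization; with `compiler_fence` in the same positions it would not be, since `compiler_fence` only constrains the compiler (its documented use is synchronizing with a signal handler on the same thread).

```rust
use std::sync::atomic::{fence, AtomicBool, AtomicU64, Ordering};
use std::thread;

/// Hypothetical helper: fence-based variant of a publish/consume hand-off.
pub fn publish_with_fences(value: u64) -> u64 {
    let data = AtomicU64::new(0);
    let ready = AtomicBool::new(false);

    thread::scope(|s| {
        s.spawn(|| {
            data.store(value, Ordering::Relaxed);
            // Release fence: orders the store above before the flag store
            // below, for the compiler *and* the CPU.
            fence(Ordering::Release);
            ready.store(true, Ordering::Relaxed);
        });

        let reader = s.spawn(|| {
            while !ready.load(Ordering::Relaxed) {
                std::hint::spin_loop();
            }
            // Acquire fence: pairs with the Release fence in the writer.
            // A compiler_fence here would not be enough across threads.
            fence(Ordering::Acquire);
            data.load(Ordering::Relaxed)
        });

        reader.join().unwrap()
    })
}
```

The fence applies to all the atomic operations around it rather than to one variable, which is the sense in which it is the stronger tool.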
Roughly, Release = what you’d do when releasing a lock, Acquire = what you’d do when acquiring a lock.
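To illustrate the lock analogy, here is a minimal spin lock sketch. `SpinCounter` and `run_spin_counter` are hypothetical names, and a real program should just use `std::sync::Mutex`. Taking the lock uses Acquire, releasing it uses Release.

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical spin lock guarding a plain (non-atomic) counter.
pub struct SpinCounter {
    locked: AtomicBool,
    value: UnsafeCell<u64>,
}

// Safety: `value` is only accessed while `locked` is held.
unsafe impl Sync for SpinCounter {}

impl SpinCounter {
    pub fn new() -> Self {
        SpinCounter { locked: AtomicBool::new(false), value: UnsafeCell::new(0) }
    }

    fn lock(&self) {
        // Acquire: we see everything the previous holder wrote before unlocking.
        while self.locked.swap(true, Ordering::Acquire) {
            std::hint::spin_loop();
        }
    }

    fn unlock(&self) {
        // Release: our writes are visible to whoever acquires the lock next.
        self.locked.store(false, Ordering::Release);
    }

    pub fn increment(&self) {
        self.lock();
        unsafe { *self.value.get() += 1 };
        self.unlock();
    }

    pub fn get(&self) -> u64 {
        self.lock();
        let v = unsafe { *self.value.get() };
        self.unlock();
        v
    }
}

/// Spawns `threads` threads that each increment the counter `per_thread` times.
pub fn run_spin_counter(threads: usize, per_thread: u64) -> u64 {
    let counter = Arc::new(SpinCounter::new());
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let c = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..per_thread {
                    c.increment();
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    counter.get()
}
```

With four threads doing 1000 increments each, `run_spin_counter(4, 1000)` returns 4000, because every increment happens after the previous holder's unlock.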
fence states:
"To achieve this, a fence prevents the compiler and CPU from reordering certain types of memory operations around it."
So... it seems this is all about reordering. So... I don't expect that a load could be more efficient than a fence, but maybe someone knows better. The subject is a little mysterious!
Have you read Mara Bos’s atomics book? Maybe there’s something in there.
As for why a load might be better than a fence: a fence does provide stricter guarantees. No clue if it’s actually more costly for the CPU to implement those guarantees, though. Famously, Acquire and Release are slower than Relaxed on ARM, while those three orderings have the same cost on x86. On both architecture families, SeqCst can be more expensive than the weaker orderings (on x86, for instance, a SeqCst store needs a locked instruction or a full fence, whereas a Release store is a plain mov). There’s probably similar “there’s no difference here, but it matters there” stuff with fence.
I just had a quick look, and the chapter on Arc uses fence.
It also explains why it uses fence though:
"We could use AcqRel memory ordering to cover both cases, but only the final decrement to zero needs Acquire, while the others only need Release. For efficiency, we’ll use only Release for the fetch_sub operation and a separate Acquire fence only when necessary."
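For reference, the pattern that quote describes looks roughly like the sketch below. This is a simplified illustration, not the actual std or book code: `MiniArc`, `Inner`, `Tracker` and `count_drops` are names made up here, and weak counts and overflow checks are left out. The commented-out line marks where the Acquire load from the original remark would go instead of the fence.

```rust
use std::ops::Deref;
use std::ptr::NonNull;
use std::sync::atomic::{fence, AtomicUsize, Ordering};

struct Inner<T> {
    count: AtomicUsize,
    data: T,
}

/// Hypothetical, stripped-down reference-counted pointer.
pub struct MiniArc<T> {
    ptr: NonNull<Inner<T>>,
}

unsafe impl<T: Send + Sync> Send for MiniArc<T> {}
unsafe impl<T: Send + Sync> Sync for MiniArc<T> {}

impl<T> MiniArc<T> {
    pub fn new(data: T) -> Self {
        let inner = Box::new(Inner { count: AtomicUsize::new(1), data });
        MiniArc { ptr: NonNull::from(Box::leak(inner)) }
    }
}

impl<T> Deref for MiniArc<T> {
    type Target = T;
    fn deref(&self) -> &T {
        unsafe { &self.ptr.as_ref().data }
    }
}

impl<T> Clone for MiniArc<T> {
    fn clone(&self) -> Self {
        let inner = unsafe { self.ptr.as_ref() };
        // Relaxed is enough for an increment (overflow check omitted here).
        inner.count.fetch_add(1, Ordering::Relaxed);
        MiniArc { ptr: self.ptr }
    }
}

impl<T> Drop for MiniArc<T> {
    fn drop(&mut self) {
        let inner = unsafe { self.ptr.as_ref() };
        // Every decrement is a Release: "I am done with the data".
        if inner.count.fetch_sub(1, Ordering::Release) != 1 {
            return;
        }
        // Only the thread that saw the count reach zero pays for Acquire.
        fence(Ordering::Acquire);
        // Variant discussed in this thread (instead of the fence):
        // inner.count.load(Ordering::Acquire);
        unsafe { drop(Box::from_raw(self.ptr.as_ptr())) };
    }
}

/// Test helper: counts how many times the payload is dropped.
struct Tracker(std::sync::Arc<AtomicUsize>);

impl Drop for Tracker {
    fn drop(&mut self) {
        self.0.fetch_add(1, Ordering::Relaxed);
    }
}

pub fn count_drops(threads: usize) -> usize {
    let drops = std::sync::Arc::new(AtomicUsize::new(0));
    let shared = MiniArc::new(Tracker(drops.clone()));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let local = shared.clone();
            std::thread::spawn(move || drop(local))
        })
        .collect();
    drop(shared);
    for h in handles {
        h.join().unwrap();
    }
    drops.load(Ordering::Relaxed)
}
```

However many threads drop their clones, `count_drops` should report exactly one drop of the payload. Whether the fence or the load is cheaper is a separate question that this sketch does not answer; it only shows where each would sit.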
I have read more of it now, especially chapter 7, "Understanding the processor". But I don't think it answers the original question of whether a load or a fence is to be preferred in Arc's Drop. On x86, though, the Acquire fence generates no instruction, so at least there I doubt a load is preferable.