Dynamically choose between Arc and Rc


#1

Hi All,

I have a simple data structure that can be used in both single-threaded and multi-threaded applications.

This data structure comes from the functional world, where it is common to have a garbage collector, and I have translated that into the use of reference counting.

Now I would like to use the standard Rc when the data structure is used in a single-threaded application and Arc when it is used in a multi-threaded application.

Is this possible without duplicating the code?

Thanks


#2

How about something like a type alias which maps to Rc or Arc depending on whether a “thread-safety” cargo feature is enabled?

#[cfg(feature = "thread-safety")]
type MyRc<T> = std::sync::Arc<T>;

#[cfg(not(feature = "thread-safety"))]
type MyRc<T> = std::rc::Rc<T>;

In this scheme, the reference-counting type is selected at compile time rather than at runtime, which is probably the best trade-off from the point of view of performance and compile-time verification, at the cost of some usage flexibility.
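
For illustration, here is a rough sketch of how the rest of the code might consume such an alias (the share function below is made up purely for illustration):

fn share<T>(value: T) -> (MyRc<T>, MyRc<T>) {
    // Both Rc and Arc provide new() and clone() with the same signatures,
    // so code written against MyRc<T> compiles unchanged whichever alias
    // the cargo feature selects.
    let first = MyRc::new(value);
    let second = MyRc::clone(&first);
    (first, second)
}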

An alternative would be to use Arc even in single-threaded mode; if it’s not on a hot code path, the performance impact will likely be negligible.


#3

Even then, is it guaranteed that Arc is going to be worse performance-wise? I thought the only problem with atomic types was that they prevent some Rust-specific optimizations.


#4

If you clone (and drop) a lot, particularly in fast/hot paths, then it’ll likely impact performance due to using atomic increment and decrement instructions.


#5

I remember hearing that on x86 all increment and decrement instructions are atomic (edit: apparently this is true as long as the integer is naturally aligned.)


#6
  1. I’m pretty sure this isn’t the case.
  2. Not every architecture is as forgiving as x86.

#7

While it is true that most instructions are atomic, that is usually not the issue at hand.

Think of two threads running on two different processors. Each processor will have its own L1 cache.

The same value may end up being modified in each of the two L1 caches, but what will be the final value?
Atomics are implemented, as far as I know, in a different part of memory, slower, but shared between all the processors.

I am not that sure about this, so please take it with a grain of salt, and if somebody listening has a better understanding than me, I would love to learn something new.

:slight_smile:

Anyhow, the compile-time solution is quite interesting, thanks :slight_smile:


#8

Not quite. In multicore CPUs, the private caches actually implement a kind of distributed reader-writer lock which allows them, given enough expensive broadcasting communication on the interconnect between CPU cores, to assert that a given L1 cache either has exclusive write access to a piece of data or shared read access.

Add to that the ability to track whether a piece of data in cache has been modified since it was loaded from RAM (so that the new value is eventually written back to RAM), and a null state for unused cache lines, and you get the full complexity of a cache coherence protocol.


#9

I think replacing all Rcs with Arcs in CPython led to 3x worse performance, if I remember this presentation correctly: https://www.youtube.com/watch?v=P3AyI_u66Bw


#10

Cache-coherent architectures (most of the mainstream ones) already synchronize their view of memory via cache coherence protocols. In addition, the protocol ensures that only a single core has exclusive access to a cacheline in order to write to it - this protocol is in play even when no atomic operations are involved.

What atomics add is ordering of memory reads/writes. This ordering is needed for two reasons: (a) compilers can reorder code and (b) CPUs can reorder operations.

Intel/AMD, for example, have strong memory models in the processor - the processor does not do much reordering of instructions. The only reordering it can do is make a load operation appear before a store operation. The reason this happens is that these CPUs employ a store buffer - a limited-depth buffer local to a core that holds pending writes. While a write is in the buffer, it is not visible to other cores until it is drained out of the buffer (into the L1 cache, where the coherence protocol picks up on it). A subsequent load, however, can be resolved from the cache hierarchy, which gives the appearance of the load happening before the store. In addition, the CPU that wrote an entry to the store buffer can satisfy a load of that memory location out of its own buffer.

So, on Intel/AMD, to ensure that a store-load is not reordered, you need to insert a synchronizing instruction between the store and the load (e.g. mfence, lock prefixed instruction, etc). What this ends up doing is making the CPU wait until the store buffer is drained before continuing the instruction stream. In order for an entry to drain out of the store buffer, the CPU must have the target cacheline(s) brought in exclusive state. Once the store buffer drains, we know other CPUs can see the data and so we preserve the Store-Load order. Note that the store buffer is always being drained by the CPU - this doesn’t “force it” to drain (common misconception); it simply “pauses” the CPU until the buffer is drained. This pause (stall) is what can really hamper performance on Intel/AMD processors because it kills the out-of-order/pipelined execution capability.
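
To make the store-load reordering concrete, here is a rough sketch of the classic litmus test (my own illustration, not something specific to Rc/Arc; the Relaxed orderings are deliberately weak so the reordering is allowed):

use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

static X: AtomicUsize = AtomicUsize::new(0);
static Y: AtomicUsize = AtomicUsize::new(0);

fn main() {
    // Each thread stores to one variable and then loads the other.
    // With Relaxed orderings, both threads may read 0: each store can sit
    // in its core's store buffer while the following load is satisfied from
    // cache. Using Ordering::SeqCst on all four accesses forbids the 0/0
    // outcome, at the cost of draining the store buffer (mfence/lock on x86).
    let t1 = thread::spawn(|| {
        X.store(1, Ordering::Relaxed);
        Y.load(Ordering::Relaxed)
    });
    let t2 = thread::spawn(|| {
        Y.store(1, Ordering::Relaxed);
        X.load(Ordering::Relaxed)
    });
    let (r1, r2) = (t1.join().unwrap(), t2.join().unwrap());
    println!("r1 = {}, r2 = {}", r1, r2); // r1 == 0 && r2 == 0 is possible
}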


#11

Choosing at compile time like @quadrupleslap suggested would be ideal.

Technically it is possible to choose dynamically, even at run time, without any code duplication. However, that level of dynamism may be more expensive than Arc itself, because you’d have to use dynamic dispatch and would end up with double indirection.

  1. Make a trait with the interface of Rc/Arc
  2. Implement that trait for both Rc and Arc
  3. Use your trait instead of actual types

If you use static dispatch, fn foo<R: MyRcTrait<T>>(rc: R), then the source code is written once, and the machine code is compiled twice: once for the Rc version and once for the Arc version.

If you use dynamic dispatch, Box<dyn MyRcTrait<T>> / &dyn MyRcTrait<T>, then it is written once, compiled once, and expensively dynamic at runtime.
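
Here is a rough sketch of the static-dispatch variant, assuming a hypothetical MyRcTrait that only exposes construction and borrowing (a real data structure would need whichever parts of the Rc/Arc API it actually uses):

use std::rc::Rc;
use std::sync::Arc;

trait MyRcTrait<T>: Clone {
    fn new(value: T) -> Self;
    fn get(&self) -> &T;
}

impl<T> MyRcTrait<T> for Rc<T> {
    fn new(value: T) -> Self { Rc::new(value) }
    fn get(&self) -> &T { &**self }
}

impl<T> MyRcTrait<T> for Arc<T> {
    fn new(value: T) -> Self { Arc::new(value) }
    fn get(&self) -> &T { &**self }
}

// Static dispatch: written once, monomorphised separately for Rc and Arc.
fn describe<R: MyRcTrait<String>>(rc: &R) -> usize {
    rc.get().len()
}

fn main() {
    let rc: Rc<String> = MyRcTrait::new("single-threaded".to_string());
    let arc: Arc<String> = MyRcTrait::new("multi-threaded".to_string());
    println!("{} {}", describe(&rc), describe(&arc));
}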


#12

It says 30% worse @ 12 minutes


#13

@TyOverby once prototyped a crate with a ref-counted pointer that does non-atomic operations for counting references within a thread, but also has an atomic cross-thread counter that is touched only when sharing a pointer across threads. (Conceptually it’s similar to Rc<Arc<T>>, but without the double allocation.)
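
For intuition, a very rough sketch of the Rc<Arc<T>> mental model (with the double allocation that the prototype reportedly avoids):

use std::rc::Rc;
use std::sync::Arc;
use std::thread;

fn main() {
    // Clones within a thread only touch the cheap non-atomic Rc count;
    // the atomic Arc count is only bumped when handing data to another thread.
    let local: Rc<Arc<Vec<u8>>> = Rc::new(Arc::new(vec![1, 2, 3]));

    let same_thread = Rc::clone(&local);     // non-atomic increment
    let cross_thread = Arc::clone(&*local);  // atomic increment, Send-able

    let handle = thread::spawn(move || cross_thread.len());
    println!("{} {}", same_thread.len(), handle.join().unwrap());
}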


#14

The performance difference with Arc has less to do with the atomic operations being slower to execute on the CPU (the way x86 works, many instructions are atomic anyway, and performance only takes a hit if there is actual contention with another core) and more to do with the compiler’s inability to optimise them away:

With an Rc, the compiler can trivially coalesce adjacent increments/decrements to the counter and remove them entirely. With atomics it can’t do that because the effect might be observable on another thread.
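
A small made-up example of the kind of pattern in question:

use std::rc::Rc;

fn touch(data: &Rc<Vec<u8>>) -> usize {
    // With Rc, the clone/drop pair below only bumps a non-atomic counter up
    // and then back down, so the compiler is allowed to remove it entirely.
    // With Arc, the atomic increment and decrement must stay, because another
    // thread could in principle observe the intermediate reference count.
    let local = Rc::clone(data);
    local.len()
}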


#15

In theory it should be able to do a lot of that with Arc, given that Arc’s increments (but not decrements) use the ‘relaxed’ ordering… but maybe LLVM is not smart enough?


#16

The impact on the compiler is certainly there, but whether it or the CPU cost dominates is going to be circumstantial. The fact that x86 has “atomic” accesses on aligned word-sized data isn’t really what the atomic instructions are about, as I mentioned upthread. Enforcing memory order is what costs you. Modern x86 has a crazy out-of-order execution engine, so even if the compiler can’t remove some basic operations (e.g. coalesce adjacent increments of the refcount), the core might be able to hide them. In particular, once the first increment cache-faults (let’s assume that as the worst case), the second increment is going to hit L1, and the increment itself is 1 cycle.

The contention (over cachelines) is still there even with non-atomic operations. Try benchmarking multiple threads reading and writing to the same cacheline with plain instructions (so a data race) vs doing the same to their own cachelines and you’ll see performance tank as the cachelines ping-pong across the cores.

I’m pretty sure that the memory model doesn’t require that each write is made visible individually. Whether another CPU can see two writes vs one (coalesced) is up to scheduling, and there’s no way to arrange for that in code that doesn’t explicitly communicate with happens-before edges. Of course, compilers tend to play it somewhat conservatively with atomic operations. But see http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4455.html for a more thorough treatment.


#17

Having tried a variant of this experiment personally, I must advise interested readers against trying it out in Rust. Prefer a more “stupid” language, like C or asm.

Why, you may ask? Because a data race is undefined behaviour, and if LLVM’s optimizer sees an obvious one, it will feel within its rights to translate the code into complete gibberish as punishment. Like, for example, turning finite loops into infinite ones in release mode.


#18

:slight_smile: You should be able to hide/obfuscate the race by using raw ptrs and/or marking leaf functions doing the write as #[inline(never)]. But certainly Rust isn’t needed to observe the coherence costs under such workloads.


#19

Raw pointers are not enough. My experiment used an UnsafeCell, as AFAIK that’s the only legal way to have aliased mutable state in Rust (any other code may break in the future once LLVM’s noalias attribute is fixed and re-enabled in rustc). So I was already using pointers.

#[inline(never)] might work, but for this purpose it is a bit ugly as it forces function call overhead into the microbenchmark.

Another thing I would have tried, if I hadn’t found another way around my specific problem, is ptr::read_volatile/write_volatile, which is a way to state “don’t optimize this access out, as some unknown hardware may be listening to the memory bus”. But it only eliminates one specific kind of harmful optimization (namely, optimizing out writes which no one should be able to read), and a sufficiently malicious compiler with the high-level knowledge that a data race is going on could still trash the code on the grounds that it’s UB…

What I ultimately did on my side was to remind myself that on all popular CPU architectures, an aligned read or write of at most word size is always atomic, which means that a Relaxed read or write of an AtomicXyz variable actually translates to a plain read/write. So I stopped fighting the system and rewrote my artificial data race as a pair of atomic accesses to two atomic variables far apart in memory, crossing fingers that compilers won’t be turning pairs of atomic accesses into atomic transactions anytime soon :stuck_out_tongue:
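
For anyone wanting to reproduce that kind of comparison, here is a rough, untested sketch along those lines (the 128-byte padding, iteration count and Relaxed load/store pairs are my own choices, not the original experiment):

use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;
use std::time::Instant;

// Two counters that share a cache line (false sharing)...
static SHARED: [AtomicU64; 2] = [AtomicU64::new(0), AtomicU64::new(0)];

// ...versus two counters padded far enough apart to sit on different lines.
#[repr(align(128))]
struct Padded(AtomicU64);
static FAR_APART: [Padded; 2] = [Padded(AtomicU64::new(0)), Padded(AtomicU64::new(0))];

fn hammer(counter: &AtomicU64, iters: u64) {
    for _ in 0..iters {
        // Relaxed loads/stores compile to plain moves on x86; what this
        // measures is cache-line ping-pong, not memory ordering cost.
        counter.store(counter.load(Ordering::Relaxed) + 1, Ordering::Relaxed);
    }
}

fn time_pair(a: &'static AtomicU64, b: &'static AtomicU64, iters: u64) -> std::time::Duration {
    let start = Instant::now();
    let t1 = thread::spawn(move || hammer(a, iters));
    let t2 = thread::spawn(move || hammer(b, iters));
    t1.join().unwrap();
    t2.join().unwrap();
    start.elapsed()
}

fn main() {
    let iters = 50_000_000;
    println!("same line: {:?}", time_pair(&SHARED[0], &SHARED[1], iters));
    println!("far apart: {:?}", time_pair(&FAR_APART[0].0, &FAR_APART[1].0, iters));
}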


#20

You can also do the experiment by writing to adjacent array/vec slots to induce false sharing - the net coherence effect is the same as if you’re racing on a single location.

As for #[inline(never)] forcing function call overhead into the microbenchmark: sure, but the benchmark isn’t there to measure absolute timing - it’s to compare the relative cost of false sharing (either direct or indirect) vs not. Having a constant overhead in both cases is fine, IMO.