Combining AtomicPtr with reference counting?

I’m interested in using Rust for realtime audio work. In these use cases the audio rendering thread has to be absolutely non-blocking and can kick in at any moment, potentially descheduling all other running threads, so your data structures have to be in a valid state at all times. As such, you’d generally use constructs based on (lock-free!) atomics. For example, a volume slider communicates a single value, so you’d use an AtomicFloat. Locks/mutexes are generally avoided, often like the plague, as anything that blocks your audio thread causes trouble.

This gets tricky whenever the data you’re interested in gets too large to update atomically (generally >64 bits), such as an audio sample. Whenever your non-audio thread changes the sample, you want your audio callback to read either AAAAAAAA or BBBBBBBB, but never AAABBBBB (which could happen if one of your threads is descheduled mid-write). You can do this by creating an indirection through an atomic pointer: it points to your object, and now your realtime callback always reads either object A or object B, never some halfway-in-between state. A more in-depth and much better explanation can be found here:
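To make the indirection concrete, here is a minimal sketch using std's AtomicPtr. The Sample type and SampleSlot wrapper are made up for illustration; note that safely reclaiming the old allocation is exactly the hard part this thread is about, and the naive version below frees it immediately.

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

// Hypothetical "large" value: too big for a single atomic store.
struct Sample {
    data: [f32; 8],
}

struct SampleSlot {
    ptr: AtomicPtr<Sample>,
}

impl SampleSlot {
    fn new(s: Sample) -> Self {
        SampleSlot { ptr: AtomicPtr::new(Box::into_raw(Box::new(s))) }
    }

    // Writer (non-audio thread): build the replacement off to the side,
    // publish it with one atomic swap. Readers observe either the old
    // or the new Sample, never a torn mix of both.
    // CAVEAT: dropping the old box right here is unsound if a reader
    // still holds a reference to it -- the reclamation problem.
    fn store(&self, s: Sample) {
        let old = self.ptr.swap(Box::into_raw(Box::new(s)), Ordering::AcqRel);
        unsafe { drop(Box::from_raw(old)) };
    }

    // Reader (audio callback): one wait-free atomic load, then deref.
    fn load(&self) -> &Sample {
        unsafe { &*self.ptr.load(Ordering::Acquire) }
    }
}

impl Drop for SampleSlot {
    fn drop(&mut self) {
        // Reclaim whatever the slot currently points at.
        unsafe { drop(Box::from_raw(*self.ptr.get_mut())) };
    }
}
```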

Atomic pointers work well, but several decades of experience have taught us that handling raw pointers is a very bad idea. In my program I have a lot of structures that I reference in multiple places, so I would like to use reference counting. Now I’m trying to create some sort of Arc/AtomicPtr hybrid that uses reference counting for non-realtime threads and allows wait-free, atomic “fleeting” references on realtime threads:

  • Non-realtime threads should interact with the value as if it were an Arc (i.e. reference counted)
  • A realtime thread should be able to get a valid reference from get_fleeting at any time, without waiting
  • This object should not be deleted while the realtime thread is using it
  • The destructors can only be run on non-realtime threads

This playground shows more or less the interface I have in mind (RtUsableArc) and a toy problem, but I still have two problems:

  • There is a data race in load(): thread 1 could load the pointer, thread 2 could drop that Arc/object, and thread 1 would then try to transmute the now-dangling pointer back into an Arc
  • I need a way to not run the destructor in safe_delete until all fleeting references are gone (currently it simply waits a few ms, although I plan on moving that to some form of epoch-based reclamation, as in crossbeam)
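For readers without the playground open, this is roughly the shape being described; a hedged sketch only. The names get_fleeting and store follow the post, the implementation is deliberately naive, and the comment marks exactly where the data race from the first bullet lives.

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicPtr, Ordering};

// Sketch of the RtUsableArc idea: Arc semantics for non-realtime
// threads, wait-free "fleeting" borrows for the realtime thread.
struct RtUsableArc<T> {
    ptr: AtomicPtr<T>,
}

impl<T> RtUsableArc<T> {
    fn new(value: T) -> Self {
        RtUsableArc {
            ptr: AtomicPtr::new(Arc::into_raw(Arc::new(value)) as *mut T),
        }
    }

    // Realtime thread: borrow without touching the refcount.
    // DATA RACE: between the load and the deref, a writer could swap
    // the pointer and drop the last Arc, freeing the object under us.
    fn get_fleeting(&self) -> &T {
        unsafe { &*self.ptr.load(Ordering::Acquire) }
    }

    // Non-realtime thread: publish a new value. Dropping the old Arc
    // here is unsound while a fleeting reference is still alive --
    // this is the safe_delete problem from the second bullet.
    fn store(&self, value: T) {
        let new = Arc::into_raw(Arc::new(value)) as *mut T;
        let old = self.ptr.swap(new, Ordering::AcqRel);
        unsafe { drop(Arc::from_raw(old)) };
    }
}

impl<T> Drop for RtUsableArc<T> {
    fn drop(&mut self) {
        unsafe { drop(Arc::from_raw(*self.ptr.get_mut())) };
    }
}
```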

Any ideas on how to best tackle this problem?


Have you looked at the arc-swap crate?


How many readers / writers do you need? Do you need a get operation, i.e. the stored value implements Copy and is only replaced by writers? How often do you read / write?

I have looked into arc-swap and used it for a bit, but felt a bit uncertain about the performance characteristics of Guard. I also ran into trouble using it with Serde (which I'm not willing to forfeit too easily), and getting it to work required some ugly code.

I'm still considering it as a backup or temporary option though.

I'd like the solution to be as generic as possible. In general no more than a few realtime reading threads (most often just one), reading several hundred times a second. The non-realtime accesses are considerably rarer, but it's hard to give good estimates for those.

I'm not entirely sure what you mean with

Do you need a get operation, i.e. the stored value implements Copy and is only replaced by writers?

Did you check the example? The referenced object could be anything and doesn't have to implement Copy.

Now, I did. No Copy, then. Does the actual stored type implement Send + Sync? If not, it's impossible to provide a safe interface that hands out a reference, because interior mutability in a type that isn't thread-safe can lead to data races if not handled correctly.

You said no locking, but I assume that's OK for the non-real-time threads. I have two possible solutions in mind for the no-deallocation-in-real-time-threads issue. The first one requires a linked list, the other conditional locking in a non-real-time thread. I don't know yet which solution is better, but my guess is that the locking one is faster (on the real-time thread, at least).
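The "conditional locking in a non-real-time thread" idea could look roughly like the sketch below. This is my assumption of the approach, not the poster's actual code: realtime readers only do wait-free counter operations, while the non-realtime deleter blocks (here by polling) until the count drains to zero before running the destructor. The race between loading the pointer and bumping the counter is still unsolved here.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;
use std::time::Duration;

// Counts live "fleeting" references held by realtime threads.
struct FleetingCount(AtomicUsize);

impl FleetingCount {
    // Realtime side: wait-free, no allocation, no syscalls.
    fn enter(&self) {
        self.0.fetch_add(1, Ordering::Acquire);
    }
    fn exit(&self) {
        self.0.fetch_sub(1, Ordering::Release);
    }

    // Non-realtime side: blocking is acceptable here, per the post.
    // Only after this returns is it safe to run the destructor.
    fn wait_until_unused(&self) {
        while self.0.load(Ordering::Acquire) != 0 {
            thread::sleep(Duration::from_millis(1));
        }
    }
}
```

A condition variable could replace the polling loop, at the cost of the realtime exit() having to notify it; the linked-list variant would instead defer the drop to a reclamation list the non-realtime thread sweeps.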

Author of arc-swap here. Can you be more specific? Is there something I can improve, or document better? With that serde ugly code, can you open an issue and show it there?

Thanks for the info 🙂


Oh wow I feel honoured, the maintainer himself.

I guess I'm just a bit too paranoid about the Guard.

  • Especially overrunning the number of "cheap" slots without noticing and accidentally creating a locking situation sounded a bit scary. But on second look, that isn't actually a danger and it just falls back to load_full(), which is wait-free?
  • When using the DefaultStrategy, how many "cheap" slots do I have?
  • Also, compared to simply taking the raw pointer and dereferencing it (however unsafe that may be), there is little overhead to be expected? What about with load_full()?

My main issue using Serde was the orphan rules: I'm not allowed to implement Serialize/Deserialize for ArcSwap myself, which led to ugly workarounds. Serde implements those traits for Arc/Rc (it has to be explicitly enabled with features = ["rc"]), so I guess if they could be implemented for ArcSwap my problem would be solved. It shouldn't be too difficult, although I'm not sure whether they should be behind an explicit feature too?

Especially overrunning the number of "cheap" slots without noticing and accidentally creating a locking situation sounded a bit scary. But on second look, that isn't actually a danger and it just falls back to load_full(), which is wait-free?

Yes, it's still wait-free, just a bit slower.

When using the DefaultStrategy, how many "cheap" slots do I have?

Each thread has 8 (that is, you can keep 8 different Guards around in each thread at once and they will all still fit).

Also, compared to simply taking the raw pointer and dereferencing it (however unsafe that may be), there is little overhead to be expected? What about with load_full()?

Yes. The biggest cost is 1–2 SeqCst read-modify-write atomic operations. load_full additionally increments the reference count on the Arc, which might be slow if multiple threads do that at once.

These benchmarks compare different methods of acquiring something; you could add that unsafe load variant and try it: https://github.com/vorner/arc-swap/blob/master/benches/background.rs

For serde… I will think about that.

Unless you need to de/serialize through multiple layers of reference counting, you're almost certainly better off serializing &*rc than &rc (i.e. serializing &T rather than Arc<T>). The rc feature for serde de/serializes the (A)rc as if it were just a Box, and is rarely what you actually want for deserialization.

Furthermore, it's not actually all that difficult (thanks to #[serde(with)]) to still derive Serialize in a way that just ignores the Arc wrapping layer. Deserialize is always going to be hard if you want actual sharing.