Premature optimization — trying to understand atomics and ordering

This is the first time I'm using atomics in any language. I'm hoping to receive some feedback on how I understand them.

Literature that I've already read:


In my application I found the need for a set data structure shared between threads[1]. The set only needs to support insert and remove (remove for existence checking, each item only needs to be checked once).

In its current form it's an Arc<Mutex<HashSet<T>>>. In actual usage, I found that the set is going to be empty most of the time. This leads me to think that, if I could separately track whether the set is empty, most of the checking could perhaps be done without locking and therefore be faster[2][3].

Below is a version I came up with (for the sake of readability I'm using u32 for T and thus omitting all the HashSet trait bounds):

use std::collections::HashSet;
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering::*};

use parking_lot::Mutex;

pub struct SharedSet {
    set: Arc<Mutex<HashSet<u32>>>,
    is_empty: Arc<AtomicBool>,
}

impl SharedSet {
    pub fn insert(&self, value: u32) {
        let mut set = self.set.lock();
        self.is_empty.store(false, SeqCst);
        set.insert(value);
    }

    /// Returns true if the value was present in the set.
    pub fn remove(&self, value: &u32) -> bool {
        if self.is_empty.load(Relaxed) {
            false
        } else {
            self.remove_slow(value)
        }
    }

    fn remove_slow(&self, value: &u32) -> bool {
        let mut set = self.set.lock();
        let contained = set.remove(value);
        self.is_empty.store(set.is_empty(), Release);
        contained
    }
}

Here are my thought processes on each part:

insert and SeqCst

pub fn insert(&self, value: u32) {
    let mut set = self.set.lock();
    self.is_empty.store(false, SeqCst);
    set.insert(value);
}

This sets is_empty to false before inserting the value. The intuition, and how I initially wrote it, was to set the flag afterwards:

set.insert(value);
self.is_empty.store(false, Release);

However, I believe this would allow a situation in which a remove from another thread checks is_empty between set.insert and is_empty.store and incorrectly takes the fast path. Essentially, setting the flag before insertion prevents "false positives" (believing the set is empty when it is not), which are unacceptable for the application, in exchange for "false negatives" (believing the set is non-empty when it could be empty), which only result in remove taking the slow (albeit correct) path.
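For reference, the flag-first ordering can be exercised in isolation. Below is a minimal self-contained sketch of my own (using std::sync::Mutex instead of parking_lot so it compiles without dependencies; note .lock() then returns a Result) that checks the invariant the paragraph above relies on: the flag never reads true while the set is non-empty.

```rust
use std::collections::HashSet;
use std::sync::Mutex;
use std::sync::atomic::{AtomicBool, Ordering::*};

struct SharedSet {
    set: Mutex<HashSet<u32>>,
    is_empty: AtomicBool,
}

impl SharedSet {
    fn insert(&self, value: u32) {
        let mut set = self.set.lock().unwrap();
        // Clear the flag *before* inserting: a concurrent `remove` that
        // loads `is_empty` at this point sees `false` and falls back to
        // the slow path (a harmless false negative). With the
        // flag-after order, it could see `true` while the set already
        // holds `value` (a false positive).
        self.is_empty.store(false, SeqCst);
        set.insert(value);
    }
}

fn main() {
    let s = SharedSet {
        set: Mutex::new(HashSet::new()),
        is_empty: AtomicBool::new(true),
    };
    s.insert(5);
    // Invariant: the flag may never read `true` while the set is non-empty.
    assert!(!s.is_empty.load(SeqCst));
    assert!(s.set.lock().unwrap().contains(&5));
}
```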

My question is: is SeqCst necessary here to prevent the aforementioned "false positives", i.e. to prevent the set from being populated while is_empty is still true? If I interpret this correctly, that could theoretically happen due to compiler or hardware reordering. I understand that a Release store ensures operations before it are visible ("happen-before") to another thread (after a corresponding Acquire load), but I think here I am trying to achieve somewhat the opposite: ensuring set.insert always comes after the flag is set ("NOT happen-before"?) and is never somehow flipped, however unlikely that might be, and I don't see how this could be done with any other ordering[4]. Meanwhile, a lot of places seem to suggest that SeqCst is rarely necessary, hence why I'm so iffy about this.

remove and Relaxed

pub fn remove(&self, value: &u32) -> bool {
    if self.is_empty.load(Relaxed) {
        false
    } else {
        self.remove_slow(value)
    }
}

This does a Relaxed load to determine whether to take the fast path. My reasoning for not using a stronger ordering: because insert and remove_slow already use stronger orderings to guarantee that the value of is_empty is logically consistent with the state of set, a Relaxed read is allowed here, the worst-case scenario being the occasional false negative. Is this correct?

Any resources, and critique on how I use the terminologies, are also appreciated. Thanks!


  1. The actual problem: this is a packet-processing program and I need to track certain packets to prevent them from being processed more than once, except it is not possible to modify or tag them, so I decided to temporarily store seen packets in this set and compare subsequent packets against it. ↩︎

  2. I titled this "premature optimization." I doubt that lock contention is going to be particularly noticeable during real-world use. I am also aware of mature libraries that implement concurrent sets/maps, so this is more about me trying to understand atomics. Still, in a benchmark running remove in a hot loop across multiple threads, in the best-case scenario (the set is empty for all but the first check), the optimized version takes 4% of the wall time of the mutex version. ↩︎

  3. RwLock didn't make sense at the time because, if an item is indeed in the set after checking, a write lock is still needed to remove it. Although, given the usage pattern, one could argue that upgrading the read to a write in such cases would still eliminate most of the contention and would likely suffice. ↩︎

  4. I did also ask Claude Sonnet and GPT-5 about this, except neither conversation felt trustworthy. ChatGPT did suggest that is_empty.store(false, SeqCst) be replaced with an acquire fence: is_empty.store(false, Release); fence(Acquire), but I don't see how this is correct at all. ↩︎
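The read-then-upgrade pattern mentioned in footnote 3 could be sketched as follows (my own illustration, not code from the thread; std::sync::RwLock has no atomic upgrade, so the presence check must be repeated after reacquiring as a writer):

```rust
use std::collections::HashSet;
use std::sync::RwLock;

/// Returns true if `value` was present and has been removed.
fn check_and_remove(set: &RwLock<HashSet<u32>>, value: &u32) -> bool {
    // Fast path: a shared read lock for the common "not present" case.
    // The read guard is a temporary in the `if` condition and is
    // dropped before the write lock below is taken.
    if !set.read().unwrap().contains(value) {
        return false;
    }
    // Slow path: take the write lock and re-check via `remove`, since
    // another thread may have removed the value in between.
    set.write().unwrap().remove(value)
}

fn main() {
    let set = RwLock::new(HashSet::from([7]));
    assert!(check_and_remove(&set, &7));
    assert!(!check_and_remove(&set, &7));
}
```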

Let's start here: what specific program executions are you trying to prevent?

The one where "thread 2 might have seen a value in the set, but the flag is_empty was still up so thread 2 did not look" is not distinguishable from "thread 2's remove completed before thread 1 even started inserting a value".

No. Your program has only one atomic variable, so SeqCst is the same as using Acquire and/or Release.

This is fundamentally wrong. A release write only makes a difference when paired with an acquire load of the same atomic variable. If there is no acquire load in your program, then a release write is equivalent to a relaxed write.


Your program is fine. All writes to the atomic are done with the mutex held, so there is no risk that it has the wrong value. Relaxed is sufficient.
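Following that advice, the whole structure could be written with Relaxed everywhere. Here is a sketch under that reading (using std::sync::Mutex instead of parking_lot so it is dependency-free; the behavior is otherwise the same):

```rust
use std::collections::HashSet;
use std::sync::Mutex;
use std::sync::atomic::{AtomicBool, Ordering::Relaxed};

pub struct SharedSet {
    set: Mutex<HashSet<u32>>,
    is_empty: AtomicBool,
}

impl SharedSet {
    pub fn new() -> Self {
        SharedSet {
            set: Mutex::new(HashSet::new()),
            is_empty: AtomicBool::new(true),
        }
    }

    pub fn insert(&self, value: u32) {
        let mut set = self.set.lock().unwrap();
        // All stores to the flag happen with the mutex held, so Relaxed
        // suffices: the mutex itself provides the synchronization.
        self.is_empty.store(false, Relaxed);
        set.insert(value);
    }

    /// Returns true if the value was present in the set.
    pub fn remove(&self, value: &u32) -> bool {
        // A stale `false` only sends us down the (correct) slow path; a
        // `true` can only be observed in executions indistinguishable
        // from `remove` simply having run earlier.
        if self.is_empty.load(Relaxed) {
            return false;
        }
        let mut set = self.set.lock().unwrap();
        let contained = set.remove(value);
        self.is_empty.store(set.is_empty(), Relaxed);
        contained
    }
}

fn main() {
    let s = SharedSet::new();
    assert!(!s.remove(&1)); // empty: fast path
    s.insert(1);
    assert!(s.remove(&1)); // present: slow path removes it
    assert!(!s.remove(&1)); // empty again
}
```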


Thank you. I do still have a follow-up question regarding "instruction reordering."


Within insert:

self.is_empty.store(false, Relaxed);
set.insert(value);

Is the reordering of the two statements during execution possible? Is "reordering" even relevant here? In other words, I'm thinking of the following sequence:

  1. Thread 1 (insert): set.lock()
  2. Thread 1 (insert): set.insert(value) (reordered with 4?)
  3. Thread 2 (remove): is_empty.load(Relaxed) == true (takes the fast path, which is incorrect)
  4. Thread 1 (insert): is_empty.store(false, Relaxed) (reordered with 2?)

I don't think that's actually incorrect. It's impossible to tell the difference between this execution and the execution where is_empty.load() happened a bit earlier.


Either way, instruction reordering is the wrong way to think about atomics. There's a much better way:

  1. Instructions execute in exactly the order you wrote them.
  2. But, atomic operations might read an old value, unless there is synchronization between the atomic read reading the value, and the atomic write that wrote the value.
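A small illustration of point 2 (my own sketch, not from the thread): without synchronization, a Relaxed load may legally return a stale value, but once something establishes a happens-before edge (here, joining the writing thread), the load is guaranteed to observe the write.

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicU32, Ordering::Relaxed};
use std::thread;

fn write_then_read() -> u32 {
    let x = Arc::new(AtomicU32::new(0));
    let writer = {
        let x = Arc::clone(&x);
        thread::spawn(move || x.store(42, Relaxed))
    };
    // Before the join, a Relaxed load here could legally still see 0.
    writer.join().unwrap();
    // `join` synchronizes with the end of the writer thread, so this
    // load must observe the store, even though both are Relaxed.
    x.load(Relaxed)
}

fn main() {
    assert_eq!(write_then_read(), 42);
}
```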

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.