Since you're using your atomic for destroying a value, you might want to check out how Arc implements this. They also use an atomic to control when the value should be destroyed.
I think there is some confusion here regarding terminology and what the synchronizes-with relationship actually means. Preshing's blog post explains it really well: The Synchronizes-With Relation
I think the key point is to consider what you want to achieve (again taking Preshing's examples): you can understand the release as the "git push" of all preceding stores and the acquire as the "git pull", so that the following reads can see the "pulled" state. As such, preceding stores cannot be reordered with a following release, and reads cannot be reordered with a preceding acquire. The C++ memory model guarantees this via the synchronizes-with relationship for acquire/release ordering.
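To make the push/pull picture concrete, here is a minimal sketch (my own example, not taken from Preshing's post):

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};

static DATA: AtomicU32 = AtomicU32::new(0);
static READY: AtomicBool = AtomicBool::new(false);

// "git push": everything stored before the Release store is published.
fn producer() {
    DATA.store(42, Ordering::Relaxed);
    READY.store(true, Ordering::Release);
}

// "git pull": if the Acquire load reads true, it synchronizes-with the
// Release store, so the store to DATA is guaranteed to be visible.
fn consumer() {
    if READY.load(Ordering::Acquire) {
        assert_eq!(DATA.load(Ordering::Relaxed), 42);
    }
}
```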
Again, refer to Preshing's post: synchronizes-with is always a runtime property, as the acquire/release operations may be executed in any possible interleaving by the concurrent threads. What matters (and what actually realizes the relationship) is what you do with the acquired value, i.e. whether you conditionally execute further code depending on the acquired value (and thus form some type of synchronization).
// Also note that the Acquire fence here could probably be replaced with an
// Acquire load, which could improve performance in highly-contended
// situations. See [2].
//
// [1]: (www.boost.org/doc/libs/1_55_0/doc/html/atomic/usage_examples.html)
// [2]: (https://github.com/rust-lang/rust/pull/41714)
acquire!(self.inner().strong);
If I understand it right, then std normally uses a fence instead of an atomic operation? It looks like there has been some debate on whether fences or atomics perform better.
The approach with the $x.load(Acquire) seems to be similar to what I proposed here:
But I'm not sure if that's the same or similar. Also not sure if I could do a Relaxed load in assert_txn_backend. I suppose I could. (Edit: I think the whole approach doesn't work, because if self.closed.load(Ordering::Acquire) reads false, there is no synchronization. But again, not sure.)
To me, synchronizes-with is a formalism that helps judging whether something happens-before something else. Could you point to where there was a particular misunderstanding or wrong conclusion? Do you refer to what I wrote here:
Or also to other things?
I'm not sure if that's right. The above code from the Rust repository discards the loaded value:
This was purely about the statement that "[...] a release store doesn't always synchronize-with a load [...]". I found both uses (store synchronizes-with / read synchronizes-with) to be common.
I guess my post was misleading here. What I meant was that the happens-before relationship is not necessarily meaningful for the actual ordering of the control flows. You have to use the atomic operations to facilitate some ordering (an atomic store on the releasing side and an atomic read with some conditional control flow on the consuming side). What the acquire/release semantics then give you is a guarantee regarding the reordering (or better: not reordering) of the stores/loads before/after the atomic operations, and thus, what becomes visible to all threads, and when.
As far as I understand the memory ordering facilities in Rust, your code snippet shows an acquire memory fence (/barrier). You have to combine these fences with an additional atomic operation to get something comparable to the scenario I've described earlier (see: Acquire and Release Fences).
The store(true) happens-before the load because they appear in that order on the same thread. This means that the load is guaranteed to see the store. The only way you can get a false is if some other thread wrote a false between the store and load.
That sounds correct to me. Synchronizes-with is one way to establish a happens-before relationship.
However, one potential issue here is that you're using a store rather than a read-modify-write operation like in Arc. This is necessary for the load to synchronize-with anything that comes before the store.
The logic in Arc is that, since the ref-count decrement reached zero, it must be the last decrement in the total order for the atomic. Furthermore, the last refcount decrement happens-before the acquire fence. We can conclude that the acquire fence must synchronize with all of the refcount decrements that have ever happened on the Arc. Therefore, anything that happens-before any refcount decrement will also happens-before the acquire fence/load.
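In code, that logic might look roughly like this (a simplified sketch, not the actual std implementation):

```rust
use std::sync::atomic::{fence, AtomicUsize, Ordering};

struct Inner {
    strong: AtomicUsize,
    // ... the shared data would live here
}

fn drop_handle(inner: &Inner) {
    // Every handle's decrement is a Release RMW, so all uses of the data
    // that preceded it are "published".
    if inner.strong.fetch_sub(1, Ordering::Release) != 1 {
        return;
    }
    // Only the last decrement observes 1 -> 0. The Acquire fence then
    // synchronizes-with all earlier Release decrements, so everything that
    // happened-before any of them also happens-before the destruction.
    fence(Ordering::Acquire);
    // ... destroy the shared data here
}
```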
I'm not sure which load you are talking about. When I wrote:
Then I meant this:
if self.closed.load(Ordering::Acquire) || self.inner_txn != txn_backend.inner {
I don't think this necessarily happens on the same thread as storing the true value. And if this (the self.closed.load) reads a false, then there is no "synchronizes-with" relationship established.
The store is in close_cursors (which is invoked when dropping the TxnBackend). I don't think I need to synchronize with anything that comes before that store.
Hmm, then the documentation on mdb_txn_begin() is misleading: it's not just that you can only have at most one non-child write transaction in a given thread, you can only have at most one in the whole process! Anyway, LMDB internally uses a lock to halt the thread calling mdb_txn_begin() until the other thread has ended (committed or deleted) its write transaction. In particular, it locks env->me_wmutex before beginning a write transaction, and unlocks env->me_wmutex after ending a write transaction.
Therefore, read-only transactions and child write transactions with the same address are always separated by a synchronizes-with edge as a guarantee of calloc(), and non-child write transactions are always separated by a synchronizes-with edge from unlocking and locking env->me_wmutex.
I don't think this assumption of a synchronizes-with edge always existing would be too risky to make. In reality, the opaque MDB_txn type is just an ordinary struct with fields. And every function that receives an MDB_txn * from the user freely reads from and writes to those fields without synchronization. If there weren't a guaranteed synchronizes-with edge between two transactions with the same address, then this would cause many data races between, e.g., writing to fields on the old transaction and reading from fields on the new transaction. The only (trivial) way to prevent that would be to have a separate mutex in each transaction that protects all of its fields. But that would require locking and unlocking it for every single function, which would hardly make any sense, even for a weird C library.
I should probably explain that solution a bit more thoroughly. It's a bit weird and relies on a disjunctive syllogism, and I'm not quite 100% confident about it myself.
So, let's step back from reorderings, and look at possible scenarios. In thread 1, close_cursors() is called on some &mut TransactionBackend, which we'll call txn1. And in thread 2, assert_txn_backend() is called on some &CursorBackend and &TransactionBackend, which we'll call cursor and txn2. (If the calls were on the same thread, the situation would be easily resolved by the sequenced-before relation.) Finally, let's say that cursor truly belongs to txn1.
First, there is the case where txn1.inner != txn2.inner. Then, txn2 is definitely the wrong transaction. This case is trivially caught by checking it against cursor.inner_txn.
Then, there is the case where txn1.inner == txn2.inner. There are two possibilities: txn2 is the right transaction, or txn2 is a wrong transaction with the same address as txn1.
In the first possibility, txn2 is a valid &TransactionBackend. Therefore, due to the liveness rule, everything in assert_txn_backend() (and its caller) must happen-before txn1 is dropped and close_cursors() is called in thread 1. Therefore, the read from cursor.closed in thread 2 always happens-before the write to cursor.closed in thread 1. This means that the Relaxed read part in thread 2 will correctly receive false.
In the second possibility, txn2 is again a valid (but wrong) &TransactionBackend. Therefore, it must have earlier been constructed using mdb_txn_begin(), and its construction must happen-before assert_txn_backend() is called in thread 2. By the properties of RMW operations, there are only two distinct scenarios: either thread 1's write appears before thread 2's write in the modification order of closed, or thread 2's write appears before thread 1's write. In the first scenario, thread 1 reads false and writes true, then thread 2 reads true and writes true. Since thread 2 reads true, it correctly panics.
Let's look closer at the second scenario, where thread 2 reads false and writes false, then thread 1 reads false and writes true. Again, the call to mdb_txn_begin() to create txn2 must happen-before the call to assert_txn_backend() in thread 2. Since thread 1 reads the false written by thread 2, the Release/Acquire pair makes thread 2's RMW on closed synchronize-with thread 1's RMW on closed. Finally, the call to close_cursors() is sequenced-before the call to mdb_txn_commit() or mdb_txn_abort() on thread 1. Therefore, the beginning of txn2 must happen-before the end of txn1. If we take the assumption that LMDB would not allow such an absurdity, then this scenario is actually impossible, and thread 2 must read true in the only remaining scenario.
(The second scenario generalizes to three or more threads: if any threads calling assert_txn_backend() read and write false before thread 1 reads false and writes true, thread 1 will read the false written by some arbitrary thread and create a synchronizes-with edge with it.)
This is a weaker assumption to make, but an assumption nonetheless. I think the only way not to make this assumption would be to avoid trying to compare transaction pointers altogether.
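For reference, a hypothetical sketch of the RMW-based scheme the above scenarios reason about (the names closed, close_cursors and assert_txn_backend follow our discussion; the concrete operations and orderings are my assumption, not the actual mmtkvdb code):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

struct CursorBackend {
    closed: AtomicBool,
}

impl CursorBackend {
    // Thread 1: an RMW that reads the previous value and writes true.
    fn close_cursors(&self) {
        // If this reads a false written by another thread's RMW below,
        // that RMW synchronizes-with this one (AcqRel on both sides).
        let _was_closed = self.closed.swap(true, Ordering::AcqRel);
    }

    // Thread 2: a "no-op" RMW that writes the current value back, so that a
    // later swap(true, ..) in thread 1 can synchronize-with it.
    fn assert_txn_backend(&self) {
        if self.closed.fetch_or(false, Ordering::AcqRel) {
            panic!("cursor belongs to a transaction that has already been closed");
        }
    }
}
```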
Well, LMDB synchronizes write access by locking the whole database, so there is only one active writer at a time. This is explained in the introduction:
Writes are fully serialized; only one write transaction may be active at a time, which guarantees that writers can never deadlock. The database structure is multi-versioned so readers run with no locks; writers cannot block readers, and readers don't block writers.
This isn't mentioned later in the API docs again.
I don't see a problem with write transactions anyway, as TxnRw is !Send + !Sync and there should be only one EnvRw which will require a &mut self to start a write transaction. EnvRw is (supposed to be) Send + Sync, but I don't think that's a problem because EnvRw::txn_rw works on &mut self.
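For illustration, a rough sketch of that ownership shape (the names follow this discussion; the exact signatures are my assumption, not mmtkvdb's actual API):

```rust
use std::marker::PhantomData;

struct EnvRw {
    // ... environment handle
}

struct TxnRw<'env> {
    env: &'env mut EnvRw,
    // The raw-pointer marker makes the transaction !Send + !Sync.
    _not_send_sync: PhantomData<*mut ()>,
}

impl EnvRw {
    // Taking &mut self means at most one write transaction can be live per
    // EnvRw, even though EnvRw itself may be Send + Sync.
    fn txn_rw(&mut self) -> TxnRw<'_> {
        TxnRw { env: self, _not_send_sync: PhantomData }
    }
}
```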
Okay, I think this convinces me. So given those assumptions, it's sound to use entirely Relaxed ordering for accessing the closed boolean, right? Now I'd have a hard time correctly documenting that, but maybe I'll try to go that way.
I didn't understand the part about the liveness rules yet. But in my words, I'd say naively: If txn2 is the "right" &TxnBackend, then txn1 and txn2 are the same. close_cursors requires &mut TxnBackend, so it can't execute at the same time, i.e. no problem here.
That sounds correct.
That sounds correct to me too. So the only problematic case that remains is the second scenario.
Hm, I'm not sure. I wonder if thread 1 reads the false written by thread 2. But you are right. On the C++ reference page it says: "All modifications to any particular atomic variable occur in a total order that is specific to this one atomic variable." RMW operations are "atomic", so the RMW of thread 2 gets executed before the RMW of thread 1. Thus thread 2's RMW synchronizes-with thread 1's RMW here.
Yes, close_cursors and the subsequent mdb_txn_commit or mdb_txn_abort are on the same thread.
Because
mdb_txn_begin to create txn2 happens-before the call of assert_txn_backend,
the call (not return) of assert_txn_backend is sequenced-before the RMW on closed in thread 2,
the RMW on closed in thread 2 synchronizes-with the RMW on closed in thread 1
the call to close_cursors (and thus the RMW on closed in thread 1) is sequenced-before the mdb_txn_commit/mdb_txn_abort on thread 1 (this happens in the same thread: thread 1),
the following would have to hold:
mdb_txn_begin to create txn2.inner happens-before mdb_txn_commit/mdb_txn_abort of txn1.inner
while txn2.inner and txn1.inner share the same address.
I see how this is somewhat contradictory.
I'm not entirely sure if this is really a weaker assumption compared to assuming that mdb_txn_begin returns a pointer to a struct which requires synchronization when you want to access it. It doesn't feel much weaker in any case.
But if all these assumptions about LMDB do not hold, would you agree that simply replacing the Mutex<bool> with an AtomicBool where close_cursors performs a store of true with SeqCst and where assert_txn_backend performs a load with SeqCst, this would make the synchronization fail?
So would you agree that instead of needing
I would need:
a fence,
a Mutex<bool> or Mutex<()>, or
an atomic with RMW operations (instead of ordinary stores and loads).
Correct. If I were documenting it, I'd say that the only way for the pointer comparison to spuriously succeed is if a later transaction received the same address. For that to sensibly be sound, the end of the earlier transaction must be already synchronized with the beginning of the later transaction, so we don't need to add our own synchronization on top of that. (Effectively, we're using the transaction pointer as our mutex.)
Correct: if they're exactly the same TxnBackend object, then the lifetimes of the &TxnBackend and &mut TxnBackend must be divided by a happens-before somewhere, since neither is a reborrow of the other. And assert_txn_backend() certainly can't come after close_cursors(), so it must come before it.
Well, if that weaker assumption is violated, and the beginning of the new transaction happens-before the end of the old transaction, then the old transaction and the new transaction visibly exist at "the same time" from the user's perspective. That means that the user could force an operation on the new transaction to occur prior to an operation on the old transaction, even with something like an internal mutex. That's what I find absurd about the scenario, at least given the initial premise that transactions are disambiguated only by their pointer, and not, e.g., by the calling thread.
If neither assumption held, then the Mutex<bool> and AtomicBool/SeqCst solutions would both be unsound. Unless I've made a mistake in my reasoning, the Mutex<bool> solution is only sound under the weaker assumption, and the AtomicBool/SeqCst solution is only sound under the stronger assumption.
Also, I looked at fence() for a while, but I think at best it would allow you to replace one of the RMWs with a read or write + a fence() in this scenario. There needs to be either a Release store on the assert_txn_backend() side or an Acquire load on the close_cursors() side for the fence() on the opposite side to "bind" to.
I think fence() is more useful in cases where you make a Relaxed operation that you conditionally want to "upgrade" to an Acquire or Release, which is how Arc's destructor uses it. Just by itself, it doesn't allow us to reverse the polarity of atomic operations.
I'm still not sure if I really understand the difference between the weaker and the stronger assumption:
Are you talking about the weaker or the stronger assumption being violated here? Or both?
Let me try to recite both assumptions:
The stronger assumption is that if mdb_txn_begin returns the same address as one previously released by mdb_txn_commit/abort, then the commit/abort synchronizes-with the mdb_txn_begin (which is the case with free/calloc, for example). Here we can effectively use the transaction pointer as a mutex (allocation Acquires and releasing the memory Releases).
So I assume you are referring to the stronger assumption here.
But what exactly is the weaker assumption? You wrote:
You assume that the beginning of txn2 must not happen-before the end of txn1. Isn't the only way to assure that to make the end of txn1 synchronize-with the beginning of txn2? Which other ways would we have to assure that? If they were on the same thread, they could be sequenced-before/after. But if not, don't we need some sort of synchronization here too? I guess the synchronization could be more indirect, as in: "end of txn1" synchronizes-with X, which is sequenced-before Y, which synchronizes-with Z, which synchronizes-with "beginning of txn2". But I'm not sure how LMDB alone could do that. Maybe it's a problem of terminology here. Let me try to elaborate:
The actual call of mdb_txn_commit/abort will never synchronize-with the actual call of mdb_txn_begin. Instead, the call of mdb_txn_commit/abort is sequenced-before the (supposed) call to free, which is sequenced-before the return of mdb_txn_commit/abort. The call of mdb_txn_begin is sequenced-before the call of malloc, which is sequenced-before the return of mdb_txn_begin. And the free synchronizes-with a later malloc/calloc that returns the same address. So there is already some indirect chain. Thus we should say that mdb_txn_commit/abort happens-before mdb_txn_begin. In other words: "the beginning of txn2 must not happen-before the end of txn1". Which is exactly what your weaker assumption was about.
So my thesis is: both assumptions are effectively the same. But maybe I'm misunderstanding something?
If I'm wrong, perhaps you can give an example of how LMDB could behave such that the weaker assumption is fulfilled but the stronger assumption is violated?
But assuming I'm wrong…
…then this would mean that simply replacing the Mutex<bool> with AtomicBool/SeqCst using stores and loads would be unsound if LMDB fulfilled the weaker assumption but not the stronger assumption, i.e. using a Mutex<bool> (or atomics with the RMW combination) would be justified.
But if I'm right and both assumptions are the same, then (in this case) the Mutex<bool> could be simply replaced with an AtomicBool using Relaxed ordering. Not what I wanted to show though, but curious anyway.
Oh, I misunderstood something about fences. I thought fences already synchronize "by themselves", but apparently that's not true. They always need an atomic (see fence-fence synchronization here). Again, only the write synchronizes-with the read and not vice-versa, so you're probably right. (Maybe you could do separate stores/loads, but I'm really not sure and don't have a full overview of it at the moment.)
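To write it down for myself, a minimal sketch of that fence-to-fence pairing (the fences still need the Relaxed operations on FLAG to pair through):

```rust
use std::sync::atomic::{fence, AtomicBool, AtomicU32, Ordering};

static DATA: AtomicU32 = AtomicU32::new(0);
static FLAG: AtomicBool = AtomicBool::new(false);

fn writer() {
    DATA.store(1, Ordering::Relaxed);
    fence(Ordering::Release);
    FLAG.store(true, Ordering::Relaxed);
}

fn reader() {
    if FLAG.load(Ordering::Relaxed) {
        // Because the Relaxed load read the value stored after the Release
        // fence, the Release fence synchronizes-with this Acquire fence,
        // and the store to DATA is visible below.
        fence(Ordering::Acquire);
        assert_eq!(DATA.load(Ordering::Relaxed), 1);
    }
}
```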
Yeah, that's what I just meant with "only the write synchronizes-with the read and not vice-versa".
That said, I would like to comment once more on making assumptions about what LMDB does:
I could imagine (hypothetically) that mdb_txn_begin returns a fake pointer in case of a write transaction, as there is only one write transaction open at a time anyway (or one deepest-nested write transaction). So from the API docs it's not immediately clear that I can use the pointer address to distinguish/identify transactions. Now one could say this would be surprising, but while attempting to write a SAFETY comment for mdb_strerror, I was tempted to assume (apart from hoping mdb_strerror would be thread-safe, which it is not in general) that the returned pointer would be valid at least as long as I don't call any other LMDB function. But even this (very reasonable) assumption turned out to be wrong, as I pointed out later in that thread, which – I may add – isn't even properly documented! (other than a comment in the source code, which I do not consider sufficient)
During my further process of writing SAFETY comments for mmtkvdb, I found one other mistake and one potential mistake I made. I partly blame this on LMDB's API documentation, though it seems I also wasn't reading all of its documentation carefully enough:
Docs for mdb_dbi_close claim that "Closing a database handle is not necessary, […]", yet mdb_env_close requires that "All […] databases […] must already be closed before calling this function". This is contradictory. I relied on not needing to close database handles in mmtkvdb, but I feel unsafe with it now. Looking into the source, mdb_env_close does seem to cleanup all database handles. I also tested this in practice and didn't notice any memory leak. Yet I feel unsafe with it now because the API specification is contradictory here.
When using mdb_dbi_open, it may happen that it returns a handle which has been already returned by a previous call. If you pass both handles to mdb_dbi_close, behavior is undefined. This is explicitly mentioned in the docs of mdb_dbi_open (but not in mdb_dbi_close): "The old database handle is returned if the database was already open. The handle may only be closed once." This makes the safety documentation of mmtkvdb's (unsafe) create_db/open_db methods erroneous and must be fixed.
So my point is: understanding the LMDB documentation and its behavior may be tricky. In any case, I will go through the process of properly documenting every use of unsafe. I feel like I should not make too many assumptions about LMDB's behavior because I might be surprised in bad ways.
Sure: the stronger assumption requires LMDB to make the mdb_txn_{commit, abort} of the old transaction happen-before the mdb_txn_begin of the new transaction. The weaker assumption only requires LMDB to make the last operation on the old transaction happen-before the first operation on the new transaction, without necessarily making mdb_txn_{commit, abort} happen-before mdb_txn_begin. This would be done, e.g., by using a mutex to protect the transaction object for every operation, except for mdb_txn_{begin, commit, abort} and mdb_cursor_{open, close}, which are carefully written to be unsynchronized with respect to the transaction object.
If the weaker assumption held but the stronger assumption didn't, a normal C user might not notice (unless they tried to compare transaction pointers), but we'd run into trouble with our assert_txn_backend() scheme. Our check for whether or not the transaction pointer is valid is always sequenced-before the actual operation that uses the transaction and cursor. So under the weaker assumption, we may have successfully returned from assert_txn_backend() even with the wrong transaction pointer, since the asserting thread wouldn't necessarily "catch up" with the closing thread until the operation actually occurs.
Personally, I've often had success applying the heuristic that C library authors will never violate an implicit API assumption if violating it would require far more code than satisfying it. In the mdb_strerror() case, delegating to normal strerror() is a single line of code. But in this mdb_txn_{commit, abort}()/mdb_txn_begin() case, violating the strong assumption would require some tricky lock-free algorithms on those functions, and loads of extra internal synchronization on every other function. (And I don't even want to know how wacky a version violating the weak assumption would look.) Thus, if we're already assuming that MDB_txn * pointers are unique, then it doesn't feel like much of a stretch to assume that they act like exclusive &mut pointers, where the end of the old lifetime and the start of the new lifetime are synchronized, so that all other operations can be unsynchronized.
then you meant the first operation on txn2 must happen-before the last operation on txn1?
I think I understand now. (Please let me know if I got it wrong.)
So in practice it actually boils down to whether we can assume that MDB_txn * pointers are unique. While the LMDB API doesn't guarantee it (and in theory, a write transaction could have the same address as an arbitrary read transaction), I understand that it may be reasonable to assume it nonetheless.
Concluding (getting back to the subject of this thread), the Mutex<bool> could – in this particular case – probably be replaced with an AtomicBool using Relaxed loads and stores. However, it requires some extra assumptions to be true regarding LMDB behaving reasonably (which is not explicitly documented in the specification, and which might be similar with many other C APIs).
Replacing the Mutex<bool> with an atomic that uses simple loads and stores with SeqCst still requires these extra assumptions to be true. So using SeqCst makes no sense (here), which also matches what @farnz quoted in the other thread:
Seeing this from the other side: if a Mutex<bool> was necessary in this case, the solution to replace it with atomics is really complex and requires a lot of thinking and reasoning, and it doesn't result in simple stores and loads from an atomic.
In that case, LMDB would allow the user to perform an operation on txn2 that happens-before an operation on txn1, since the txn1 thread sees both transactions as valid simultaneously. (I'm using "beginning" to refer to the mdb_txn_begin() call, and "end" to refer to the mdb_txn_{commit, abort}() call.) If it were able to do this in a way that actually worked, then I don't think the transaction pointers could really be considered unique at all.
That same amount of thinking and reasoning is necessary to show that the Mutex<bool> solution is also sound under the weaker assumption. The argument is nearly the same, we just use the happens-before edge between unlocking it and locking it. In particular, in the first scenario, close_cursors() locks the mutex before assert_txn_backend(), so assert_txn_backend() reads true and correctly panics. In the second scenario, assert_txn_backend() locks the mutex before close_cursors(). Then, the beginning of txn2 happens-before the mutex unlock in assert_txn_backend(), which synchronizes-with the mutex lock in close_cursors(), which happens-before the end of txn1; therefore, the second scenario requires the beginning of txn2 to happen-before the end of txn1, which we are assuming cannot occur. Thus, either txn2 is txn1 and assert_txn_backend() reads false (by the first possibility), or txn2 is not txn1 and assert_txn_backend() reads true (by the first scenario in the second possibility).
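For comparison, a simplified sketch of the Mutex<bool> variant that argument is about (the names follow the discussion, not the actual mmtkvdb code):

```rust
use std::sync::Mutex;

struct CursorBackend {
    closed: Mutex<bool>,
}

impl CursorBackend {
    fn close_cursors(&self) {
        // The unlock of the mutex in one thread synchronizes-with the next
        // lock in the other thread; that's the edge the argument relies on.
        *self.closed.lock().unwrap() = true;
    }

    fn assert_txn_backend(&self) {
        if *self.closed.lock().unwrap() {
            panic!("cursor belongs to a transaction that has already been closed");
        }
    }
}
```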
I suppose that the main reason AtomicBool with SeqCst is generally treated as sufficient is that it's not likely in practice for something to require a Release load or an Acquire store that a Mutex would be able to simulate. In this case, Mutex<bool> would only help over Relaxed operations if the weaker assumption were satisfied and the stronger assumption were violated, which would be a very odd architecture for such an API.
Hmmm, if that would hold in the general case, then it would mean Mutex<bool> is (generally) an indication for considering replacing it with a relaxed atomic, i.e. a potential "anti-pattern". Not sure if the example discussed here is representative, and also not sure if your statement was meant in general or in regard to this particular example.
In this particular case, a Relaxed atomic would work; from what I've seen, everyday Release stores and Acquire loads are generally seen as sufficient for the general case. (I recall hearing something to the effect that spinlocks are one of the only valid use cases of a load synchronizing-with a store.) But I think Mutex<bool> still has a place, if you actually need to lock it for some amount of time while other threads block on it.
I haven't worked with AtomicBool yet, but from what I understand at first glance, it is perfect for implementing spin-locks. On the other hand, Mutex implements a monitor with an access queue that serializes the concurrent threads. So they have different purposes and are not equivalent at all.
I'm not sure about terminology, but I think a spin-lock is some sort of mutex (one which consumes CPU while waiting, though). Both a spin-lock and an ordinary mutex use atomics to synchronize; the latter additionally parks the thread and ensures that it's woken up when the lock becomes available.
A spin lock is a mutex […]. Attempting to lock an already locked mutex will result in busy-looping or spinning: […]. This can waste processor cycles,
And:
[…] a regular mutex […] will put your thread to sleep when the mutex is already locked.
So what makes a mutex a mutex is the way it uses atomics for synchronization. Instead of using an atomic directly to store the value to which you want to have synchronized access to, the atomics are used to store the lock-state (and you store the variable which you want to access elsewhere).
In regard to the example of a Mutex<bool>, this means that a Mutex<bool> consists (at least) of two variables: one used for synchronization (an atomic) and one to store the actual boolean value (e.g. in an UnsafeCell).
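A minimal sketch of that structure (an atomic for the lock state plus an UnsafeCell for the value; not production-ready, e.g. it keeps the lock if the closure panics):

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicBool, Ordering};

pub struct SpinLock<T> {
    locked: AtomicBool,   // lock state only
    value: UnsafeCell<T>, // the actual protected value
}

unsafe impl<T: Send> Sync for SpinLock<T> {}

impl<T> SpinLock<T> {
    pub const fn new(value: T) -> Self {
        Self { locked: AtomicBool::new(false), value: UnsafeCell::new(value) }
    }

    pub fn with<R>(&self, f: impl FnOnce(&mut T) -> R) -> R {
        // Acquire on the successful swap pairs with the Release store below,
        // so the previous holder's writes to the value are visible here.
        while self.locked.swap(true, Ordering::Acquire) {
            std::hint::spin_loop();
        }
        // SAFETY: the flag guarantees exclusive access while the lock is held.
        let result = f(unsafe { &mut *self.value.get() });
        self.locked.store(false, Ordering::Release);
        result
    }
}
```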
Yes, this is exactly the point. A spin-lock is non-blocking while providing mutually exclusive access to its value. Mutex provides a semaphore (https://www.cs.utexas.edu/users/EWD/transcriptions/EWD01xx/EWD123.html), where all threads waiting to access the object the Mutex synchronizes are queued. Interestingly enough, at the hardware level a Mutex is implemented using a spin-lock.
After this long discussion, I decided to actually use the AtomicBool approach (source). In that context, I also invested some effort to add SAFETY comments to all uses of unsafe in mmtkvdb's code. I found it very helpful to enable the unsafe_op_in_unsafe_fn lint.
Adding the SAFETY comments, I found several soundness issues (critical bug fixes in mmtkvdb 0.14.1). It turned out that doing FFI calls in a sound way can be a really hard task. Especially LMDB comes with a lot of preconditions combined with sometimes-automatic releasing of resources. One of the more twisted examples from the LMDB API specification:
A cursor cannot be used when its database handle is closed. Nor when its transaction has ended, except with mdb_cursor_renew(). It can be discarded with mdb_cursor_close(). A cursor in a write-transaction can be closed before its transaction ends, and will otherwise be closed when its transaction ends. A cursor in a read-only transaction must be closed explicitly, before or after its transaction ends. It can be reused with mdb_cursor_renew() before finally closing it.
Note: Earlier documentation said that cursors in every transaction were closed when the transaction committed or aborted.
What I (think I) learned from this discussion and from reviewing all my unsafe uses:
Mutex<bool> gives guarantees beyond those of AtomicBool. Often, the synchronization properties of Mutex<bool> might not be needed. But if you need them and you want to avoid using a Mutex, then you'll have to use more complex patterns with AtomicBool, rather than simply using SeqCst. SeqCst loads and stores most likely will not give you what you need. And remember: SeqCst only gives any advantage over Acquire/Release/AcqRel if there is more than one atomic involved!
Sometimes, synchronization happens implicitly. An example is malloc/free in C. But there may be other cases, e.g. Tokio's Notify, which may result in synchronization (see also this post by me in "A flag type that supports waiting asynchronously"). I think many APIs don't explicitly specify when or whether such synchronization takes place. If one is overly careful about that, one might end up using a lot of unnecessary extra synchronization (as in my case of using Mutex<bool> where it (likely) wasn't necessary).
C API specifications, in particular, are often underspecified. You might need to make certain (reasonable) assumptions that aren't explicitly documented.
Writing sound unsafe code in Rust is harder than I imagined. Explicitly writing down the reason why each and every unsafe block is sound can be helpful, even if it seems "obvious" at a first glance. Using the unsafe_op_in_unsafe_fn lint is a valuable tool as well.
Of course, Mutex<bool> is a safe fallback. If you use that, you don't need to worry about synchronization. You might still end up with deadlocks though.