Help understanding the memory ordering in `std::sync::OnceLock`

Greetings

I'm studying OS development and inevitably ran into memory ordering. After getting some idea of what the orderings are meant for, I came across a place where I need to implement something just like Rust's `std::sync::OnceLock` in C (with a simple busy loop while waiting, i.e. without a futex: it's only a small section during boot), so I decided to make sure I understand how Rust did it first.

But then I have some trouble figuring this out: why does the CAS success case need Acquire?

AFAIK, Acquire and Release come in pairs; on their own they don't serve much purpose (except when dealing with interrupt handlers), since reordering, or rather the lack of it, is invisible to a single thread of execution. So the Acquire here is probably there to synchronize with the `CompletionGuard`, since that's the only place a Release ordering is used.

The `impl<'a> Drop for CompletionGuard<'a>` sets `Once::state_and_queued` to either COMPLETE or POISONED, and in this particular case in `Once::call` it's the latter, POISONED, that matters, since we're discussing the CAS success case.

Together with the fact that the modification order of `Once::state_and_queued` consists only of RMW/CAS operations, which form release sequences, I suspect the Acquire here is to make sure that the artifacts of all the closures run in the past (i.e. those `impl FnOnce`s) are visible to the current invocation of `Once::call` with the `ignore_poisoning` parameter set to true: when the user wants to clean up the mess because a historical invocation panicked, we'd better make sure this latest invocation of `Once::call` sees everything that has happened.

Does such a statement hold any water? And if the statement is fine in itself, is it the full story, or am I missing something?

If you are really interested in this stuff, in how it works in general and how it can be used from Rust, then https://www.youtube.com/watch?v=rMGWeSjctlY&t=3469s should be a good starting point. I watched it just a couple of days ago and really liked it.

I'm no expert in this, but that statement seems a bit inaccurate. Isn't the whole point of this implementation to run things in parallel? As I understand it, reordering can also happen in other threads.

BTW: Did you see this: rust/library/std/src/sys/sync/once/queue.rs at 40dacd50b7074783db748d73925ac5c3693a7ec1 · rust-lang/rust · GitHub
It seems to address at least some of your points?

The channel is without doubt legendary, thx.

My wording was probably a bit off; I was trying to explain my reasoning. Acquire and Release, among other things, prevent compiler/hardware reorderings, but those effects are irrelevant when there is only one thread of execution (except when dealing with interrupts). So the purpose of the Acquire in `Once::call` must be to establish a synchronizes-with/happens-before relation with some Release in a concurrent setting, and in that module the only place we have a Release is in the `impl Drop` of `CompletionGuard`, so I assumed that's the main reason the Acquire is required here.

In general I think I've understood the basic ideas of how memory orderings work; I just want to pin down the exact reason behind the Acquire ordering here in `Once::call`. If we were only making sure that exactly one thread gets to execute the `impl FnOnce`, Relaxed should suffice here, since the synchronizes-with and happens-before relations are already guaranteed by the Acquire load that observes COMPLETE. So it's got to be related to `CompletionGuard` and the POISONED case, no?

The comments in queue.rs are indeed relevant and interesting. Thanks for pointing that out.

Here are some other resources I find helpful along the journey:

sabrinajewson, in which the author draws some Rust (C++11) abstract-machine execution graphs, which helped me a lot and prepared me for more abstract texts like cppreference

preshing, whose blog is also legendary, hosting a series of posts digging into memory ordering and how C++11 changed many things

EuroLLVM 2017, in which Viktor Vafeiadis explained why OOTA (out-of-thin-air) values are such a headache: they mean the C++11 memory model does not satisfy the DRF property proposed by Adve and Hill, which captures programmers' intuition quite well: if the software exhibits no data race on an interleaving-semantics machine, the hardware implementation should yield only executions that agree with those produced on such a machine.

Oh and marabos, which is also legendary. The logic and optimizations behind Arc/Weak are so cool.


Relaxed would only cover "has started to execute", not that the closure's effects are visible. The functionality must meet the documentation: `call_once` guarantees that when it returns, the initialization has run to completion and its memory writes can be observed by the calling thread.