Is this possible or always UB?

In short: I am wondering whether I can express something in Rust/LLVM or whether it's impossible/always UB. It involves reading memory that may be simultaneously written to by a different thread, where that memory is allocated by Rust.


On the side, I write and learn about high-performance storage engines for fun. I want to implement a technique used in B-tree-based storage engines called "optimistic lock coupling".

The mechanism is relatively simple but almost immediately UB when expressed in Rust, or in LLVM in general in fact, in any way that I can think of. I am not a hugely experienced unsafe user, though, so maybe someone can come up with a solution.

Optimistic-lock-coupling works as follows in this context:

We have a "frame", which (for the sake of this question) is just some struct that holds a reference to a buffer - a chunk of bytes like [u8; 4096] - and an atomic "version" variable (an AtomicU64, let's say).

We want to allow multiple readers (threads) to optimistically read from the frame without modifying any value - no fetch_add or CAS on the version atomic, no taking a lock - because of cache-coherency concerns. In the happy path a reader just goes: load-version -> read-buffer -> load-version, and no cache invalidations are caused.

However, we can and will, at times, have a writer-thread come and write to the page while we're in the middle of reading it. This is why we check the version again at the end, because the writer would increment the version counter before AND after its buffer-write. We WILL incur data races but we will simply discard the read data.

To summarize the read path:

  1. we (reading thread) check the atomic version counter and store the value
  2. we do a read of the page (this involves some SIMD strcmp stuff to find our value)
  3. we check the atomic version counter again, and if it's changed we discard our result
    from step 2 and retry all over again.

The problem:

This is instantly UB because we WILL be reading while someone else is writing, and WILL have a data race. However, putting Rust/LLVM aside momentarily, we are handling this (logically): we just discard that torn/invalid data, and we don't do anything dangerous like dereference it until we have double-checked the version atomic and ensured that our read did not incur a data race.

And so my question: Is there a way to soundly express this in Rust (or LLVM in general) without UB? Or is this genuinely not expressible?


A couple notes here to preempt certain answers:

  1. We could use something like &[AtomicU8] or packed &[AtomicU64], but this will prevent the compiler from emitting SIMD for our strcmp logic. This is unacceptable.
  2. Things like read_volatile still consider this UB, because a) the data at the ptr is not "valid" for reads (it may be getting written to by another thread) and b) the memory exists inside Rust's own allocations (source: read_volatile in std::ptr - Rust). The intent of read_volatile is to reason about memory that may change for reasons outside our program, which is not the case here.
  3. We could write our read logic in asm!(), but that's horrible, and also still maybe UB anyway (I don't know offhand).
  4. This is a real technique used in real systems. Please spare me the "don't design it this way" replies.

Thank you for any insight! I am quite curious about this.

Have you seen the atomic memcpy proposal (github.com/rust-lang/rfcs/pull/3301)? I don't know if that would allow the compiler to use SIMD, but it has your exact use case as motivation.

I have also found a crate (atomic_memcpy - Rust) that tries to implement the C++ standard functions, but I don't know if that actually works / is well-defined in Rust.

See also: What about: seqlocks, load-release/store-acquire? · Issue #323 · rust-lang/unsafe-code-guidelines · GitHub
The crate was also posted there and nobody said it has UB, so it seems pretty good.

3 Likes
// from: https://github.com/m-ou-se/rfcs/blob/atomic-memcpy/text/3301-atomic-memcpy.md#the-rust-solution
pub fn read(&self) -> T {
    loop {
        let s1 = self.seq.load(Acquire);
        let data = read_data(&self.data, Acquire);
        let s2 = self.seq.load(Relaxed);
        if s1 & 1 == 0 && s1 == s2 {
            return unsafe { assume_valid(data) };
        }
    }
}

This RFC is indeed quite applicable here

I'm going to read it more closely later tonight; the one thing I'm not sure about is the SIMD compiler-optimization compatibility. I will update this post as I learn more in case anyone else is curious.

Thanks so much for the reference, this fits my question exactly. I guess the seqlock is a more general class of this that I wasn't familiar with.

My (not-very-expert-at-all) reaction is you cannot do this in Rust without UB.
That said, I expect you can do it and it will work on your hardware.

1 Like

No, because you can have a scenario like this:

  1. thread A updates the atomic to the next version (e.g. 1) and is just about to write
  2. thread B reads the atomic as 1 and starts reading
  3. thread A writes
  4. thread B completes reading and sees the atomic as 1

Now the chances of this might be incredibly close to 0, to the point that it might not happen for millennia, but if you really want no UB you need to consider it.

You need two atomics: one that is updated at the start of the version switch and one that is updated at the end. Reads that start when the two atomics are different must fail immediately. Both the buffer and the read data should be stored in a MaybeUninit so they can't cause UB by having invalid states, and the read data should be considered correctly initialized only once the read ends without the start_version_change having been incremented.

4 Likes

You're correct that that is a race, I omitted a detail from the actual implementation that handles this case: readers/writers will spin if they initially encounter an ODD version counter, which implies a writer has pre-incremented the version but not post-incremented it. In other words: an odd version is always "locked" by a writer.
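Concretely, the "odd version is locked" rule might look something like this. This is a hedged sketch under the assumption that writers lock by CASing the version from even to odd; the function names are mine, not from any real implementation:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// A writer takes the page by CASing the version from even v to odd v + 1;
// anyone who then observes an odd version spins (or backs off) instead of
// proceeding. Returns the new (odd) version on success.
pub fn try_lock_exclusive(version: &AtomicU64) -> Option<u64> {
    let v = version.load(Ordering::Relaxed);
    if v & 1 == 1 {
        return None; // already locked by a writer
    }
    version
        .compare_exchange(v, v + 1, Ordering::Acquire, Ordering::Relaxed)
        .ok()
        .map(|_| v + 1)
}

pub fn unlock_exclusive(version: &AtomicU64) {
    // Post-increment back to even, publishing the writes made under the lock.
    version.fetch_add(1, Ordering::Release);
}
```

Because the version only ever returns to an even value that is strictly larger, a reader that saw an even v1 and later sees the same v1 knows no writer locked the page in between.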

Unfortunately, though, it's UB either way, because even if we correctly "handle" the case where a writer interrupted our read, reading the data while it's being written is UB in the first place (even if we discard it afterwards, the compiler can't reason about this as it stands).

Keep in mind, reading forward, that I'm not an expert in LLVM or the internals of the Rust compiler!

MaybeUninit is mentioned frequently in the RFC, but as far as I understand it's got some problems (aside from just semantically not quite describing the situation accurately). The read/write itself has to be atomic because it must not be reordered before or after the initial/final version reads (or writes). You need to be able to enforce some memory ordering on the middle section, fences might be needed, the compiler can't reorder them, etc. You would need something like AtomicMaybeUninit.

Note that when I say "atomic" here I merely mean that the compiler can reason about multiple threads mutating/reading it at once. The operation itself doesn't have to be "atomic" in the traditional sense of the word. It just has to be non-undefined-behaviour to read it while it's being written to, and it needs certain memory-ordering guarantees and such.

(The following is mostly hearsay that I have a limited understanding of)

The problem is (at least partly) that LLVM doesn't have any intrinsic that actually does this for a lot of the cases we need. For example, LLVM has llvm.memcpy.element.unordered.atomic, but unordered is too weak to be used in this case. It also doesn't solve my problem in particular because I'm not looking to copy anything (for my precise case I just want to do some SIMD byte-slice comparisons to do a binary search).

The more I read/learn about this RFC, the more likely it seems that it doesn't quite solve my problem. The seqlock case is a lot cleaner because often the guarded value will be Copy, meaning you've already copied the value out of the lock by the time you've resolved whether or not the read was torn. My case is a bit messier because by this point we've already "read" and done a bunch of stuff to the data and generated (at least) some derived data from it that we then want to use (such as a page_id: u64 that we found from doing a binary search of the entire page).

If LLVM provided a sufficiently strongly ordered atomic memcpy intrinsic then we could clone the bytes out into a MaybeUninit or MaybeTorn and then check for torn-ness, but obviously it's infeasible to clone an entire 4k/8k/1M database page for every read. We'd need to (compiler-wise) somehow tie this "derived" data we plan to use to the validity of the read.

Also worth mentioning: the current hack out in the wild to solve this does indeed seem to be read_volatile/write_volatile. This is UB strictly speaking (but less likely to actually cause a problem).

I'm a bit out of my depth in all this so I'm sure I've said something inaccurate, and that RFC conversation spans a few years, wanders around a lot, and has some scope-drift, so to sum it up: I'm just not quite sure :slight_smile:

2 Likes

Acquire and Release orderings on the atomics at the endpoints should be sufficient. The orderings don't merely specify how atomic operations can be reordered with respect to other atomic operations, but also with respect to non-atomic operations.

Yeah, this seems like the main problem that AtomicMaybeUninit (or similar) would solve, not memory ordering. Though why is unordered not sufficient? Looking at some documentation, it seems like precisely what we'd want for soundness... though the guarantees of unordered seem too strong and make SIMD optimizations impossible AFAICT: "a load cannot see a value which was never stored" implies "unordered reads and writes have to be small enough that they do not tear"; a bytewise copy (or copying one native word at a time) would be possible, but not a SIMD-optimized atomic memcpy. Presumably it wouldn't be too damaging to the memory model to add "bytewise unordered" SIMD operations to LLVM?

1 Like

Right, my intuition matches that now that I've gotten a good night's sleep. Although we would need a Release fence between the data-read and the second load, because load doesn't (meaningfully) take Release as an ordering. Rust also doesn't define or expose the unordered ordering in the first place, of course - I'm not sure what ramifications this has, though (if any). Maybe you can implement this using LLVM unordered intrinsics but then still give Relaxed (or above) guarantees - even a statement as vague as that, though, is way out of my depth honestly.

And yes, as you say, we have even bigger problems for SIMD or any type that can't even be guaranteed to satisfy the atomicity requirements of unordered. That's kind of a whole other, somewhat orthogonal issue. "Bytewise unordered" for POD types makes enough sense to me... beyond that I don't know. My use case, again, is quite tricky, because I want to genuinely do operations on possibly-undef data (or whatever the correct term is in LLVM), as opposed to just copying (in the Rust trait sense) or memcpying it.

I will keep researching, but I'm far out of my depth once it comes to speculating whether or not we could represent this or that abstract thing in the semantics of LLVM, so it's possible I don't have much more to add.

I guess you can do inline assembly; assembly always does what the assembly says, after all, so you just have to be careful to make the interface with the assembly safe.

If I understand right, you could atomic-memcpy just the parts to perform operations on (and freeze them just after reading if necessary), do the comparison, and only then check the seqlock.
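That suggestion might look roughly like the following sketch, with per-byte relaxed atomic loads standing in for the (not yet available) atomic memcpy. The function name and shape are hypothetical; the point is that the derived comparison result is only returned once the version re-check confirms the read wasn't torn:

```rust
use std::cmp::Ordering as CmpOrdering;
use std::sync::atomic::{fence, AtomicU64, AtomicU8, Ordering};

// Copy out only the candidate key bytes (per-byte relaxed loads stand in
// for a real atomic memcpy), run the comparison, and only trust the
// derived result once the seqlock version is confirmed unchanged.
pub fn compare_key(
    version: &AtomicU64,
    page_key: &[AtomicU8],
    needle: &[u8],
) -> Option<CmpOrdering> {
    let v1 = version.load(Ordering::Acquire);
    if v1 & 1 == 1 {
        return None; // writer active: caller retries
    }
    let copied: Vec<u8> = page_key
        .iter()
        .map(|b| b.load(Ordering::Relaxed))
        .collect();
    // Keep the byte loads from being reordered past the version re-check.
    fence(Ordering::Acquire);
    if version.load(Ordering::Relaxed) != v1 {
        return None; // possibly torn: discard the derived result
    }
    Some(copied.as_slice().cmp(needle))
}
```

This copies only the one key slice per probe rather than the whole page, though as discussed below it still pays for a copy that a direct in-place SIMD comparison would avoid.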

Careful, language developers have strong opinions on that:

RalfJung in How does inline assembly and the physical machine fit into the abstract machine model, or exist outside of it? - #27 by RalfJung - language design - Rust Internals :
The urban myth than inline assembly lets you go down to the concrete machine level and entirely ignore the Abstract Machine is just that, an urban myth.

Which doesn't mean it's necessarily impossible to use assembly, but it may be tricky and/or not well-defined at all. Writing both reads and writes in assembly might help, though, by moving the memory content out of Rust's scope altogether.

2 Likes

I wonder if an empty assembler piece that accepts an undefined value in one register and then returns a "defined" value in the same register is enough to "freeze" the value.

What does the assembler specification say about that? I think it should work, but I'm not sure if that's formally defined anywhere.

1 Like

That's discussed in the thread I quoted (and linked). I can only add I'd not dare that in registers; at least Itanium supports actual undefined values there (known as NaT, "not-a-thing").

1 Like

Thankfully Itanium is dead, thus we may happily ignore that.

No. It discusses a different thing: if you read a MaybeUninit in an assembler block and it's already initialized, then you are still not guaranteed to get anything "predictable" back.

I'm talking about something like this:

let x: MaybeUninit<u64>;
…
let mut y: u64;
unsafe {
    asm!("/* {0} */",
        inout(reg) x => y,
    );
}

Here we very explicitly pass a MaybeUninit to the assembler and read back a non-MaybeUninit. Without doing anything in the assembler, yes, but it's entirely unclear how and why this thing may differ from freeze. And I'm not sure we have a definition of assembler precise enough to explain how the assembler is supposed to work in that case.

1 Like

Indeed. It's the parent topic that has the discussion, and almost exactly the same example as yours:

1 Like

For my particular use case I would ideally not do this, because really I just want to derive some small data (a u64 in this precise case) from some larger operation that involves SIMD reads. Copying and then doing the reads would be slower, of course. Similarly, the example "inner values" in the seqlock RFC were mostly Copy.

Concretely, if I need to lexicographically compare some key I have with some slice of bytes on the (possibly torn) page in question, I'd rather not perform a memcpy of that slice.

Logically, if a (most likely vectorized) memcpy can be done then any other SIMD op can be done (even if currently unimplemented in LLVM), but the difference with the memcpy case is that you then have a new, disjoint version of the data that you can hold (thread-)locally and wrap with MaybeTorn/MaybeUninitialized or whatever. With my case the semantics/compiler reasoning isn't as trivial.

As someone clueless about such matters: with certain strengths of Rust's type system I would think it's at least not obviously impossible, though - it seems at least superficially related to analysing lifetimes (that is: keeping track of whether that derived u64 was the result of a valid read, in the same way that you'd keep track of whether that MaybeUninit data was the result of a valid read or not).

My assumption was that, yes, you can do arbitrary things in assembly, but if you're still violating some compiler/Rust-abstract-machine invariant then you still have the same problem. The compiler/AM can't magically handle whatever weird, quirky stuff you've done in assembly. This is similar to what RalfJung pointed out in the quote you linked. Again, this is according to my very limited understanding... at this point I'm mostly talking to be corrected!

Again, I am way out of my depth when it comes to this internal stuff, which is why I'm grateful for the discussion and all the links provided here. I'll be off to read the "Rust needs a safe abstraction over uninitialized memory" thread now in any case; just wanted to give a quick reply.

1 Like