I am looking for a way to perform what I thought was a simple operation: read an arbitrarily typed value from "volatile" memory (think a shared buffer between processors) and write it to a provided local buffer. Naively:
This works OK for small-ish types (as long as the whole operation fits into general-purpose registers). However, when T grows beyond a certain size, this function starts to consume stack space roughly proportional to the size of T. In addition, it seems never to be implemented as a copy loop, but instead as a long list of loads and stores (at least at the "production-level" optimization levels 3, "z", and "s"). Here is a demonstration: Compiler Explorer
Obviously this operation can be done more efficiently: load a word from the source into a register, then write it from the register to the destination, without involving the intermediate stack space. It could also be done in a loop and unrolled according to the general unrolling policy of the opt level.
Is this a missed optimization bug or something intentional? How would one implement this operation (in stable Rust) without the extra overhead? The only way I can think of is to manually reinterpret T as an array of words and copy them in a loop. But this doesn't sound like a proper solution.
Obviously this operation can be done more efficiently: load a word from the source into a register, then write it from the register to the destination, without involving the intermediate stack space. It could also be done in a loop and unrolled according to the general unrolling policy of the opt level.
Is this a missed optimization bug or something intentional?
I would guess that the restricted nature of read_volatile inhibits optimizations here.
This is an entirely reasonable solution. bytemuck can help you statically assert the operation is sound with respect to the type T.
However, your code is incorrect in two ways unrelated to your question, because it is using techniques not appropriate for shared mutable memory.
You cannot use copy_from: &T, because having an &T asserts that the T is not being mutated (unless T itself contains an UnsafeCell). You must use a raw pointer instead of a reference.
… Other than that, all the usual rules for memory accesses apply (including provenance). In particular, just like in C, whether an operation is volatile has no bearing whatsoever on questions involving concurrent accesses from multiple threads. Volatile accesses behave exactly like non-atomic accesses in that regard.
Thank you, I guess I'll go ahead and reimplement it as suggested...
I would guess that the restricted nature of read_volatile inhibits optimizations here.
Makes sense that it would inhibit some optimizations here, like actual access elimination and reordering. But as far as I can tell, the volatile semantics can be preserved while optimizing the non-volatile part of the operation.
You cannot use copy_from: &T, because having an &T asserts that the T is not being mutated (unless T itself contains an UnsafeCell). You must use a raw pointer instead of a reference.
But it isn't being mutated in this function? Or do you mean that it could be mutated due to its nature of being volatile (e.g. by the other processor, or as a side effect of the access)?
volatile is not for shared mutable memory... You must use atomic access, not volatile access.
I understand it is the case if there is a truly concurrent access. But in my case it is synchronized by other means, so my only concern is that the data being read is actually the one that was written and not something that the optimizer believes sits in there.
Just to clarify, I think even &UnsafeCell<T> and &[UnsafeCell<T>] would have problems for this use-case. We would need something like VolatileCell with language support.
Outside of UnsafeCell, it is UB for the memory pointed to by an &T to be mutated while that &T is in use.
In that case, having an &T in this function is fine and you do not need volatile access. This case is just like Mutex and RefCell: as long as the memory is not being written to, you can make and use an &T to it. You just have to make sure that no &T to it exists while it is being modified.
It doesn’t make sense to use read_volatile with &T. By using &T as a parameter, you are telling the optimizer that the data isn’t changing and can be read at any time for as long as the function body is executing, so the properties of read_volatile inside of the scope of the &T are moot.
It doesn’t make sense to use read_volatile with &T. By using &T as a parameter, you are telling the optimizer that the data isn’t changing and can be read at any time for as long as the function body is executing, so the properties of read_volatile inside of the scope of the &T are moot.
I am not sure I follow. As far as I understand, having copy_from: &T tells the compiler that the function will not use copy_from for writing. Moreover, the specific synchronization mechanism might also guarantee that the underlying memory remains constant during the execution of the function. However, if we remove the volatile semantics, the following two seemingly identical invocations:
could very well be optimized into a single one, since with "regular" memory the post-conditions are identical. However, if src is being changed by the external processor between these invocations, this assumption would be invalid.
No, the existence of a shared reference is a much stronger condition than that: it tells the compiler that the pointed-to memory will not be written for the duration of the function call that accepts the reference. Period. It must not be written at all, regardless of whether Rust code is responsible for the write.
A function with a shared reference in its signature is not just promising it will not write to the pointed-to memory, but also requiring of the caller a promise that nothing else will.
Regardless of whether or not you use volatile reads, this code is invalid if any such writes happen. You must not have a shared reference during any period where *src might be written to. You need to drop the shared reference and stick to using raw pointers (or possibly UnsafeCells as in Mutex) during those times.
No, it couldn't. Normally large copies are delegated to a memcpy call, and that wouldn't work with volatile, for obvious reasons. The compiler doesn't include any special optimizations for large copies because memcpy is supposed to have them already.
And it also tells the compiler that no one else will change it behind the compiler's back.
Precisely. Exactly what rules pointers have to follow is still under active discussion, but pointers exist specifically to relax these restrictions… the only question is how much they relax them.
More information is in that blog post, but the core idea is this: it's not possible to satisfy everyone simultaneously, so there are a few competing proposals, and the less you ask from the memory model, the more likely it is that your program won't need to be changed later. In particular, the guarantees that references provide to safe Rust code are the most onerous and the least likely to ever be broken.
This needs to be atomics. Volatile is for interacting with non-main memory.
This concern is incompatible with your assertion that it is synchronized by other means. If the synchronization were done properly, the CPU couldn't read wrong data, because it's synchronized. If the data is being concurrently written, then you haven't synchronized with those writes.
Look into atomic ring buffer libraries for an example of how to do properly synchronized IPC.
This isn't main memory; it is a dedicated shared memory disjoint from the firmware memory map. As for the synchronization mechanism, consider (for simplicity) a register that is written by the "producer" processor once it is done populating the shared buffer, and cleared by the "consumer" once it is done reading it. Rust- or ISA-level atomics won't really help here, because they only promise atomic access with respect to threads running on the same processor.
From the description it doesn't look like you need atomics or volatile here at all. You would just need a fence there.
The "producer" processor couldn't write into a register of the "consumer" processor, so your mechanism would need to provide some way of ensuring a "happens-before" relationship between them. Most likely atomics.