Firstly, I'm wondering why unaligned references and pointers are UB. One reason I'm aware of is that some hardware may not support unaligned reads and writes. Are there others? Also, are there any cases where references differ from pointers in terms of what's UB and what is not? Among the hardware currently supported by Rust, which targets do not allow unaligned reads or writes?
That page has a few mentions of "unaligned", but it doesn't say what alignment means. My guess is it's "hardware word size aligned", but even then the motivation is not explained. I don't know much about hardware, so it would be good to have some pointers to learn more about this.
Secondly, I'm curious about uninitialized integers. All 32-bit bit patterns are valid for 32-bit signed or unsigned numbers, so I'd expect uninitialized 32-bit memory to be a valid i32 or u32. Why is this not the case? Is this to keep the representation of numbers abstract? (i.e. it's not guaranteed that they'll be represented as 32 bits, and some bit patterns may become invalid in the future)
If something has an alignment of, say, 8, that means that all pointers to that type must point at a memory address that is a multiple of 8. This is necessary because many assembly instructions only work with aligned pointers.
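As a small sketch of what that means in Rust (exact alignment values are target-dependent; the numbers in the comments hold on typical 64-bit targets):

use std::mem::align_of;

fn main() {
    println!("{}", align_of::<u32>()); // 4: a &u32 must point at a multiple of 4
    println!("{}", align_of::<u64>()); // 8 on typical 64-bit targets

    // If a u32 sits at an odd byte offset, dereferencing a *const u32 to it
    // would be UB; ptr::read_unaligned does a byte-wise load instead.
    let buf = [0u8, 1, 2, 3, 4];
    let p = buf[1..].as_ptr() as *const u32;
    let v = unsafe { p.read_unaligned() }; // fine even if `p` is misaligned
    println!("{v:#010x}");
}

read_unaligned is the escape hatch for data that genuinely sits at an unaligned address, e.g. when parsing a byte buffer.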
Regarding uninitialized memory, please check out this article.
No, it's because reading uninitialized memory is always UB. It doesn't matter that it is an "integer". Uninitialized means no value there, and not "a random bit pattern that happens to be valid".
My guess is it's "hardware word size aligned"
No, it is not always word size. Every type has its own alignment, which might be any power of two >= 1. In practice, it usually stops at 16 or so, but you can make types of greater alignment using the #[repr(align(N))] attribute.
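For instance (a minimal sketch; the type name is made up):

use std::mem::align_of;

#[repr(align(64))]             // raise the alignment requirement to 64
struct CacheAligned([u8; 64]);

fn main() {
    assert_eq!(align_of::<u8>(), 1); // alignment is per-type, not word size
    assert_eq!(align_of::<u16>(), 2);
    assert_eq!(align_of::<CacheAligned>(), 64);
}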
I wasn't aware of this, thanks. In case anyone's wondering, this page mentions that integer types on SPARC have different alignment restrictions based on their size.
Uninitialized means no value there, and not "a random bit pattern that happens to be valid".
I didn't say "random bit pattern that happens to be valid", I said "random bit pattern, and I know that every bit pattern is valid for the type".
I'm still curious about the details. What is the problem with assuming "random bit pattern" here? What are the problems, from a language design or compiler implementation perspective, with that assumption?
The problem is that the optimizer has much greater flexibility if it can assume that… well, you are not violating the rules. If you ask the compiler to make sense of nonsensical code, it will either fail and make your code misbehave (because optimizations rely on the fact that you follow the rules), or it will have to be really conservative in all of its assumptions all the time, which practically means no optimizations, but at least the buggy code "works".
The trade-off is pretty clear in the decisions of most modern systems languages. Inhibiting automatic optimizations is a huge cost, while it's not at all hard to avoid reading uninitialized memory. Thus, languages including Rust choose to optimize based on the assumption of correctness, instead of trying to make incorrect code work somehow at the cost of optimizations.
Basically there are some optimizations that it would be nice for the compiler to be able to perform, but those optimizations would be incorrect if uninitialized memory didn't behave as it does.
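As a hedged sketch of the kind of flexibility involved (the function is my own illustration, not from this thread): because assume_init on never-written memory is UB, the optimizer may assume the unwritten path is simply never taken.

use std::mem::MaybeUninit;

// Sketch: the compiler may assume `assume_init` is only reached with memory
// that was actually written, so it may treat `init == false` as impossible
// here and fold the whole function to `7`, instead of materializing
// "whatever random bits happened to be in that stack slot".
unsafe fn make_i32(init: bool) -> i32 {
    let mut x = MaybeUninit::<i32>::uninit();
    if init {
        x.write(7);
    }
    unsafe { x.assume_init() } // UB whenever `init == false`
}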
I can think of a couple of cases of "random bits":
1) When you first apply power to your computer, your processor starts in some well-defined state, the "reset state". This is forced by the hardware circuitry and logic design of the processor. For example, all registers are set to zero, the Program Counter is set to some initial address at which it should start running code, etc.
However, you also have gigabytes of memory. Those memory cells do not have reset circuitry. They have to come up in some state: all zeros, all ones, or some unknown mess of zeros and ones. In fact, after a brief power interruption the memory may well contain exactly what it contained prior to the interruption, while the processor has been reset to start over.
None of this is exactly "random bits" but it is unknown.
2) When local variables are used in a function we expect them to be on the stack. Well, the stack is used and reused as functions are entered and left. It could well happen that an uninitialised local variable contains some value of some other local variable from some other function executed previously. Again, not exactly random, but garbage anyway.
2a) When a program makes a heap memory allocation it could well be given memory that has previously been written by some other part of the program.
The question in my mind then is: how does one read such uninitialised memory in Rust? After all, it does exist physically. It does contain some value. Perhaps we want to know what it is.
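For reference, the sanctioned tool in current Rust is std::mem::MaybeUninit: you may hold uninitialized memory, but you may only claim it is initialized after actually writing it. A minimal sketch:

use std::mem::MaybeUninit;

fn main() {
    let x: MaybeUninit<i32> = MaybeUninit::uninit();
    // let v = unsafe { x.assume_init() }; // UB: no value was ever written
    let _ = x;

    let mut y = MaybeUninit::<i32>::uninit();
    y.write(42);                        // now the memory does hold a value
    let v = unsafe { y.assume_init() }; // OK: the claim is actually true
    println!("{v}");
}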
I do not understand this statement. If one reads a value but does not use it, isn't that the same as never reading it?
Presumably the compiler can see that the value is never used and optimise away the read, as it makes no difference to what happens. And then, presumably, the compiler can entirely remove the uninitialised variable; after all, it is there for nothing.
A common assumption is that multiple normal reads (so excluding volatile, atomic and the like) from the same memory address will return the same value unless the current thread has written to it in the meantime. This, however, is not valid for uninitialized memory: the OS can actually change the value of an uninitialized page under your feet if you never wrote to it. See, for example, (the quote in) this comment: "What The Hardware Does" is not What Your Program Does: Uninitialized Memory - #27 by Amanieu - Rust Internals
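As an illustration (the function name is hypothetical):

// Sketch of the "repeated reads agree" assumption. For ordinary initialized
// memory the optimizer may fold the comparison to `true`; for a page the OS
// may reclaim underneath you (e.g. untouched MADV_FREE memory), the two
// loads could genuinely disagree, which is one reason the language declares
// the uninitialized read UB instead of defining a result.
unsafe fn same_twice(p: *const u8) -> bool {
    unsafe { *p == *p }
}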
Wow, that thread is wild, thanks for sharing it. Never knew that there are 8 states at the hardware level for a single bit, and you can get different values when reading pages freed by MADV_FREE.
I think the main takeaway from these threads for me is that undefined behavior may originate from different layers of the execution stack, not just the hardware. Examples:
Hardware, e.g. reading through an unaligned pointer on ARM.
OS, e.g. reading memory freed with madvise(addr, length, MADV_FREE)
Secondly, I'm curious about uninitialized integers. All 32-bit bit patterns are valid for 32-bit signed or unsigned numbers, so I'd expect uninitialized 32-bit memory to be a valid i32 or u32. Why is this not the case? Is this to keep the representation of numbers abstract? (i.e. it's not guaranteed that they'll be represented as 32 bits, and some bit patterns may become invalid in the future)
Even though behavior considered undefined says uninitialized ints and floats are UB, it seems like this is still being discussed in Validity of integers and floating point · Issue #71 · rust-lang/unsafe-code-guidelines · GitHub, so I'm guessing it may change in the future. The second comment in the thread gives some arguments for and against allowing uninitialized ints and floats. Reading that thread, my understanding is that there aren't any hardware-level concerns here; it's about language and compiler design.
Regarding the discussion in #71, it's important to note that this would be UB either way:
let i: i32 = unsafe { std::mem::MaybeUninit::uninit().assume_init() }; // "creation" of `i`
if i < 10 {                 // "use" of `i`
    println!("i is small");
}
The question being discussed there is whether the UB happens at the creation of i, or at the use of i in the if.
This is important because if you transmute a (u8, u16) to an i32 and back, then, since (u8, u16) has a padding byte, the i32 would contain some uninitialized bits. If the UB happens at the creation of i, then this transmute roundtrip is UB; but it isn't UB if it only happens at the if.
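A sketch of the roundtrip in question (variable names are mine):

use std::mem::transmute;

fn main() {
    // (u8, u16) is 4 bytes (one of them padding); i32 is also 4 bytes.
    let pair: (u8, u16) = (1, 2);
    // The padding byte is uninitialized, so `n` carries uninitialized bits.
    let n: i32 = unsafe { transmute(pair) };       // UB here, if "at creation"...
    let back: (u8, u16) = unsafe { transmute(n) }; // ...or fine, if UB is "at use"
    println!("{:?}", back); // the non-padding bytes round-trip as (1, 2)
}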
When I explain why you shouldn't do that in C, I just show that be || ~be evaluates to false in clang when uninit is involved.
But for some reason this code doesn't behave that way in Rust, even though the same LLVM is used as the backend.
I wonder why.
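For reference, a minimal sketch of what that translation might look like in Rust; since an uninitialized bool is invalid the moment it is created, the whole program is UB and any observed output is unreliable:

use std::mem::MaybeUninit;

fn main() {
    // UB: `bool` must be 0 or 1, so even creating this value is undefined.
    let be: bool = unsafe { MaybeUninit::uninit().assume_init() };
    // Under the "frozen random bits" intuition this must print `true`;
    // because the program is UB, the optimizer may in fact print anything.
    println!("{}", be || !be);
}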