What types have all valid bit patterns

Let's define a property POD (plain old data), meaning that the data is a contiguous block in memory and all bit-patterns of the data are valid.

  • u8, i8, u16, i16... u128, i128 have this property.
  • f32/f64 do I think, but I'm not 100% sure.
  • raw pointers (*mut/*const) I don't really have a clue. My intuition says no, since although they are conceptually unsigned ints, is it guaranteed that they are in that format? Maybe we have to treat these as opaque.
  • structs - I feel like these should be if all their fields are. Even if there is padding, you don't care what's in it anyway.
  • enums - These can't be unless you use a u8 discriminant and have exactly 256 variants, for example, so best just assume they are not.
  • zero-sized types and ! - I'd say these are since they have 1, 0 representations respectively, but it's probably academic.
  • other pointers - I assume these aren't since raw pointers aren't.
  • unions - Yes if all variants have this property
  • any other types I haven't thought of (I'm just avoiding trait objects since they can't be)
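
To make the enum bullet concrete, here is a minimal sketch of a checked conversion from a byte. The `Color` enum and its `TryFrom` impl are hypothetical names invented for illustration; the point is only that most `u8` values are not valid for such a type, so they have to be rejected rather than reinterpreted.

```rust
use std::convert::TryFrom;

// Hypothetical enum for illustration: only the bytes 0, 1 and 2 are valid values.
#[repr(u8)]
#[derive(Debug, PartialEq, Clone, Copy)]
enum Color {
    Red = 0,
    Green = 1,
    Blue = 2,
}

impl TryFrom<u8> for Color {
    type Error = u8;

    // Checked conversion: bytes that are not a declared discriminant are
    // returned as an error instead of being transmuted into the enum.
    fn try_from(byte: u8) -> Result<Self, Self::Error> {
        match byte {
            0 => Ok(Color::Red),
            1 => Ok(Color::Green),
            2 => Ok(Color::Blue),
            other => Err(other),
        }
    }
}

fn main() {
    println!("{:?}", Color::try_from(1u8));
    println!("{:?}", Color::try_from(200u8));
}
```

Going the other way (`Color::Blue as u8`) is always fine, since every valid enum value maps to a valid byte.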

I want to understand this property since it is important for writing unsafe code, and understanding the behavior of unions. Also, it seems that the compiler can assume that invalid bit patterns never exist - meaning that you can get weird UB (that might randomly appear in a later llvm) if you do things like take references to it, even if those references are never read. Is this true? I don't feel like I really understand.

Raw pointers don't have to point to a valid type instance and can be null, so I would imagine that any bit pattern works for them. References can't be null and have to point to a valid type instance, but I don't imagine there's any constraint on their bit pattern other than != 0.
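
One way to observe the "references are never null" rule is through type sizes. This is a small sketch, with helper names made up for illustration: `Option<&T>` is documented to be the same size as `&T`, because `None` can reuse the null bit pattern. The raw pointer comparison reflects what current compilers do rather than something I'd claim is formally guaranteed.

```rust
use std::mem::size_of;

// Hypothetical helper: true if Option<&u8> costs no extra space over &u8.
fn option_ref_is_free() -> bool {
    size_of::<Option<&u8>>() == size_of::<&u8>()
}

// Hypothetical helper: true if Option<*const u8> needs extra space, i.e. the
// raw pointer has no spare bit pattern that could stand in for None.
fn option_raw_ptr_is_bigger() -> bool {
    size_of::<Option<*const u8>>() > size_of::<*const u8>()
}

fn main() {
    println!("Option<&u8> same size as &u8: {}", option_ref_is_free());
    println!("Option<*const u8> bigger than *const u8: {}", option_raw_ptr_is_bigger());
}
```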

I guess what you're asking about is converting between data types. The nomicon covers the topic in some good detail:

Interestingly, it depends on the target architecture! To quote from the wikipedia article on x86-64:

Although virtual addresses are 64 bits wide in 64-bit mode, current implementations (and all chips known to be in the planning stages) do not allow the entire virtual address space of 2^64 bytes (16 EB) to be used. ... in the first implementations of the architecture, only the least significant 48 bits of a virtual address would actually be used in address translation (page table lookup).

Canonical form addresses run from 0 through 00007FFF'FFFFFFFF, and from FFFF8000'00000000 through FFFFFFFF'FFFFFFFF

This means that there is a range of values for which a reference is known to be invalid on x86-64 (at least in current implementations).

Yeah, I thought about bringing up the x64 thing but that's more of a side-effect of the way the architecture decides to treat pointers, isn't it? I imagine Rust itself would be perfectly happy to handle pointers and references with those bit patterns, but it so happens that the architecture will never give it any to play with (modulo pointer tagging shenanigans I guess).

One could also imagine a compiler feature that has knowledge of this particular architecture implementation detail, and protects against compiling code that violates it. But thinking about it, I'm not sure it would be any more useful than the current unsafe keyword (which is required for dereferencing raw pointers). After all, it should not be possible to create references to the non-canonical address range in safe code.

Indeed. References already have to point to valid objects on pain of UB, so a valid reference is only ever going to point to something within whatever address space the arch gives you.

Also I don't think that bit pattern alone can tell you whether a type is a POD or not. i.e. add a Drop impl to a newtype wrapper around an integer, and now there's more meaning behind the type than just its bits.

Don't references also have to be aligned? At least, this would be implied by pointing to a valid instance. That means some of the LSBs must be 0 if the alignment is greater than 1.

Ah yeah, that would probably be true too.

  • f32/f64: yes, everything is valid, just a great deal of them are NAN
  • It's safe and reversible to cast a usize to any (thin) raw pointer type, so yes, raw pointers can have any bit pattern
  • As others have said, references need to be non-null, aligned, and point to a valid object of the type. (Aligned needs to be mentioned separately because of ZSTs, where anywhere is a valid object in the "it can be read by ptr::read_unaligned" sense.)
  • For structs, what you say is only strictly true for repr(C). repr(Rust) (the default) is technically allowed to include arbitrary, important extra information should the compiler deem it necessary. (Not that that actually happens today in any situation of which I'm aware.)
  • unions: I suspect the answer isn't actually finalized yet. It depends on whether the rules end up defining the semantics in terms of which variant was last assigned, since just splatting bits in wouldn't set any of the variants as active (in an official semantics sense, obviously not in a "something tracked in release code in memory" sense).
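
The first two bullets can be checked with safe code only. A minimal sketch follows; `addr_round_trip` is a helper name invented for this example, and the pointer it creates is never dereferenced.

```rust
// Hypothetical helper: round-trips an arbitrary address through a thin raw
// pointer. Creating the pointer is safe; only dereferencing would need unsafe.
fn addr_round_trip(addr: usize) -> usize {
    let ptr = addr as *const u32;
    ptr as usize
}

fn main() {
    // Any u32 bit pattern can be turned into an f32 with the safe from_bits.
    let one = f32::from_bits(0x3f80_0000); // bit pattern of 1.0
    let nan = f32::from_bits(0x7fc0_0000); // one of the many NaN patterns
    println!("{} is_nan={}", one, nan.is_nan());

    // An arbitrary, almost certainly invalid address survives the round trip.
    println!("{:#x}", addr_round_trip(0xdead_beef));
}
```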

Partly, but also partly because I thought it was theoretically possible for optimization passes to make weird UB happen when you violate things like this. So what happens if I do:

let val: u8 = mem::uninitialized();
println!("{}", val);

Because every possible bit of data is valid, is this not UB? Or is it still UB because the compiler assumes val is never assigned to and optimizes it away?

Pretty sure it's still UB. Try running this program in both debug mode and release mode and you'll see some interesting behavior:

Yup that’s UB. You’ll find the discussion in this recent thread relevant: How to allocate huge byte array safely - #42 by scottmcm
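
For comparison, here is a sketch of the defined-behavior version using `std::mem::MaybeUninit`, which newer Rust versions provide as the replacement for `mem::uninitialized`. The function name `init_byte` and the value 42 are made up for illustration. The key difference is that the value is written before it is read, so `assume_init` is only called on initialized memory.

```rust
use std::mem::MaybeUninit;

// Hypothetical helper: reserves space for a u8, initializes it, then reads it.
fn init_byte() -> u8 {
    // Uninitialized storage; no u8 value exists yet, so nothing may read it.
    let mut slot = MaybeUninit::<u8>::uninit();

    // Writing through MaybeUninit is safe and makes the memory initialized.
    slot.write(42);

    // SAFETY: `slot` was initialized by the write above.
    unsafe { slot.assume_init() }
}

fn main() {
    println!("{}", init_byte());
}
```

Calling `assume_init` without the preceding write would be the same UB as the snippet in the question, even though every initialized `u8` bit pattern is valid.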

Nope.

What are you trying to show? u8 has alignment 1 -- but try u16 and you'll see their LSB=0.

(Bringing unsafe to this kind of question is shaky, but you could just debug-print their pointers instead of using transmute.)

AFAIK, references must be aligned. Nomicon lists unaligned ptr read/writes as UB, so that would certainly carry over to references.

A “future” rustc version may decide to get clever and store data in the alignment bits.

I see, sorry I misunderstood your point about alignment. I was showing that a reference to a struct member is unaligned WRT the struct itself, but that's kind of pointless. We're aligned now (pun intended).

FWIW, the struct itself also only has 1-byte alignment. Aggregates are aligned to the maximum alignment of their members, unless you force it larger with #[repr(align(N))].
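
The alignment rules mentioned here can be checked with `std::mem::align_of`. This is a small sketch with made-up struct names; I've used `#[repr(C)]` on the first two so that the layout rules are the documented ones rather than whatever the default representation happens to pick.

```rust
use std::mem::align_of;

// Hypothetical structs for illustration.
#[allow(dead_code)]
#[repr(C)]
struct Bytes {
    a: u8,
    b: u8,
}

#[allow(dead_code)]
#[repr(C)]
struct Mixed {
    a: u8,
    b: u32,
}

// Alignment forced larger than any member would require.
#[allow(dead_code)]
#[repr(align(16))]
struct Forced {
    a: u8,
}

// Hypothetical helper: the address of a live u16, as an integer.
fn u16_ref_addr(value: &u16) -> usize {
    value as *const u16 as usize
}

fn main() {
    println!("Bytes:  {}", align_of::<Bytes>());
    println!("Mixed:  {}", align_of::<Mixed>());
    println!("Forced: {}", align_of::<Forced>());

    // A reference to a u16 is always aligned for u16.
    let x: u16 = 7;
    println!("aligned: {}", u16_ref_addr(&x) % align_of::<u16>() == 0);
}
```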

That makes perfect sense. 👍

It's UB, but would it cause a segfault?

By definition, “anything” can happen. Curious why you’re asking about segfault specifically?