Lifetimes of references exposed via FFI: is this UB?

UB is about allowing the compiler and optimizer to assume a condition never happens. They can use this assumption in many different ways in many different places, not necessarily in a direct and obvious way, and can start taking advantage of it in the future even if they don't use it yet.

There will be places where this assumption is irrelevant in practice, but UB is forbidden in principle. Letting users reason about whether UB actually generates invalid code in some particular scenario defeats the purpose of UB: compilers can assume it never happens at all, not that it sometimes happens and that's OK.

This is an old controversial topic. It has been discussed a million times. This has been a thing in C and C++ for decades, and it's not going away. Please check out other discussions about it.

This is indeed the point I was trying to get to.

@H2CO3's point above was that even re-interpreting the raw pointer as a &MyThing implies a read of the invalid value, and thus an invalid value has been "produced" in the sense of the list here.

If that is true, we have already hit UB at this point, but let's assume for a moment that it isn't, so that we can get to dropping.

Conceptually, dropping involves calling destructors recursively on each field of the thing we're dropping. If we're dropping a field f of type &T, does that imply a read of the value of f (the pointer), in which case this is definitely UB, or is dropping a &T defined as a no-op (since we don't own the referent, we won't recursively drop it)?

Let me also clarify that this is purely for curiosity at this point (I have already added references on the other side of this FFI to make sure things are dropped in the correct order). As a programmer writing FFI code, it's on me to ensure I'm not breaking invariants on which the compiler relies. Plus, I find it interesting... :slight_smile:

Dropping a dangling &T is well-defined behavior, as weird as that may seem. Observing that the &T is in fact dangling is still undefined behavior, though. Take a look at the drop check and PhantomData sections of the Rustonomicon for details.
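
As a small illustration of why dropping a &T does nothing: a shared reference is Copy and has no drop glue, which can be checked with std::mem::needs_drop. The types MyThing and Borrower below are hypothetical stand-ins, not the original poster's code, and needs_drop is documented as a conservative hint, so the values noted are what current rustc reports rather than a language guarantee. The sketch never creates a dangling reference.

```rust
use std::mem::needs_drop;

// Hypothetical stand-in for the type shared across the FFI boundary.
pub struct MyThing {
    pub name: String,
}

// Hypothetical wrapper that only borrows a MyThing and owns nothing.
pub struct Borrower<'a> {
    pub thing: &'a MyThing,
}

/// Returns whether drop glue is needed for
/// (a shared reference, a struct holding only a reference, the owned type).
pub fn drop_glue_report() -> (bool, bool, bool) {
    (
        needs_drop::<&MyThing>(),
        needs_drop::<Borrower<'static>>(),
        needs_drop::<MyThing>(),
    )
}
```

Calling drop_glue_report() should show that only the owned MyThing (which contains a String) needs to run any code on drop; the reference and the borrowing wrapper do not.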

Yes, it is true. The existence of ptr::addr_of and ptr::addr_of_mut is evidence of that.

However, I can't understand why such an operation is UB. How can bad stuff happen there?

So why were ptr::addr_of and ptr::addr_of_mut introduced to avoid &*(base.field)?
Must the compiler load the memory to calculate the field address?

Can't the compiler just translate this expression to an add operation, base_address + offsetof(field, struct), and re-interpret the result as a raw pointer?

No, this has nothing to do with what code the compiler actually emits. UB isn't defined at the hardware or assembly level. UB is defined in terms of the high-level constructs of the language. addr_of() isn't necessary because it compiles to different code than &invalid_ptr.field; it's necessary because the latter conceptually results in the use (and even dereference) of an invalid pointer. It doesn't matter what code gets emitted from them – if at the high-level, a dereference or use of an invalid pointer occurs, it's UB because the language says so.
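
To make the distinction concrete, here is a minimal sketch of the pattern addr_of_mut! exists for: getting a raw pointer to a field without ever creating a reference to memory that is not yet a valid value. The struct S and its fields are hypothetical, chosen to echo the names used in this thread; repr(C) is an assumption added so the layout is predictable.

```rust
use std::mem::MaybeUninit;
use std::ptr::addr_of_mut;

// Hypothetical struct for illustration only.
#[repr(C)]
pub struct S {
    pub header: u64,
    pub field: i32,
}

/// Initialises an `S` field by field without ever creating a reference
/// to uninitialised memory.
pub fn init_via_raw_pointers() -> S {
    let mut slot = MaybeUninit::<S>::uninit();
    let base: *mut S = slot.as_mut_ptr();
    unsafe {
        // `addr_of_mut!` yields a raw pointer to the field directly; no `&mut`
        // to the not-yet-initialised `S` or its fields is ever produced.
        addr_of_mut!((*base).header).write(1);
        addr_of_mut!((*base).field).write(7);
        // Both fields are now initialised, so `assume_init` is sound.
        slot.assume_init()
    }
}

/// Byte distance from the start of `S` to `field`, computed from raw pointers only.
pub fn field_offset() -> usize {
    let mut slot = MaybeUninit::<S>::uninit();
    let base: *mut S = slot.as_mut_ptr();
    let field_ptr = unsafe { addr_of_mut!((*base).field) };
    field_ptr as usize - base as usize
}
```

The generated code is very likely the same "base + offset" arithmetic that &mut (*base).field would produce. The difference is only in what the source program asserts: the reference form claims the field holds a valid value, and the raw-pointer form does not.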

I gave an example earlier where a reference must be valid even if you don't read from it. Was the example unclear in some way?

My question is why &((*raw_ptr).field) can't be compiled directly to the same code as ptr::addr_of.

I understand your post above. However, that case is trivial; this one may be something different, related to memory preloading.

In optimization, preloading is reasonable, but it is not reasonable for &((*raw_ptr).field) to perform a memory load.

To my understanding, it seems this is because the expression is syntactically indistinguishable from a normal &ref, so it was declared to be UB at the cost of introducing ptr::addr_of.

I mean, you could change the rules for references from

A reference must point at a valid value.

To

A reference must point at a valid value, except for these cases: ...

But making the rules more complicated should generally be avoided.

They usually do compile to the same code, and this is also what @H2CO3 said, although the sentence was a bit convoluted.

The point is: if I can prove this:

Then I can say &((*raw_ptr).field) is in fact not UB.
Then unsafe { Box::from_raw(thing) } is in fact not UB either.

OK, but UB should at least sometimes (if not always) lead to unexpected behavior, right?
I can't find that possibility for &invalid_ptr.field.

No. There's no such requirement. UB isn't a synonym for "crash". UB means anything can happen, including nothing.

No, UB includes "possibly nothing", not "definitely nothing".

&invalid_ptr.field is definitely nothing. The code in the final binary is the same as (base + offset) as *const T.

UB allows the compiler to do anything. Doing what you wanted it to do is included in "anything".

Or in other words, just because the compiler happens to do what you want on this particular example does not prove that there is no UB.

No.

Whether the compiler can detect the UB or not is completely irrelevant to the question of whether the program has UB or not.

No, there is no such requirement.

I'm not sure whether it's true, but it's possible that the compiler today never breaks expressions that involve &invalid_ptr.field. (It probably isn't true, especially if invalid_ptr is a null pointer, but whatever.)

However, it would still be UB.

Among other things, UB means that we reserve the right to break your code in the future. Therefore, &invalid_ptr.field is not definite nothing, as we explicitly reserve the right to break such code.

That said, there's no requirement that we actually break the code in the future.

@zylthinking1 I came up with an analogy that you might find helpful.

A: I want to do X.

B: I can't promise that X will work.

A: I tried doing X and it worked. Therefore, I have a promise from B that X will work.

Do you see the problem in this conversation?

Yes, the problem is familiar.

However, you have to admit that for &invalid_ptr.field, there are not many choices for the compiler to make.

If there is another choice, then I will understand why &invalid_ptr.field is UB.

Maybe I failed to make myself clear. Ever since the post that first mentioned ptr::addr_of, I have in fact been looking for an explanation that tells me there is another way to compile &invalid_ptr.field, one the compiler would prefer over (base + offset) as *const T.

But if it turns out there is only one choice...

The reason that &invalid_ptr.field is UB is that the rules are a lot simpler that way.

If this is the only reason, may I say it implies that &invalid_ptr.field always compiles to (base + offset)? Then, although it is UB, it is 'definitely' well defined.

Then unsafe { Box::from_raw(thing) } will always be safe in practice, even though it is UB in principle.

Well, this kind of reasoning is generally dangerous. What if your argument is wrong? And it is usually not too difficult to change the program so that there is no UB. So I ask: why would you make such an argument instead of simply writing your program without UB?
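
One way such a change can look, sketched under assumptions: store a raw pointer instead of a reference in the type that outlives its pointee. Unlike a reference, a raw pointer is allowed to dangle as long as it is never dereferenced. MyThing, Holder, and drop_in_wrong_order are hypothetical names, not the original poster's code.

```rust
// Hypothetical types; names are illustrative only.
pub struct MyThing {
    pub value: i32,
}

pub struct Holder {
    // A raw pointer may dangle; only dereferencing it requires validity.
    pub thing: *const MyThing,
}

/// Frees the pointee first, then the holder, and returns whether the
/// stored address is unchanged. The dangling pointer is never dereferenced.
pub fn drop_in_wrong_order() -> bool {
    let raw: *mut MyThing = Box::into_raw(Box::new(MyThing { value: 5 }));
    let addr_before = raw as usize;
    let holder = Box::new(Holder { thing: raw });

    // Free the pointee while `holder` still stores its address.
    unsafe { drop(Box::from_raw(raw)) };

    // `holder.thing` now dangles. Reading the pointer value itself is fine.
    let addr_after = holder.thing as usize;
    drop(holder); // dropping a raw pointer field runs no code
    addr_before == addr_after
}
```

With a &MyThing field instead, the same drop order would leave a dangling reference alive inside Holder, which is exactly the situation being debated here.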

I think it is probably not the case that this always works. In particular, I think it is possible to find an example where this miscompiles if invalid_ptr is a null pointer — LLVM is quite happy to remove code that dereferences a null pointer.

However, the fact that such an example probably exists is not the reason that it is UB. The reason is that the rules are simpler that way.

There are other choices a compiler might sensibly make. For example, when inlining a function such as this:

pub fn certain_function(a: *mut S, x: i32) -> i32 {
    if x > 100 {
        42
    } else {
        let foo = unsafe { &((*a).field) };
        *foo
    }
}

the compiler may be able to look at the surrounding code and figure out that whenever x <= 100, foo would be a dangling reference. Since that would be UB, it may assume this branch is never reached, which allows it to avoid computing *(a + offset) and instead convert the program to the following:

pub fn certain_function(a: *mut S, x: i32) -> i32 {
    if x > 100 {
        42
    } else {
        let number = 42;
        let foo = &number;
        *foo
    }
}

which then enables further optimization:

pub fn certain_function(a: *mut S, x: i32) -> i32 {
    if x > 100 {
        42
    } else {
        42
    }
}

and finally

pub fn certain_function(a: *mut S, x: i32) -> i32 {
    42
}