[Solved] Variance of `dyn Trait + 'a`

Continuing the discussion from The Confessional Thread: Parts of Rust that I still don't get after all this time:

cc @Yandros, @ExpHp

I think this is a bug. I tried out some more variations:

// none of these compile
// struct Foo<'x>(dyn Trait + 'x);
// struct Foo<'x>(*const (dyn Trait + 'x));
// struct Foo<'x>(*mut (dyn Trait + 'x));
// struct Foo<'x>(Box<dyn Trait + 'x>);
// type Foo<'x> = Box<dyn Trait + 'x>;

// only this compiles
type Foo<'x> = dyn Trait + 'x;

trait Trait {}
const _ : () = {
    fn check<'short, 'long : 'short> (
        it: &'short mut Foo<'long>,
    )
    {
        let _: &'short mut Foo<'short> = it;
    }
    let _ = check;
};

In particular, I'm surprised that struct Foo<'x>(dyn Trait + 'x); behaves differently from the bare trait object. There should be no difference between it and dyn Trait + 'x, but the trait object version compiles where the struct version does not! Merely wrapping the trait object in a struct breaks things, which is really surprising.


Interesting, especially the unsized wrapper. I somehow considered that Sized was playing a role (as the reason one cannot easily exploit this variance with mem::swap and the like is because of !Sized-ness).

So, either this is not exploitable, and thus the struct Foo<'x>(dyn Trait + 'x); case should also work (minor bug), or it is exploitable, and thus &mut ...<dyn Trait + 'x> should never be covariant (soundness bug!).


I do think Sized has something to do with it. That's the problem I immediately ran into when trying to exploit it, because to exploit covariance bugs you typically need to replace the pointed-to value.

Perhaps it can be exploited in some way using the unsized locals feature?

#![feature(raw)]

use std::alloc::Layout;
use std::mem::MaybeUninit;
use std::raw::TraitObject;

fn swap_unsized<T: ?Sized>(a: &mut T, b: &mut T) {
    // Both values must have the same size and alignment.
    let a_layout = Layout::for_value(&*a);
    let b_layout = Layout::for_value(&*b);

    assert_eq!(a_layout, b_layout);

    let ptr_layout = Layout::new::<&mut T>();

    if ptr_layout.size() == 2 * std::mem::size_of::<usize>() {
        // Fat pointer: treat it as a trait object and compare the vtables.
        unsafe {
            let a_trait: TraitObject = std::mem::transmute_copy(&a);
            let b_trait: TraitObject = std::mem::transmute_copy(&b);

            assert_eq!(a_trait.vtable, b_trait.vtable);
        }
    } else {
        // Thin pointer: T is a sized type.
        assert_eq!(ptr_layout.size(), std::mem::size_of::<usize>())
    }

    unsafe {
        // View both values as possibly-uninitialized bytes, so that
        // padding bytes are never read as initialized u8 values.
        let a = std::slice::from_raw_parts_mut(
            a as *mut T as *mut MaybeUninit<u8>,
            a_layout.size(),
        );
        let b = std::slice::from_raw_parts_mut(
            b as *mut T as *mut MaybeUninit<u8>,
            b_layout.size(),
        );

        a.swap_with_slice(b);
    }
}

This function should be safe for swapping trait objects. It uses the nightly raw feature, but that's all. It will incidentally work for slices and sized types as well, but I don't think that's relevant right now.

edit: fixed behavior around padding byte

edit 2: swap_unsized is unsound for trait objects, precisely because of the unsizing coercions shown in the OP; see @eddyb's comments below


Indeed. And if we were able to have downcasting without the : 'static bound (although I think that will never happen), then we could swap two identical trait objects by downcast_muting them to their concrete type.


Aside

Using mem::swap::<u8>() there is unsound; you should use ptr::swap, ptr::swap_nonoverlapping or mem::swap::<MaybeUninit<u8>>() instead.

You could, on the other hand, swap in "one go" by using <[MaybeUninit<u8>]>::swap_with_slice()

Other than that, it's a very interesting function that could, for instance, enable .take() on trait objects that originated from an Option type :slightly_smiling_face:
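
As a minimal sketch of the "one go" byte swap mentioned in the aside, here is a version restricted to sized types, which sidesteps the trait-object questions discussed in this thread. The name swap_bytes is hypothetical and only used for illustration:

```rust
use std::mem::{size_of, MaybeUninit};
use std::slice;

/// Swaps two sized values by swapping their raw bytes in a single call.
/// (Hypothetical helper; for sized types `std::mem::swap` already does this.)
fn swap_bytes<T>(a: &mut T, b: &mut T) {
    let len = size_of::<T>();
    // SAFETY: `a` and `b` are distinct `&mut T`, so the two byte ranges are
    // valid for `len` bytes and cannot overlap. Viewing them as
    // `MaybeUninit<u8>` means padding bytes are never read as initialized `u8`.
    unsafe {
        let a = slice::from_raw_parts_mut(a as *mut T as *mut MaybeUninit<u8>, len);
        let b = slice::from_raw_parts_mut(b as *mut T as *mut MaybeUninit<u8>, len);
        a.swap_with_slice(b);
    }
}
```

The `(u8, u32)` tuple type is a convenient test case because it contains padding bytes, which is exactly where a plain `u8` view would be problematic.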


It's possible that &'short mut Foo<'long> coerces to &'short mut Foo<'short> (via an unsizing coercion). Can you try adding a *mut (just so you don't need another lifetime) around the whole thing?

It's possible we missed a bug here (I remember we fixed some variance issue in this area a while ago), or it's technically fine (but in a way unrelated to variance, just dyn Trait being special).

AFAICT shrinking the dyn Trait lifetime is unexploitable because nothing inside the trait object can observe it.

The underlying type is always immutable, even if you have a &mut dyn Trait, because it's fundamentally impossible (in Rust) to compare types at runtime (even if they have the same lifetime bound, the actual lifetimes inside might differ, and those can't be compared).


EDIT: misunderstood the suggestion

Sorry, I was a bit too unclear. *mut dyn Trait coerces exactly in the same way.

This is what I meant, and it looks like variance is fine and you were seeing an unsizing coercion.

The definite proof is in the MIR though, for the original example you have this line:

_2 = move _3 as &mut dyn Trait (Pointer(Unsize));

That explains why adding a struct in between would break things.
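
To make the distinction concrete, here is a small sketch (the names Holder and shrink_bare are made up for illustration). The bare trait object case compiles because the compiler inserts an unsizing coercion at the top level of the reference; the wrapped case, shown only as a comment, has no such coercion available and is rejected because &mut is invariant:

```rust
trait Trait {
    fn get(&self) -> u8;
}

struct Holder<'a>(&'a u8);

impl Trait for Holder<'_> {
    fn get(&self) -> u8 {
        *self.0
    }
}

// Accepted: `&mut (dyn Trait + 'long)` is coerced (not subtyped) to
// `&mut (dyn Trait + 'short)`.
fn shrink_bare<'short, 'long: 'short>(
    it: &'short mut (dyn Trait + 'long),
) -> &'short mut (dyn Trait + 'short) {
    it
}

// Rejected, as reported in the first post: the trait object is nested inside
// another type, so only variance applies, and `&mut` is invariant.
//
// struct Wrapper<'x>(Box<dyn Trait + 'x>);
// fn shrink_wrapped<'short, 'long: 'short>(
//     it: &'short mut Wrapper<'long>,
// ) -> &'short mut Wrapper<'short> {
//     it
// }
```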


Just to drive the point home here, since I missed some upthread discussion: this means the swap_unsized function is unsound and should never be used.

Also, if you're interested whether this coercion is intentional or where it happens, it's in traits::select (link to specific line).


This is intentional behavior that was RFC'ed in 0599


Even if we didn't have unsizing coercions that could shrink the lifetime bound on the trait, swap_unsized would still be unsound.

The problem is that when you create a &mut (dyn Trait + 'x) from a &mut T, only T: 'x is required. T could have any number of lifetimes that aren't separately tracked; they just need to be longer than, or equal to, 'x.

So there's no way at runtime to check that two &mut (dyn Trait + 'x) point to the exact same concrete type, even if 'x is the same.

(I should mention that &mut dyn Any + 'static lets you downcast to some &mut T, and you could mem::swap a pair of those, but the 'static is very important there, disallowing any lifetimes)
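
A minimal sketch of that 'static escape hatch, assuming the caller names the concrete type (swap_if_same is a hypothetical helper, not a standard library function):

```rust
use std::any::Any;
use std::mem;

/// Swaps the two values if both are of concrete type `T`.
/// Returns `false` and leaves them untouched otherwise.
fn swap_if_same<T: Any>(a: &mut dyn Any, b: &mut dyn Any) -> bool {
    match (a.downcast_mut::<T>(), b.downcast_mut::<T>()) {
        (Some(a), Some(b)) => {
            // Both are known to be exactly `T`, and `T: 'static` rules out
            // any hidden lifetimes, so an ordinary swap is fine.
            mem::swap(a, b);
            true
        }
        _ => false,
    }
}
```

This only works because Any requires 'static: the type check cannot be fooled by two types that differ only in their lifetimes.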


This was the first key point, the second one being that "forcing a manual implementation of an assignment" such as @RustyYato's swap_unsized is unsound:

  • Indeed, equality of layout and of vtable pointer (and of 'lt in dyn Trait + 'lt) does not suffice to assume that the underlying / erased types are the same.

    Counter-example

    Given,

    #[derive(Debug)]
    struct Struct<'short, 'a : 'short, 'b : 'short> {
       short: &'short u8,
       a: &'a u8,
       b: &'b u8,
    }
    

    there is no way to distinguish a &mut Struct::<'short, 'a1, 'b1> { ... } as &mut (dyn Debug + 'short) from a &mut Struct::<'short, 'a2, 'b2> { ... } as &mut (dyn Debug + 'short), so they would pass the swap_unsized guards and cause unsoundness, without using the "surprising" unsizing coercion of the OP.
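
The counter-example can be checked in safe code, at least for the layout guard. The sketch below uses hypothetical helper names and only compares layouts, since the vtable comparison needs the nightly raw feature. It erases two Struct values with different 'a and 'b into the same dyn Debug + 's type and confirms that nothing at runtime tells them apart:

```rust
use std::alloc::Layout;
use std::fmt::Debug;

#[allow(dead_code)]
#[derive(Debug)]
struct Struct<'short, 'a: 'short, 'b: 'short> {
    short: &'short u8,
    a: &'a u8,
    b: &'b u8,
}

/// Mirrors the layout guard of `swap_unsized` for two erased values.
fn same_layout<'s>(x: &(dyn Debug + 's), y: &(dyn Debug + 's)) -> bool {
    Layout::for_value(x) == Layout::for_value(y)
}

/// Builds two values whose `'a` and `'b` differ and erases both.
fn erased_pair_passes_layout_guard() -> bool {
    static LONG: u8 = 1;
    let short = 0u8;
    let local = 2u8;
    // Here `'a` and `'b` are both `'static`.
    let first: Struct<'_, 'static, 'static> = Struct { short: &short, a: &LONG, b: &LONG };
    // Here `'a` and `'b` are tied to a local variable.
    let second = Struct { short: &short, a: &local, b: &local };
    let ok = same_layout(&first, &second);
    ok
}
```

Lifetimes are erased before code generation, so both values share one concrete type at runtime (and hence one vtable). A byte swap would let second.a, which is only valid for the scope of local, masquerade as a &'static u8 inside first.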


I am still somewhat puzzled. So the conclusion was that it is sound because assignment of unsized values is impossible?

That would be a very dissatisfying answer. Subtyping should have to do with the question "is every instance of T also an instance of U", not something about assignment. Relying on the unassignability of unsized values is a band-aid if our subtyping relation is broken.

But there also seems to be the conclusion that actually there is no subtyping for &mut trait objects either and something else was going on. (And indeed since &mut is invariant everything else would be really strange.) So mystery solved? Not entirely, I am still wondering if we could have subtyping on that trait object lifetime...

So, is every instance of dyn Trait + 'static also an instance of dyn Trait + 'a? I actually think so, but I am not sure -- because I am not sure if I actually properly understood the meaning of that lifetime bound. Sure, intuitively it has something to do with "the lifetime of the underlying value", but I never found that answer very satisfying -- in building a formal model of Rust lifetimes, the "lifetime of a value" is not a concept that ever came up.

To understand these lifetime bounds I started wondering what would go wrong in the proof if we just didn't have the bound, and came to a conclusion that quite surprised me: I think it has to do with the implicit "well-formed" condition on each function call. When we have a function like fn foo(self: &'a mut Struct<'b>), there are implicit requirements that 'b: 'a and that 'a: 'fn, where I use 'fn to denote the lifetime of the function call. In Rust it is impossible to write a function that does not have these bounds, but in our formal system we do not have that limitation, and you get a perfectly sane system that way -- it's just more explicit. To recover a more Rust-like feel, we added some notation where given a type we compute the WF conditions that a function taking that type would add.

But what happens now if you add type erasure (aka trait objects)? Well, we don't know what the WF conditions of the underlying type are! If we assume foo is part of Trait and we impl Trait for Struct<'_>, then Rust auto-generates a shim function for us, call it TraitObjShim::foo, with the signature fn foo(self: &'a dyn Trait). That function then dispatches to Struct::foo. But Struct::foo requires 'b: 'a, and TraitObjShim::foo has no way to prove that condition! Nothing prevents the user from calling TraitObjShim::foo after 'b is over. Thus doing the call would be unsound.

Now I first thought "OMG we have to abstract over the well-formedness predicate" but it turns out we don't. The lifetime bound in a trait object is a rather ingenious solution to avoid having to do that. The signature of TraitObjShim::foo instead becomes fn foo(self: &'a (dyn Trait + 'b)), and there is an invariant that outliving 'b is good enough to satisfy all well-formedness requirements of the actual underlying type. That is how the shim can prove to Struct::foo that all well-formedness requirements are upheld.

Basically, the meaning of the lifetime 'a in dyn Trait + 'a is "'a has to still be ongoing for you to be allowed to call trait object methods on it". It has nothing to do with the "lifetime of the underlying value", it is a proxy ensuring you are upholding well-formedness conditions of the methods you are calling.
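
In surface Rust, that reading can be sketched roughly as follows. This is only an illustration of the theory, with made-up names (the field value, erase, call_via_object), not a model of how the compiler generates the shim. The bound 'b: 'a is written out explicitly to show the condition that the trait object's lifetime stands in for:

```rust
trait Trait {
    fn foo(&self) -> u8;
}

struct Struct<'b> {
    value: &'b u8,
}

impl<'b> Trait for Struct<'b> {
    // Calling this is only well-formed while `'b` is still ongoing.
    fn foo(&self) -> u8 {
        *self.value
    }
}

/// Erasing the type keeps `'b` around as the bound on the trait object.
fn erase<'b>(s: Struct<'b>) -> Box<dyn Trait + 'b> {
    Box::new(s)
}

/// The caller must show that `'b` outlives the borrow `'a`; that is all
/// the dynamic dispatch needs to know about the hidden concrete type.
fn call_via_object<'a, 'b: 'a>(obj: &'a (dyn Trait + 'b)) -> u8 {
    obj.foo()
}
```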

That is my theory, anyway. I haven't proven it yet, we haven't gotten around to actually formally modelling trait objects. So maybe @eddyb will take it apart immediately. I am curious what y'all think about this.


Now, with that theory in mind, what does that teach us about variance? If the point of the lifetime parameter is to ensure that trait object shims can uphold the well-formedness requirements of the functions they are dynamically dispatching to, that means:

  • If the trait has no methods, we are actually free to change the lifetime any way we want. I think. This is a bold claim and thus probably the best way to test if my theory makes any sense...
  • But generally the trait will have methods. Then the variance should be such that well-formedness can only become stricter, which corresponds to the lifetime becoming shorter. So that's the same variance as references have wrt. their lifetime.

Maybe that's not surprising to anyone involved, but I had to go through all this to make sense of such a statement. But now I am worried because @Yandros said in their original post that it would be "unsound" if Box<dyn Trait + 'long> would coerce to Box<dyn Trait + 'short>. What would be the example for that?
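
For reference, shortening the lifetime bound behind a Box is accepted by the compiler today, which matches the variance argued for above. A minimal sketch, with hypothetical names (Unit, shorten, static_to_any):

```rust
trait Trait {
    fn name(&self) -> &'static str;
}

struct Unit;

impl Trait for Unit {
    fn name(&self) -> &'static str {
        "unit"
    }
}

/// A longer bound can always be replaced by a shorter one.
fn shorten<'short, 'long: 'short>(b: Box<dyn Trait + 'long>) -> Box<dyn Trait + 'short> {
    b
}

/// In particular, a `dyn Trait + 'static` can be used as a `dyn Trait + 'a`
/// for any `'a`; `_witness` only exists to name such an `'a`.
fn static_to_any<'a>(b: Box<dyn Trait + 'static>, _witness: &'a u8) -> Box<dyn Trait + 'a> {
    shorten(b)
}
```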


That's indeed the same intuition I ended up with (much better phrased, obviously :ok_hand:); cf. the example from my previous post:

struct Struct<'short, 'a : 'short, 'b : 'short> {
   short: &'short u8,
   a: &'a u8,
   b: &'b u8,
}

That is, when a struct has multiple lifetimes, we can find a lifetime within which no usage of any instance of that type can be unsound ('short in the example; more generally it would be an "intersection" of the lifetimes). This works because a trait object is only ever used through its methods, and those methods start out well-formed on the concrete type.

You are right, I forgot to edit that post and update it with the findings / conclusions from this thread, thanks for pointing that out :+1:

That sounds correct. I generally try to avoid saying "lifetime of the underlying value" since it's not really a thing. The lifetime bound in the trait object type is more of a "witness" to the shortest lifetime in the static type (or more accurately, a lifetime that's shorter than or equal to every lifetime in the static type), preventing the trait object from outliving anything referenced from the static type.

You always have the destructor, so that's a really bad idea IMO. You don't want the semantics to accidentally lead to e.g. Rc<dyn NoMethods + 'a> outliving some borrow a destructor might access.
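
A small sketch of why the destructor matters even for a method-less trait (Guard and erase are hypothetical names). The erased value still touches its borrow when it is dropped, so the 'a bound cannot be ignored or lengthened:

```rust
use std::cell::Cell;

trait NoMethods {}

/// Holds a borrow that its destructor uses.
struct Guard<'a>(&'a Cell<u32>);

impl NoMethods for Guard<'_> {}

impl Drop for Guard<'_> {
    fn drop(&mut self) {
        // Observable access to the borrowed data at drop time.
        self.0.set(self.0.get() + 1);
    }
}

/// The trait has no methods, but the `'a` bound still has to be respected,
/// because dropping the box runs `Guard::drop`.
fn erase<'a>(counter: &'a Cell<u32>) -> Box<dyn NoMethods + 'a> {
    Box::new(Guard(counter))
}
```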


Fully agreed. This is a common misconception though.

Ah good point, there always is at least one method. I guess you'd have to have a NoDrop marker trait to actually realize my weird idea.
