Marker types: Unit-like structs or zero-variant enums? PhantomData or not?

I sometimes use types that simply serve as a marker when passed as type arguments to other types. A recent example from real-life code where I used this is mmtkvdb::KeysUnique and KeysDuplicate.

In the above case, I used unit-like structs. They are types which have exactly one value, thus carry no information. These are zero-sized types, thus it would be possible to include them in another struct, and they would not consume any memory.

An alternative would be to use zero-variant enums. These types have zero values, thus they can never be instantiated. You could say they are zero-sized too (and apparently they are in Rust), though I prefer to think of them as being of undefined size (or having a size of −∞).
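
The size claims above can be checked with `std::mem::size_of`. This is only a minimal sketch; the names `Unit`, `Never`, `WithUnit`, `WithPhantom`, and `sizes` are made up for illustration and do not come from mmtkvdb:

```rust
use std::marker::PhantomData;
use std::mem::size_of;

// Hypothetical marker types, for illustration only.
pub struct Unit; // exactly one value
pub enum Never {} // no values at all

// A marker stored by value next to a payload.
pub struct WithUnit {
    pub marker: Unit,
    pub payload: i32,
}

// A marker that only appears at the type level.
pub struct WithPhantom {
    pub marker: PhantomData<Never>,
    pub payload: i32,
}

/// Sizes of `Unit`, `Never`, `WithUnit`, and `WithPhantom`, in that order.
pub fn sizes() -> [usize; 4] {
    [
        size_of::<Unit>(),
        size_of::<Never>(),
        size_of::<WithUnit>(),
        size_of::<WithPhantom>(),
    ]
}
```

Both markers report a size of zero, and neither wrapper struct is larger than its `i32` payload.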

If I use zero-variant enums, then I cannot include them like this in a struct:

struct SomeType<M> {
    marker: M,
    /* … */
}

Instead, I would need to use PhantomData:

struct SomeType<M> {
    marker: PhantomData<M>,
    /* … */
}

If I use PhantomData, I could use either unit-like structs or zero-variant enums.

But what is best?

  • unit-like structs without PhantomData
  • unit-like structs with PhantomData
  • zero-variant enums with PhantomData

I compiled a Playground example below to show the different variants and to share some observations I made:

use std::marker::PhantomData;

// We could use unit-like structs:
#[derive(Clone, Copy)]
pub struct Unit1;
#[derive(Clone, Copy)]
pub struct Unit2;

// Or we could use zero-variant enums:
pub enum Never1 {}
pub enum Never2 {}

// If we use unit-like structs, we don't need PhantomData:
pub struct Unitized<M> {
    marker: M,
    payload: i32,
}

// But then we might need extra bounds in a couple of places:
impl<M> Clone for Unitized<M>
where
    M: Clone, // we need this bound in a lot of places :-(
    //M: Default, // or this one
{
    fn clone(&self) -> Self {
        Unitized {
            marker: self.marker.clone(),
            //marker: Default::default(), // or this
            payload: self.payload.clone(),
        }
    }
}

// Since we don't really use `M` for anything but marking,
// we could also make `marker` a `PhantomData`:
pub struct Phantomized<M> {
    marker: PhantomData<M>,
    payload: i32,
}

// That makes things easier:
impl<M> Clone for Phantomized<M> {
    fn clone(&self) -> Self {
        Phantomized {
            marker: PhantomData,
            payload: self.payload.clone(),
        }
    }
}

fn main() {
    // Without `PhantomData`, we must use the unit-like approach:
    let _unitized = Unitized {
        marker: Unit1,
        payload: 0,
    };
    // With `PhantomData`, we could use either the unit-like structs
    // or the zero-variant enums:
    let _phantomized_with_unit = Phantomized::<Unit1> {
        marker: PhantomData,
        payload: 0,
    };
    let _phantomized_with_never = Phantomized::<Never1> {
        marker: PhantomData,
        payload: 0,
    };
}

(Playground)

In my real-life code, I currently use unit-like structs plus PhantomData, see mmtkvdb::Db, which corresponds to the variant _phantomized_with_unit in the above Playground.

My question is: Is this a matter of taste? Are there reasons to prefer one of the approaches over the others? What do you think of it? What's idiomatic?

My personal feeling has been that PhantomData<Never> is very awkward, because Never is uninhabited, while PhantomData<T> is inhabited regardless of whether T is. If there were a field t: T buried somewhere, the value could not possibly exist. That uncertainty makes me frown a bit.
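
To make the awkwardness concrete, here is a small sketch (all names are hypothetical): a struct holding `PhantomData<Never>` can be constructed like any other value, while a struct holding an actual `Never` field cannot be built in safe code.

```rust
use std::marker::PhantomData;

// Hypothetical uninhabited marker.
pub enum Never {}

// Holds the marker only at the type level.
pub struct Phantomized<M> {
    pub marker: PhantomData<M>,
    pub payload: i32,
}

impl<M> Phantomized<M> {
    // Works for every `M`, including uninhabited ones,
    // because `PhantomData<M>` always has a value.
    pub fn new(payload: i32) -> Self {
        Phantomized { marker: PhantomData, payload }
    }
}

// Holds a real `Never`; safe code has no way to produce one,
// so no value of `Stored` can be constructed.
pub struct Stored {
    pub never: Never,
    pub payload: i32,
}

// Given a `Stored`, the match needs no arms: there is no case to handle,
// so the expression can stand in for any return type.
pub fn impossible(s: Stored) -> String {
    match s.never {}
}
```

`Phantomized::<Never>::new(7)` compiles and yields an ordinary value, which is exactly the mismatch described above.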

I would avoid uninhabited types because I feel it is preferable to use a design which could contain actual data, not just a ZST, or is as close to that as possible.

ZSTs are pretty magical in that you get the benefit of compile-time guarantees via zero-cost runtime type selection.

I only recently saw the zero-variant enum option (here), but am quite interested if there are any real benefits to using it. Because to me it seems like a bit of a mismatch as they can never be instantiated (same as never), so you can only apply them in turbofish type selection, rather than simply passing an instance to a _type_selection: SomeZST function argument. Perhaps it's just a matter of preferred syntax?

I think PhantomData is one of Rust's biggest footguns. It has a lot of nuance to correct use, and people tend to encounter it when they are skilled enough to solve moderately hard problems but not experienced enough to really understand what it does beyond "fix the error". This unsurprisingly leads to a lot of incorrect usage, even in std until recently.

Since the phantom is unnecessary here, and simply using a regular ZST also has the advantage of being able to extend to a nonzero size in the future (e.g. dyn Trait, or an enum) I recommend doing that instead.

In general, it's better to prefer a design that stores the type and dispatches based on method calls on the stored instance.

In specific cases, it can be necessary to store a strategy parameter but transmute the container, or have a strategy parameter which only has non-method functionality. In the former case it's necessary and in the latter case it's better to use PhantomData.

Including an instance of the strategy parameter also means that (for dyn safe strategies) you can use a dynamic &dyn Strategy. Including an instance and using a method-dispatched strategy is just more flexible; use it where possible.
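
As a rough sketch of what storing the strategy instance buys you, the following uses a hypothetical `Strategy` trait and `Container` type (neither is from mmtkvdb). The trait deliberately has no `Copy` or `Clone` supertrait, so it can also be used as `&dyn Strategy`:

```rust
/// Hypothetical strategy trait, usable as a trait object.
pub trait Strategy {
    fn has_duplicate_keys(&self) -> bool;
}

#[derive(Clone, Copy)]
pub struct KeysUnique;
#[derive(Clone, Copy)]
pub struct KeysDuplicate;

impl Strategy for KeysUnique {
    fn has_duplicate_keys(&self) -> bool {
        false
    }
}

impl Strategy for KeysDuplicate {
    fn has_duplicate_keys(&self) -> bool {
        true
    }
}

// Forwarding impl so that a borrowed trait object is itself a strategy.
impl<'a> Strategy for &'a dyn Strategy {
    fn has_duplicate_keys(&self) -> bool {
        (**self).has_duplicate_keys()
    }
}

// The container stores the strategy by value instead of using `PhantomData`.
pub struct Container<S> {
    pub strategy: S,
    pub payload: i32,
}

impl<S: Strategy> Container<S> {
    // Dispatches through a method call on the stored instance.
    pub fn describe(&self) -> &'static str {
        if self.strategy.has_duplicate_keys() {
            "duplicate keys"
        } else {
            "unique keys"
        }
    }
}
```

With this shape, `Container { strategy: KeysDuplicate, payload: 1 }` is resolved statically, while `Container { strategy: chosen, payload: 2 }` with `chosen: &dyn Strategy` defers the choice to run time, using the same `describe` method in both cases.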

Oh yeah, I had forgotten about the various flavors of PhantomData. It's noted in the docs, but it would be really nice to have the standard library directly codify them as named type aliases:

use ::std::marker::PhantomData;
use ::std::cell::Cell;

/// The standard covariant form of PhantomData for a type T. This will call drop check of T.
pub type DropCheckPhantom<T> = PhantomData<T>;

/// An alternate covariant form of PhantomData for a type T. Does not call the drop check of T.
pub type NonDropPhantom<T> = PhantomData<fn() -> T>;

/// PhantomData for a type T that is contravariant over T.
pub type ContravariantPhantom<T> = PhantomData<fn(T)>;

/// PhantomData for a type T that is invariant over T.
pub type InvariantPhantom<T> = PhantomData<fn(T) -> T>;

/// PhantomData which will enforce that a struct containing it is not Sync.
pub type PhantomNotSync = PhantomData<Cell<u8>>;

/// PhantomData which will enforce that a struct containing it is neither Sync nor Send.
pub type PhantomNoSyncNoSend = PhantomData<&'static Cell<u8>>;
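
The variance differences between these flavors can be observed directly with lifetimes. The following is a small sketch with hypothetical wrapper types `Covariant` and `Contravariant`; the helper functions `is_send` and `is_sync` are also made up for illustration:

```rust
use std::marker::PhantomData;

pub struct Covariant<T>(pub PhantomData<fn() -> T>);
pub struct Contravariant<T>(pub PhantomData<fn(T)>);

// Accepted: covariance lets a `Covariant<&'static str>` be used
// where a `Covariant<&'a str>` with a shorter lifetime is expected.
pub fn shorten<'a>(x: Covariant<&'static str>) -> Covariant<&'a str> {
    x
}

// Accepted: contravariance flips the direction.
pub fn lengthen<'a>(x: Contravariant<&'a str>) -> Contravariant<&'static str> {
    x
}

// With `PhantomData<fn(T) -> T>` (invariant in `T`), neither of the two
// function bodies above would be accepted by the compiler.

// Compile-time checks: these only type-check if the bound holds.
pub fn is_send<T: Send>() -> bool {
    true
}

pub fn is_sync<T: Sync>() -> bool {
    true
}
```

Because function pointers are always `Send` and `Sync`, a wrapper built on `PhantomData<fn() -> T>` stays `Send` and `Sync` even when `T` is something like `Rc<u8>`; for example, `is_send::<Covariant<std::rc::Rc<u8>>>()` compiles.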

Following that, shouldn't I then also refrain from using PhantomData? But then I also have to include #[derive(Clone)] and Clone bounds on all methods, I guess.

Note, however, that a lot of functions dealing with these marker types will then need additional bounds. But maybe requiring them to be Clone is cleaner (i.e. more semantically correct) anyway.

What does "non-method functionality" mean?

In my use case, I need to distinguish between databases with unique keys or with duplicate keys both at compile-time (because databases with unique keys may not support certain methods), and also at run-time (when interfacing with the native LMDB library).

Because I need to distinguish at compile-time anyway, I thought I don't really need a method here but could go for a constant instead.

Practically, going from constraint: PhantomData<C> to constraint: C would add a lot of bounds to other functions/methods. But maybe it's still the cleanest approach.
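
For reference, the constant-based design described here might look roughly like the sketch below. `Constraint`, `KeysUnique`, `KeysDuplicate`, and `DUPLICATE_KEYS` follow the names used in this thread; the simplified `Db`, the methods on it, and the value of `DUPSORT_FLAG` are assumptions made purely for illustration:

```rust
use std::marker::PhantomData;

pub trait Constraint: 'static {
    /// Duplicate keys allowed?
    const DUPLICATE_KEYS: bool;
}

pub struct KeysUnique;
pub struct KeysDuplicate;

impl Constraint for KeysUnique {
    const DUPLICATE_KEYS: bool = false;
}

impl Constraint for KeysDuplicate {
    const DUPLICATE_KEYS: bool = true;
}

// Illustrative flag value standing in for a constant of the native library.
pub const DUPSORT_FLAG: u32 = 0x04;

// Heavily simplified stand-in for a database handle.
pub struct Db<C> {
    pub constraint: PhantomData<C>,
}

impl<C: Constraint> Db<C> {
    pub fn new() -> Self {
        Db { constraint: PhantomData }
    }

    // Run-time use: translate the type-level choice into a flag value.
    pub fn native_flags(&self) -> u32 {
        if C::DUPLICATE_KEYS {
            DUPSORT_FLAG
        } else {
            0
        }
    }
}

// Compile-time use: this method only exists for duplicate-key databases.
impl Db<KeysDuplicate> {
    pub fn supports_duplicate_lookup(&self) -> bool {
        true
    }
}
```

Calling `supports_duplicate_lookup` on a `Db<KeysUnique>` is rejected at compile time, while `native_flags` is available for every constraint.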

Yes, that's an implication. (Though I would suggest a Copy bound instead, myself; it's more restrictive on the types implementing it, and more convenient, and you can loosen it later if you don't end up using the type in a Copy-only position.)

I see this as being the same kind of principle as “don't put trait bounds on struct declarations”: it may be that large parts of your API require a property (here, the type being a ZST), but that doesn't mean you should make choices that force it to have that property absolutely everywhere.

Of course, if this particular type is in practice always a compile-time flag, then it may be not worth adding any complications for the sake of theoretical flexibility that won't ever matter in practice.

Because I need to distinguish at compile-time anyway, I thought I don't really need a method here but could go for a constant instead.

If you can make that work, that sounds like a good idea. But if I remember correctly, associated constants can't yet feed into const generics, so you can't make a compile-time choice of types dependent on it.

I think using a never-enum is warranted when you specifically want to actively prevent users from constructing a value of the marker type. I'm not sure why you would want to do that, but I'm also inclined not to declare it as something you should never do.

The only substantial difference between the two approaches I can think of off the top of my head is that given a unit struct, you can resolve types by passing a value, since generic function parameters can be inferred:

// Two unit-like marker types to dispatch on.
struct SomeUnitStruct;
struct AnotherUnitStruct;

fn foo<T>(_: T) {
    // dispatch on T
}

fn main() {
    foo(SomeUnitStruct);
    foo(AnotherUnitStruct);
}

whereas following a "pure" type-based approach, you would have to call the function using turbofish: foo::<UnitStruct>().
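
Side by side, the two calling styles might look like this. The `Marker` trait and the names `UnitMarker`, `EmptyMarker`, `by_value`, and `by_type` are hypothetical:

```rust
pub trait Marker {
    const NAME: &'static str;
}

pub struct UnitMarker; // has exactly one value
pub enum EmptyMarker {} // has no value

impl Marker for UnitMarker {
    const NAME: &'static str = "unit";
}

impl Marker for EmptyMarker {
    const NAME: &'static str = "empty";
}

// Value-based: `T` is inferred from the argument.
pub fn by_value<T: Marker>(_: T) -> &'static str {
    T::NAME
}

// Type-based: `T` must be spelled out with turbofish.
pub fn by_type<T: Marker>() -> &'static str {
    T::NAME
}
```

`by_value(UnitMarker)` and `by_type::<UnitMarker>()` both work, but `EmptyMarker` can only be used through `by_type::<EmptyMarker>()`, since there is no value to pass.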

I'm honestly not sure there is an unequivocally better way, at least certainly not in the general case. Specific use cases might warrant using one or the other, but it's hard to tell in general.

I think this version is great, because it allows it to become like the Strategy pattern (Wikipedia).

The logic that depends on the marker can then live "on" those marker types, perhaps via a sealed trait so that you can update it as needed.

And then you have the option of including state in them in the future if needed. If none of the logic needs any state, then great, it's a ZST and things just work. But if one of the strategies wants to have runtime logic for something, then great, it can store that state and it'll work.
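
A minimal sketch of that idea, assuming a sealed `Constraint` trait and a later-added strategy `KeysLimited` that carries run-time state (the method `max_values_per_key`, the type `KeysLimited`, and the simplified `Db` are all invented for this example):

```rust
mod sealed {
    // Not nameable outside this crate, so only this crate
    // can add new implementors of `Constraint`.
    pub trait Sealed {}
}

pub trait Constraint: sealed::Sealed {
    /// Upper bound on values per key, if any.
    fn max_values_per_key(&self) -> Option<usize>;
}

// Stateless strategies: zero-sized.
pub struct KeysUnique;
pub struct KeysDuplicate;

// Hypothetical later addition that needs run-time state.
pub struct KeysLimited {
    pub limit: usize,
}

impl sealed::Sealed for KeysUnique {}
impl sealed::Sealed for KeysDuplicate {}
impl sealed::Sealed for KeysLimited {}

impl Constraint for KeysUnique {
    fn max_values_per_key(&self) -> Option<usize> {
        Some(1)
    }
}

impl Constraint for KeysDuplicate {
    fn max_values_per_key(&self) -> Option<usize> {
        None
    }
}

impl Constraint for KeysLimited {
    fn max_values_per_key(&self) -> Option<usize> {
        Some(self.limit)
    }
}

pub struct Db<C> {
    pub constraint: C,
}

impl<C: Constraint> Db<C> {
    // Would another value fit, given `existing` values under the key?
    pub fn accepts(&self, existing: usize) -> bool {
        match self.constraint.max_values_per_key() {
            Some(max) => existing < max,
            None => true,
        }
    }
}
```

`Db<KeysUnique>` is still zero-sized here, while `Db<KeysLimited>` grows by one `usize`; the code in `accepts` does not need to know which case it is in.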

Is that what you want in this case? I don't know. But it's a nice approach for certain kinds of markers.

To summarize:

  • unit-like structs without PhantomData

what @scottmcm said:


  • zero-variant enums with PhantomData

When you know there won't be any runtime state whatsoever / that you are only dealing with type-level shenanigans, this is my preferred approach. Indeed, by virtue of being uninhabited, you know your type will only be usable in the type-level realm, and convey that to the users (provided they be experienced enough to be comfortable with empty enums).


  • unit-like structs with PhantomData

Neither one nor the other; basically the worst of both worlds. Would not recommend.


Bonus: what about marker types that have to carry generic parameters?

In that case, for non-const generic parameters, neither the empty enum nor the unit struct approach will work. A full struct will be needed.

  • If the type is supposed to be instanced: PhantomData ordeal.

  • If not, I personally go for struct Foo<...>(*mut Self); (especially when having macros and not wanting to special-case the generics).

    This is a rather contrived pattern, though, which I mainly use for "generic modules" & generic consts, which can be useful for:

    • generic const:

      fn oh_no<T> ()
      {
          let mut slots = [None::<T>; 32]; // ❌ Error `Option<T> : Copy` does not hold
      }
      
      fn yay<T> ()
      {
          struct GenericConst<T>(*mut Self);
          impl<T> GenericConst<T> {
              const NONE: Option<T> = None;
          }
      
          let mut slots = [GenericConst::<T>::NONE; 32]; // ✅👌
      }
      
      • Remember that consts can be very useful for both array creation, as well as &'static promotion (e.g., let r: &'static Option<T> = &GenericConst::<T>::NONE;).
    • "generic module" (in this instance, to feature a const, so there is not that much difference 😅 but I guess the difference is that it could be public facing)

      //! "I want to write `NULL` pointers in Rust" starter pack.
      
      #[allow(nonstandard_style)] // generic module pattern.
      struct ptr<T : ?Sized>(*mut Self);
      
      impl<T : ?Sized> ptr<T> {
          const NULL: *mut T = ::core::ptr::null_mut();
      }
      
      let p: *mut i32 = ptr::NULL;
      let p2 = ptr::<u8>::NULL;
      
  • And a special mention to a crate dedicated to this very question:

To a first approximation, things that aren't dyn safe. Functions that don't take self, associated types and consts.

First of all, thank you very much for all those responses. I will re-read all your posts carefully again, but wanted to give some feedback already.

I would concur with @Yandros that my current approach is likely bad: I use inhabited types, yet wrap them in PhantomData, combining "the worst of both worlds" 🫤. It doesn't make much sense.

I'm inclined to go the following way:

 pub struct Db<K: ?Sized, V: ?Sized, C> {
     key: PhantomData<fn(&K) -> &K>,
     value: PhantomData<fn(&V) -> &V>,
-    constraint: PhantomData<fn(C) -> C>,
+    constraint: C,
     backend: ArcByAddr<DbBackend>,
 }

 /// Constraints on database (type argument `C` to [`DbOptions`] and [`Db`])
-pub trait Constraint: 'static {
+pub trait Constraint: Copy + 'static {
     /// Duplicate keys allowed?
-    const DUPLICATE_KEYS: bool;
+    fn has_duplicate_keys(&self) -> bool;
 }

 /// Type argument to [`DbOptions`] and [`Db`] indicating unique keys
+#[derive(Clone, Copy)]
 pub struct KeysUnique;
 
 impl Constraint for KeysUnique {
-    const DUPLICATE_KEYS: bool = false;
+    fn has_duplicate_keys(&self) -> bool {
+        false
+    }
 }
 
 /// Type argument to [`DbOptions`] and [`Db`] indicating non-unique keys
+#[derive(Clone, Copy)]
 pub struct KeysDuplicate;
 
 impl Constraint for KeysDuplicate {
-    const DUPLICATE_KEYS: bool = true;
+    fn has_duplicate_keys(&self) -> bool {
+        true
+    }
 }

And then I have to adjust a couple of parts in the rest of the library as well. Mostly that will mean adding some C: Constraint bounds and replacing PhantomData with either a concrete value or a copy taken from some other constraint (e.g. from the builder).

Note that I replaced the constant DUPLICATE_KEYS with a method has_duplicate_keys(&self). Not sure if that makes sense or whether this has any advantages or disadvantages. I guess it makes the implementation more flexible in case I change something about it in the future.

Interestingly, using Clone instead of Copy doesn't work in some places as I would have trouble in these functions then:

error[E0493]: destructors cannot be evaluated at compile-time
   --> src/lib.rs:764:30
    |
764 |     pub const fn keys_unique(mut self) -> DbOptions<K, V, KeysUnique, ()> {
    |                              ^^^^^^^^ constant functions cannot evaluate destructors
...
774 |     }
    |     - value is dropped here

impl<K: ?Sized, V: ?Sized, C> DbOptions<K, V, C, ()>
where
    C: Constraint,
{
    /// Clear `dupsort` option (also clears `reversedup` option)
    pub const fn keys_unique(mut self) -> DbOptions<K, V, KeysUnique, ()> {
        flag_set!(self, MDB_DUPSORT, false);
        flag_set!(self, MDB_REVERSEDUP, false);
        DbOptions {
            constraint: KeysUnique,
            key: PhantomData,
            value: PhantomData,
            lmdb_flags: self.lmdb_flags,
            name: self.name,
        }
    }
    /* … */
}

Making Constraint: Copy solves this. If I understand correctly, Copy ensures that there is no destructor involved, which is a requirement for keys_unique to be a const fn. So I would stick with Constraint: Copy if I go this way.
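
A stripped-down sketch of this situation, using a hypothetical `Options` struct in place of `DbOptions` and an illustrative `DUPSORT_FLAG` value (neither is the real mmtkvdb code):

```rust
// With `Copy` as a supertrait, no implementor can have a destructor,
// because a type cannot implement both `Copy` and `Drop`.
pub trait Constraint: Copy + 'static {}

#[derive(Clone, Copy)]
pub struct KeysUnique;
#[derive(Clone, Copy)]
pub struct KeysDuplicate;

impl Constraint for KeysUnique {}
impl Constraint for KeysDuplicate {}

// Illustrative flag value, not taken from the native library.
pub const DUPSORT_FLAG: u32 = 0x04;

// Hypothetical, simplified builder type.
pub struct Options<C> {
    pub constraint: C,
    pub flags: u32,
}

impl<C: Constraint> Options<C> {
    // `self` goes out of scope at the end of this function. Since `C: Copy`,
    // the compiler knows no destructor has to run, so this may be `const`.
    pub const fn keys_unique(self) -> Options<KeysUnique> {
        Options {
            constraint: KeysUnique,
            flags: self.flags & !DUPSORT_FLAG,
        }
    }
}

// The builder step can be evaluated at compile time.
pub const UNIQUE: Options<KeysUnique> = Options {
    constraint: KeysDuplicate,
    flags: DUPSORT_FLAG,
}
.keys_unique();
```

If the supertrait were `Clone` instead of `Copy`, an implementor could still have a `Drop` impl, which is what triggers error E0493 in the generic `const fn`.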

Anyway, I'm not sure if this is the best. Using a unit-like type and omitting PhantomData enables me to store information in values of that type in the future. But the idea here is to have compile-time information only. So I liked what @Yandros said here:

In my case, the user usually doesn't deal with these types directly. Instead, functions like keys_unique above use the typestate builder pattern to change the type where needed. Making these types empty enums would emphasize that they only exist on the type-level.

Overall, the unit-like approach seems to be flexible for some (hypothetical) future changes where I want to keep additional information stored in the value-world rather than the type-world only. The zero-variant enum approach instead explicitly reinforces/clarifies that we're playing in type-world only.

I feel torn. Yet I feel like having a better basis to make a decision. I'll think about it some more.

This might not be an issue for me, because the type is modified through typestate builders. But I see how passing the information through a value might feel more natural in other cases.

I would claim that I can make the logic live "on" those marker types also when I use zero-variant enums: I could still have constants and associated functions, right? The "dispatching" would happen at compile-time though (which should also be the case when I use methods on &self with unit-like types, unless I use dyn, I think).

Thanks already for all your posts and considerations.


P.S.

Maybe using zero-variant enums to first force them to be uninhabited and then using PhantomData to "undo" that choice is kinda awkward indeed, and it might indicate something is "wrong" with that approach. Perhaps zero-variant enums are best used when their explicit property of not having any value is needed (I also remember having read that recommendation somewhere in the past, warning against the overuse of empty enums).

This also encourages me to get rid of PhantomData here and stay with unit-like structs.

A potential disadvantage is that it won't be possible to use the "has duplicate keys" property in the type system (whenever associated constants can be used as const arguments), but I don't know how likely it is that somebody would want that.

Alright, so how does this work* without PhantomData?

struct ArrayBy<S: SizeArray>
where
    [f64; <S as SizeArray>::ARRAY_SIZE]: Sized,
{
    array: [f64; <S as SizeArray>::ARRAY_SIZE],
}

*on nightly with #![feature(generic_const_exprs)], but still... playground link

Edit: even using a marker/phantom ZST with a trait for associated types doesn't require it: playground

There's a common misuse of empty enums to define new external pointer types in FFI, which is undefined behavior; you might be thinking of that.

Hmmmm, I see. On the other hand, I see the downside that my Constraint trait then can't be turned into a dyn object anymore, which I might need to address another problem in my library that is still open.

I'll go play a bit with the has_duplicate_keys(&self) approach and see if I can make use of being able to turn Constraint into an object.

Edit: I can't make Constraint into an object if it is Clone or Copy (Playground).

Maybe it was that, but I think it was something different. What you mentioned is this, I think?

Empty Types

[…]

We recommend against modelling C's void* type with *const Void. A lot of people started doing that but quickly ran into trouble because Rust doesn't really have any safety guards against trying to instantiate empty types with unsafe code, and if you do it, it's Undefined Behaviour. This was especially problematic because developers had a habit of converting raw pointers to references and &Void is also Undefined Behaviour to construct.

*const () (or equivalent) works reasonably well for void*, and can be made into a reference without any safety problems. It still doesn't prevent you from trying to read or write values, but at least it compiles to a no-op instead of UB.

Maybe I vaguely remembered this as, "be careful with empty enums".

I think when I work only on the type-level, it's really a matter of preference (or aesthetics), and I can use either. I feel like the unit-like types are more straightforward and the "intended" concept to use (and don't require PhantomData "tricks").

I just remembered the typenum crate (which is partially obsolete now because of const generics, I assume?), and it also uses structs instead of empty enums. Let's take a look at the Bit trait they use for those:

pub trait Bit: Sealed + Copy + Default + 'static {
    const U8: u8;
    const BOOL: bool;

    fn new() -> Self;
    fn to_u8() -> u8;
    fn to_bool() -> bool;
}

I also ended up with Copy + 'static. Maybe I should add Default too, which might make handling easier in some cases. Not sure about making my Constraint trait sealed though.

Note that the methods to_u8 and to_bool don't work on &self but are associated functions.

The semantics of PhantomData<T> are "the containing type behaves 'as if' it contained an instance of type T, even though no actual instance is provided". If you are familiar with type theory, the usual struct is essentially a product type, while a PhantomData is an existential type: it guarantees that some value exists, but doesn't specify which one. In the lingo of homotopy type theory, PhantomData<T> is a propositional truncation of T.

This makes PhantomData<Never> something like an oxymoron. A propositional truncation of an uninhabited type is still uninhabited, so I would expect that a struct containing PhantomData<T> for an uninhabited T is uninhabited as well. In other words, if our struct behaves 'as if' it has an instance of Never, then it is uninhabited itself. Of course, the Rust compiler disagrees, which makes PhantomData into more of a hack than a theoretically sound construct.

It's not particularly surprising, since Rust isn't built in any way around categorical semantics. It doesn't even have a proper never type. Still, this confusion is, for me, a very strong argument to avoid PhantomData unless absolutely necessary, and certainly never use PhantomData<Never> since it is a mongrel type with a contradictory definition.

I'm not familiar with type theory, but I think I can follow the overall outline of your argument. In short: If struct S(PhantomData<Never>) really behaved "as if" it contained a Never, then it should be uninhabited.

I'll likely refrain from using the enum approach and go for unit-like structs.

The only thing I'm undecided about is whether to use consts, associated functions, or normal methods on the Constraint trait.

This makes me think I should stick with const DUPLICATE_KEYS instead of fn has_duplicate_keys(&self).
