std::marker::PhantomData and unused fields in structs

I'm trying to write the stretch goals of chapter 13 in the book

In particular:

To fix this issue, try introducing more generic parameters to increase the flexibility of the Cacher functionality.

I have the following definition of Cacher, which I was pretty happy with. I made it generic over functions with different input and output types, but there's a problem. This code won't compile:

pub struct Cacher<F,U,V>
    where F: Fn(U) -> V, U: Copy
{   
    calculation: F,
    value: Option<V>,
}

impl<F,U,V> Cacher<F,U,V>
    where F: Fn(U) -> V, U: Copy
{
    pub fn new(calculation: F) -> Cacher<F,U,V> {
        Cacher {
            calculation,
            value: None,
        }
    }

    pub fn get(&mut self, arg: U) -> V {
        match self.value {
            Some(v) => v,
            None => {
                let v = (self.calculation)(arg);
                self.value = Some(v);
                v
            }
        }
    }
}

compilation error:

error[E0392]: parameter `U` is never used
 --> src/structs.rs:3:21
  |
3 | pub struct Cacher<F,U,V>
  |                     ^ unused parameter
  |
  = help: consider removing `U`, referring to it in a field, or using a marker such as `std::marker::PhantomData`

Which leads me to std::marker::PhantomData. From reading this, I gather that PhantomData is used to guide the type-checker where it can't "figure out stuff" on its own.

But it also led me to believe that the PhantomData isn't really there at all. In the examples, struct fields are prefixed with _, which usually means ignore this bit.

So my question (finally, thanks for reading this far) is: why does the compiler enforce that PhantomData fields be defined when making the struct?

use std::marker::PhantomData;

pub struct Cacher<F,U,V>
    where F: Fn(U) -> V, U: Copy
{
    calculation: F,
    value: Option<V>,
    _phantom: PhantomData<U>,
}

gives the error:

error[E0063]: missing field `_phantom` in initializer of `structs::Cacher<_, _, _>`
  --> src/structs.rs:15:9
   |
15 |         Cacher {
   |         ^^^^^^ missing `_phantom`

when using Cacher like so:

fn generate_workout(intensity: u32, random_number: u32) {
    let mut cacher = Cacher::new(simulated_expensive_calculation);
    // I don't want to pass in _phantom here ^ :(

Am I missing something? Can I tell the compiler to ignore the "missing" type parameter U, which I'm using in the input to the closure F?

When you use PhantomData as a value all you have to do is this

let x: Cacher<bool, u8> = Cacher::new(Some(0_u32));

I did manage to get this to compile in the end. Does anyone else find that the process of asking the question well on a public forum organizes their thoughts well enough to solve the problem?

Anyway, the solution was to make Cacher::new supply some actual PhantomData which was a little weird but fair enough I guess. Then the new function didn't need additional parameters because they are "hardcoded" into the new method definition.

If you didn't have a new method you would still need to pass the data when building Cacher though... :frowning: Something like:

let cacher = Cacher {
    calculation: some_expensive_function,
    value: None,
    _phantom: PhantomData {},
};

which is ugly, but hey, the new method was a good idea anyway.

Cacher looks like this in my code now:

use std::marker::PhantomData;

pub struct Cacher<F,U,V>
    where F: Fn(U) -> V, V: Copy
{
    calculation: F,
    value: Option<V>,
    _phantom: PhantomData<U>,
}

impl<F,U,V> Cacher<F,U,V>
    where F: Fn(U) -> V, V: Copy
{
    pub fn new(calculation: F) -> Cacher<F,U,V> {
        Cacher {
            calculation,
            value: None,
            _phantom: PhantomData {},
        }
    }

    pub fn get(&mut self, arg: U) -> V {
        match self.value {
            Some(v) => v,
            None => {
                let v = (self.calculation)(arg);
                self.value = Some(v);
                v
            }
        }
    }
}

This seems like a hack to me, but from the RFC linked to in the SO answer maybe this will be prettier in the future? Maaaybe?

pub struct Cacher<F,U,V>
    where F: Fn(U) -> V, V: Copy
{
    calculation: F,
    value: Option<V>,
    phantom U
}

Thanks :slight_smile: but... I don't wanna. Luckily I found what the compiler was really complaining about: the new constructor didn't mention _phantom. I added it there, and the signature of new didn't need to change :slight_smile:

PhantomData is actually a unit struct, so it's like doing the following:

struct Foo;
let x: Foo = Foo;

Except that PhantomData is special in that it's the only unit struct that's allowed to have a generic parameter.
In fact, the compiler essentially does the following:

struct Foo;
const Foo: Foo = Foo {};

//And, if we could write this:
struct PhantomData<T>;
const<T> PhantomData: PhantomData<T> = PhantomData<T> {};
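
In practice this means the field can be filled in with the bare name, with no braces or arguments. A small sketch of the spellings that should all produce the same value; `Meters`, `Length`, and `phantom_spellings_agree` are hypothetical names made up for this example:

```rust
use std::marker::PhantomData;

// Hypothetical marker and wrapper types, for illustration only.
pub struct Meters;

pub struct Length<Unit> {
    pub value: f64,
    _unit: PhantomData<Unit>,
}

impl<Unit> Length<Unit> {
    pub fn new(value: f64) -> Self {
        // `PhantomData` is a unit-struct value, so the bare name is enough.
        Length { value, _unit: PhantomData }
    }
}

// Compares the three ways of writing the same zero-sized value.
pub fn phantom_spellings_agree() -> bool {
    let bare: PhantomData<Meters> = PhantomData;
    let braces: PhantomData<Meters> = PhantomData {};
    let default: PhantomData<Meters> = Default::default();
    bare == braces && braces == default
}
```

Callers of `Length::new` never see the marker field, which is the same effect the `Cacher::new` constructor gives.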

It really isn't a hack :wink: Sure, at first glance requiring an extra unused field just to carry a type parameter looks cumbersome and unneeded, but it turns out that such a construction is paramount as soon as the language offers subtyping and variance. These are quite advanced topics, for which you may find (my) one-post explanation here: Looking for a deeper understanding of PhantomData

The other option would have been not to have any variance whatsoever, and that would have been incredibly cumbersome. So it is a lesser inconvenience.

That being said, in your case you could get away without PhantomData, if you allowed "invalid" unusable Cachers to be constructed, by moving the bounds onto the get() function:
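
A minimal sketch of what that could look like. This is a guess at the shape of such code, not the original snippet: the struct drops `U` entirely, and the `Fn` and `Copy` bounds only appear on `get()`.

```rust
// No bounds on the struct, so no unused `U` parameter and no PhantomData.
pub struct Cacher<F, V> {
    calculation: F,
    value: Option<V>,
}

impl<F, V> Cacher<F, V> {
    // Anything can be stored here, even a non-callable `F`.
    pub fn new(calculation: F) -> Self {
        Cacher { calculation, value: None }
    }

    // The bounds live on the method, which introduces `U` itself.
    pub fn get<U>(&mut self, arg: U) -> V
    where
        F: Fn(U) -> V,
        V: Copy,
    {
        match self.value {
            Some(v) => v,
            None => {
                let v = (self.calculation)(arg);
                self.value = Some(v);
                v
            }
        }
    }
}
```

The downside is visible in the type system: `Cacher::new(0u8)` type-checks even though `get()` can never be called on the result.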

It may look simpler for the library writer, but imho having the Fn bound since the beginning like you did is way healthier for a user of the library :slight_smile:


Finally, here are a few tips:

  1. You can loosen V : Copy to V : Clone, and use an explicit .clone() call

  2. You can loosen F : Fn... to F : FnOnce since you will only be calling the function once. This gets tricky since you need owned access to the closure field to call an FnOnce, but your .get() method takes a self receiver by unique reference (&'_ mut Self). The trick to go from &mut T to T is to mem::replace the value with some arbitrary default sentinel, and in Rust the type that lets you do that in a very idiomatic way is Option, and its .take() method.

  3. Option offers a neat method / idiom for the "match an Option and on None compute" pattern: .get_or_insert_with

    pub
    struct Cacher<F, U, V>
    where
        F : FnOnce(U) -> V,
        V : Clone,
    {
        calculation: Option<F>,
        value: Option<V>,
        _phantom: ::std::marker::PhantomData<U>,
    }
    
    impl<F, U, V> Cacher<F, U, V>
    where
        F : FnOnce(U) -> V,
        V : Clone,
    {
        pub
        fn new (calculation: F) -> Cacher<F, U, V>
        {
            Cacher {
                calculation: Some(calculation),
                value: None,
                _phantom: Default::default(),
            }
        }
    
        pub
        fn get (self: &'_ mut Self, arg: U) -> V
        {
            let calculation = &mut self.calculation;
            self.value
                .get_or_insert_with(|| {
                    calculation.take().unwrap()(arg)
                })
                .clone()
        }
    }
    

The last one is just to show an example of the utilities present in Rust's most pervasive data structures. Know that the version with the match is just as good :slight_smile:

I would go a step further and do this, to make the most general api

pub struct Cacher<F, V> {
    calculation: Option<F>,
    value: Option<V>,
}

impl<F, V> Cacher<F, V>
{
    pub fn new (calculation: F) -> Self {
        Self {
            calculation: Some(calculation),
            value: None,
        }
    }

    pub fn get<U>(self: &mut Self, arg: U) -> &V
    where
        F : FnOnce(U) -> V,
    {
        let calculation = &mut self.calculation;
        self.value
            .get_or_insert_with(|| {
                calculation.take().unwrap()(arg)
            })
    }
}

edit:

We can be a bit more memory efficient by realizing that we never need to store F and V at the same time, like so
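
For readers without the playground, here is a rough sketch of the idea, not the original code. The names `State`, `Func`, `Value`, and `Empty` are assumptions chosen for this example. A single enum field holds either the closure or the cached value, and `mem::replace` swaps one for the other.

```rust
use std::mem;

// Hypothetical enum: at most one of the closure or the value is stored.
enum State<F, V> {
    Func(F),  // not yet computed: holds the closure
    Value(V), // computed: holds the cached result
    Empty,    // sentinel, only left behind if the closure panicked
}

pub struct Cacher<F, V> {
    state: State<F, V>,
}

impl<F, V> Cacher<F, V> {
    pub fn new(calculation: F) -> Self {
        Cacher { state: State::Func(calculation) }
    }

    pub fn get<U>(&mut self, arg: U) -> &V
    where
        F: FnOnce(U) -> V,
    {
        // Swap the state out for the sentinel so it is owned by value.
        self.state = match mem::replace(&mut self.state, State::Empty) {
            State::Func(f) => State::Value(f(arg)),
            State::Value(v) => State::Value(v),
            State::Empty => panic!("calculation panicked on an earlier call"),
        };
        match &self.state {
            State::Value(v) => v,
            _ => unreachable!("state is always `Value` at this point"),
        }
    }
}
```

The public API is the same as the two-`Option` version: `new` takes the closure and `get` returns a reference to the cached value.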

Only now grokking this... very clever!
At first I had no idea why you would want to have Option<F> and Option<V>.
My thought being surely rust wouldn't let you build a Cacher without them?

But for a "single-use" Cacher such as this one, without a HashMap to store multiple values, then yeah, storing the function that you'll never need after it's been used is a waste.

Looked at your memory efficient playground as well, is there much benefit using mem::replace to (I think?) "swap" the function for the value once it's calculated over using two fields with Options?
I guess rust will reserve memory for the calculation field even if it's not used, i.e. while Option is None
It's a little more complicated than anything I would have come up with, but exposes the same API :slight_smile: so just as easy to use!

Thanks for taking time to comment on this, the examples especially have been really helpful :slight_smile:

I guess it just felt like a hack, as I didn't really understand why a PhantomData was needed. I'm starting to get there, though the type theory around variance (contravariant, covariant) still eludes me. It's really not too cumbersome :slight_smile: I'm sure it's worth it! I'm firmly at the stage of understanding PhantomData as:

  • At runtime, PhantomData<T> is exactly like (), i.e., nothing.
    • This gives the peace of mind of it being (a) zero-cost (abstraction)
  • For compile-time analysis, PhantomData<T> is exactly like T
    • The utility of this may not be obvious at first glance, but it turns out that types ... <sometimes need this?>
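
The runtime half of that summary can be checked with `std::mem::size_of`. A small sketch, where `Plain`, `Tagged`, and `phantom_sizes` are hypothetical names invented for the example:

```rust
use std::marker::PhantomData;
use std::mem::size_of;

// Hypothetical types: identical except for the marker field.
#[allow(dead_code)]
pub struct Plain {
    id: u32,
}

#[allow(dead_code)]
pub struct Tagged<T> {
    id: u32,
    _marker: PhantomData<T>,
}

// Returns the sizes of the marker alone, the plain struct, and the tagged struct.
pub fn phantom_sizes() -> (usize, usize, usize) {
    (
        size_of::<PhantomData<String>>(),
        size_of::<Plain>(),
        size_of::<Tagged<String>>(),
    )
}
```

The marker itself takes zero bytes, so adding it does not make `Tagged<String>` any larger than `Plain`.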

:stuck_out_tongue:

Loosen V: Copy to V: Clone

Am I right in thinking that the motivation behind this is so that a library user who wanted to Cache Strings or whatnot has the option to? I guess it's bad taste to enforce only Copy if a user wants to Clone.

I really need to play around with this mem::replace(). Though I'm guessing it's mostly used in the background by the std library.
I hadn't heard of sentinels before, a good word to learn.
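
For anyone else playing around with it, a tiny sketch of the "move out from behind &mut by leaving a sentinel" trick. The helper names `take_name` and `take_boxed` are made up for this example:

```rust
use std::mem;

// Moves the String out from behind `&mut`, leaving an empty String as the sentinel.
pub fn take_name(slot: &mut String) -> String {
    mem::replace(slot, String::new())
}

// `Option::take` is the same move, with `None` as the sentinel.
pub fn take_boxed(slot: &mut Option<Box<u32>>) -> Option<Box<u32>> {
    slot.take()
}
```

In both cases the caller ends up owning the old value while the borrowed slot stays valid, which is what makes calling an FnOnce through `&mut self` possible.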

Both you and @KrishnaSannasi suggested Option for storing the calculation and value, glad to see there is some shared good practice getting around :slight_smile: - until you start lowering the memory footprint with clever use of enums :wink:

The memory savings are as follows. Let's say F is some zero-sized function and V = String:

State<F, V> == 32 bytes
(Option<F>, Option<V>) == 32 bytes

How about when F == String, same V = String

State<F, V> == 32 bytes
(Option<F>, Option<V>) == 48 bytes

How about when F == (String, String), same V = String

State<F, V> == 56 bytes
(Option<F>, Option<V>) == 72 bytes

So as you can see, as objects get larger, State<_, _> doesn't take up as much space as (Option<F>, Option<V>)

You can test this using the std::mem::size_of function
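
A sketch of such a check, assuming a three-variant `State` enum as a stand-in for the one in the playground (the variant names and the `sizes` helper are hypothetical). Rust does not guarantee the layout of these types, so the exact byte counts may differ from the ones listed above depending on the compiler version and target.

```rust
use std::mem::size_of;

// Hypothetical stand-in for the enum-based state.
#[allow(dead_code)]
pub enum State<F, V> {
    Func(F),
    Value(V),
    Empty,
}

// Returns (size of the enum, size of the pair of Options) for the given F and V.
pub fn sizes<F, V>() -> (usize, usize) {
    (
        size_of::<State<F, V>>(),
        size_of::<(Option<F>, Option<V>)>(),
    )
}
```

For example, `sizes::<String, String>()` compares the two layouts for the case where both the function and the value are the size of a String.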


Yes, Rust will allocate storage for the None case, even if unused. So you can explicitly tell Rust "no, I want to reuse the memory" by using an enum.


I just realized that I left a loop {} inside my last playground that should have been changed to a panic. Here is the fixed playground

I was curious about that loop! I figured it didn't matter too much since that shouldn't ever happen :stuck_out_tongue:

I was testing things out to see which unreachable panics would be removed, and used the loop {} to remove that panic!(). But I forgot to put the panic!() back, whoops. That branch is hit if the func panics, you catch that panic using std::panic::catch_unwind, and then try to run the cacher again.
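
To make that scenario concrete, here is a self-contained sketch. It is not the playground code: the `State` enum, its variant names, and the `panics_on_both_calls` helper are assumptions for illustration.

```rust
use std::mem;
use std::panic::{self, AssertUnwindSafe};

// Hypothetical enum-based state, with `Empty` as the sentinel.
enum State<F, V> {
    Func(F),
    Value(V),
    Empty,
}

pub struct Cacher<F, V> {
    state: State<F, V>,
}

impl<F, V> Cacher<F, V> {
    pub fn new(calculation: F) -> Self {
        Cacher { state: State::Func(calculation) }
    }

    pub fn get<U>(&mut self, arg: U) -> &V
    where
        F: FnOnce(U) -> V,
    {
        self.state = match mem::replace(&mut self.state, State::Empty) {
            // If `f` panics here, the state stays `Empty`.
            State::Func(f) => State::Value(f(arg)),
            State::Value(v) => State::Value(v),
            // Reached only when an earlier call unwound out of `f`.
            State::Empty => panic!("calculation panicked on an earlier call"),
        };
        match &self.state {
            State::Value(v) => v,
            _ => unreachable!(),
        }
    }
}

// Runs a panicking calculation twice, catching the unwind each time.
pub fn panics_on_both_calls() -> (bool, bool) {
    let mut cacher: Cacher<_, u32> =
        Cacher::new(|_: u32| -> u32 { panic!("calculation failed") });
    let first = panic::catch_unwind(AssertUnwindSafe(|| {
        cacher.get(1);
    }))
    .is_err();
    let second = panic::catch_unwind(AssertUnwindSafe(|| {
        cacher.get(2);
    }))
    .is_err();
    (first, second)
}
```

The first call panics inside the closure. The second call finds the sentinel and hits the `Empty` branch, which is why that branch needs a real panic rather than a `loop {}`.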

The results: I could never get Rust to remove the first unreachable on its own, but the second unreachable was consistently optimized out. But since that branch should only be run once, that shouldn't be an issue.
