How to share a global-ish state between several structs?

To avoid making this an XY problem, here's some context first:

I'm writing a Rust program that manipulates math expressions. The expressions are stored in a tree, where each level is a different struct.

Here's an oversimplified example:

struct Sum(Vec<Product>);

struct Product(Vec<Factor>);

enum Factor{
    Number(i64),
    Variable(char),
    Parens(Sum)
}

What I'm trying to do is create unique IDs for each unique instance, such that equal (as in, having the same set of children, regardless of order!) structs get the same ID.

I came up with the following:

I created a HashMap<{set of children's IDs}, ID>, and stored an ID in each of my structs.

The problem I'm facing now is how to share the reference to the map with each struct.

  • I could put the map into a static variable. But I feel like that would make it "too global" -- it would lock all instances of expressions into using the same map.
  • I could store references to the map (or Rcs) next to each ID, inside each struct. That sounds wasteful, and doesn't prevent "mixing" two ID systems (i.e. maps) in the same expression.
  • I could make an IDManager trait and make my structs generic over it. That feels the closest to what I imagined, but generics are generic over implementations of a trait, not instances of a struct.

Did I miss any ways to make this work? Which of my ideas would work better? Is there any way to encode this in the type system somehow? Is there a different system that would be easier to implement?

Is another option to pass the map as a parameter to the operations (functions or methods) that act on them (edit: on the structs, I mean)? Parameter passing is often the simplest and best thing for sharing an object.

1 Like

Yes, it's possible.

It still doesn't prevent misuse (it does not prevent passing a different map than the one that contains the existing IDs). But that's a common problem in almost all of my attempts, so maybe I'll just have to live with that. (after all, in most other languages I'd have half as many safeguards, and programs written in them do work)

But I will wait a little longer and also think in the meantime...

Yeah, an EvaluationContext is probably the way to go if you want to assign unique IDs. But...

this tells me that you're asking for content-addressable storage of subexpressions so that you can get Common Subexpression Elimination for free. Which means you need a high-quality hash.

Try looking for solutions using those two ideas :slight_smile:

An EvaluationContext probably wants to hold a reference to your Storage. And then you'll need garbage collection, scan roots, etc.

Rowan, the syntax tree implementation used by rust-analyzer, uses a similar scheme, though a bit more restricted (in that order matters and it only tracks "small" nodes). If it's not necessary for correctness, then rowan's approach of making all node construction go through a cache and permitting reusing the cache whenever nodes get constructed is probably the best approach. If it's necessary for correctness (i.e. it's not just a GVN style optimization), then the best bet is probably to make nodes know what cache they belong to (either via reference or numbering them as well, potentially packed into the node id) and panicking if caches mingle.

If you absolutely must tie to a single instance of a cache, you can use a generative branded lifetime. But doing so is extremely constraining (even moreso than storing a reference to the cache), so is almost certainly not what you actually want to do.

2 Likes

I took quite some time researching the terms and concepts thrown in here...

I think, @CAD97's generative lifetime branding thing combined with passing the map as an argument, as suggested by @jumpnbrownweasel may just end up working for me.
I implemented it, and so far I have no issues (though I haven't integrated it into the rest of my project yet).

What do you mean by that? I don't see how it would constrain me, but maybe I'm missing something. Like, if I want to transfer an expression into a different cache, I can just copy the fields of the struct over and recompute the ID?

Btw, the use of lifetimes for this is a brilliant, if hacky, solution! I was trying to come up with a way to encode this in the type system, but I would have never thought of using lifetimes. Your crate has one more dependent now :wink:


Anyway, here is my current impementation:

use std::cell::OnceCell;
use std::collections::HashMap;
use std::num::NonZeroU64;
use std::ops::{Deref, DerefMut};

use generativity::{ make_guard, Guard, Id as CacheId };

pub struct Cache<'cache_id> {
    id: CacheId<'cache_id>,
    map: HashMap<Signature, NonZeroU64>,
    autoincrement: NonZeroU64,
}
impl<'cache_id> Cache<'cache_id>{
    pub fn new(guard: Guard<'cache_id>) -> Self {
        Self{
            id: guard.into(),
            autoincrement: unsafe{ NonZeroU64::new_unchecked(1) },
            map: HashMap::new(),
        }
    }
    pub fn get_id(&mut self, signature: Signature) -> ID<'cache_id> {
        ID(
            self.id,
            *self.map.entry(signature).or_insert_with(|| { 
                let val = self.autoincrement;
                self.autoincrement = self.autoincrement.checked_add(1).expect("we ran out of IDs");
                val
            })
        )
    }
}

pub trait GenerateID<'cache_id>{
    fn id(&self, cache: &mut Cache<'cache_id>) -> ID<'cache_id>;
}

#[derive(Clone, Copy)]
#[repr(transparent)]
pub struct ID<'cache_id>(CacheId<'cache_id>, NonZeroU64);

#[derive(PartialEq, Eq, Hash)]
pub enum Signature{
    OneOfTheStructs(u64),
}

struct OneOfTheStructs(u64);

impl<'cache_id> GenerateID<'cache_id> for OneOfTheStructs{
    fn id(&self, cache: &mut Cache<'cache_id>) -> ID<'cache_id> {
        let &OneOfTheStructs(x) = self;
        cache.get_id(Signature::OneOfTheStructs(x)) 
    }
}

Thanks to everyone for the helpful ideas!

I think the hash approach is the true "correct" way forward here. It should be possible to design a good hash function over your expressions - given the order invariance, maybe you could construct a canonical ordering for your data type and simply sort it on the way in?

An alternate idea is to make sums and products have their hashes be the (wrapping) sums and products of their constituent elements, though I think that may cause collisions. Maybe you could do it all in some magic finite field of the same cardinality as u64 to make the hash better-behaved.

Well, yeah, that's where my mind went first as well.

But having the possibility of hash collisions led me to think that after checking the hashes, I would have to check the equality anyway, ie:

fn eq(ex1, ex2) -> bool {
    ex1.hash == ex2.hash && recursively_check_is_eq(ex1, ex2)
}

...where recursively_check_is_eq would be relatively expensive to compute.

So I tried to come up with a way that would guarantee me that there are no collisions, so I wouldn't have to check.

If you want an easy solution, use Hash as an id:

#[derive(Eq, PartialEq, Hash, Ord, PartialOrd)]
struct Sum(BTreeSet<Product>);

#[derive(Eq, PartialEq, Hash, Ord, PartialOrd)]
struct Product(BTreeSet<Factor>);

#[derive(Eq, PartialEq, Hash, Ord, PartialOrd)]
enum Factor{
    Number(i64),
    Variable(char),
    Parens(Sum)
}

That, in itself, won't work for me. I need the ID to be recursively order-independent.
I did think of order-independent hashes though, see @claytonwramsey's suggestion and my reply.