Newbie: How to deal with object (non-)uniqueness when using HashMap?

It seems to me that Rust handles object uniqueness differently than other languages, at least the ones I've used. Here's an example:

use std::collections::HashMap;

#[derive(Hash, Eq, PartialEq, Debug)]
struct TypeA {a: usize}

#[derive(Hash, Eq, PartialEq, Debug)]
struct TypeB {b: usize}

fn main() {
    let mut map: HashMap<&TypeA, &TypeB> = HashMap::new();
    let a0 = TypeA { a: 1 };
    let b0 = TypeB { b: 111 };
    map.insert(&a0, &b0);

    let a1 = TypeA { a: 1 };
    let b1 = TypeB { b: 222 };
    map.insert(&a1, &b1);

    let b_for_a0 = map.get(&a0);
    println!("{:?}", b_for_a0); // gives Some(TypeB { b: 222 }) - not the expected Some(TypeB { b: 111 })...
}

In the hashmap above, I associate item a0 with b0. Nontheless, map.get(&a0) returns &b1.
This differs from other major langugages that I've used, where the "object identity" for the object a0 is kept despite it having the same data content as object a1.
(If I change a1 to NOT have the same data content as a0, everything works as expected.)

I guess that this has its roots in how Rust handles object equalities. Is that so?
What is the Rust idiomatic way of dealing with this? Do I have to add a random uuid or something to objects of TypeA to make it possible to distuingish between them?

Yes. Rust does not have “object identity”. This is closely connected with the ideas of ownership and the movability of data. A value (instance) of a type is nothing but the data you specify (here: one usize) and is not inextricably associated with the address of the memory that contains it (which is the usual foundation of “object identity” in languages that have that concept), because it can be moved from one address to another.

Also, Rust references (&TypeA) generally behave like their referent, including for comparison. You can compare addresses by converting references to raw pointers, whose Eq and Hash implementations use only the address — you would get the expected result if you had a HashMap<*const TypeA, &TypeB>, but only so long as you did not move a0 or a1 elsewhere.

It depends:

  • “Object identity” generally only matters intrinsically (as opposed to choosing to make use of it for something) when you have shared mutability such that you care about whether modifying the referent of one pointer will also affect others. If you were doing that in Rust, then you will have already have something like Rc<RefCell<T>> and, if you want object identity behavior, you can write a wrapper type around Rc that, when compared, will call Rc::ptr_eq(), and when hashed, will hash the address.

  • However, if what you want is values that are unique identities, and there is no mutation involved (or sometimes even if there is), then you should just assign them globally unique numbers; this is cheaper than Rcs. An AtomicU64 global counter is sufficient; there is no need for UUIDs unless the data is going to be sent outside your process.

  • In other cases you might use numbers that are simply the indices of the values in some vector.

  • And if you aren't writing a system that intrinsically needs an “identity”, then try not having any.

If you're not sure how your use case fits here, then describe it in more detail.

7 Likes

Yes. Rust’s default model treats compound values the same way as basic values, like integers, in that two instances are considered equivalent if their contents are indistinguishable.

Languages that have an inbuilt notion of object identity usually implement that by referring to the memory address where the value is stored. This option is not available to Rust in the general case, because objects are getting moved around in memory all of the time.

6 Likes

In addition to the moving that was mentioned, Rust has zero-sized types and erased types with no reliable way to perform pointer-based identity, which are also barriers to using addresses as identities in the general case.

5 Likes

I've needed this in several of my projects, and have settled into including a module like this to support it:

use std::hash::{Hash, Hasher};

/// Guaranteed-unique identity value
#[derive(Copy,Clone,Debug,Eq,PartialEq,Hash,Ord,PartialOrd)]
pub struct Tag(std::num::NonZeroU64);

impl Tag {
    pub fn new()->Self {
        use std::sync::atomic::{AtomicU64, Ordering::Relaxed};
        static NEXT: AtomicU64 = AtomicU64::new(1);
        Tag(NEXT.fetch_add(1, Relaxed).try_into().unwrap())
    }

    pub fn tag<T>(self, data:T)->Tagged<T> {
        Tagged { tag: self, data }
    }
}

impl Default for Tag {
    fn default()->Self { Self::new() }
}

/// Identity container for arbitrary types
///
/// Equality and comparison is based *only* on the given tag
#[derive(Copy,Clone,Debug)]
pub struct Tagged<T:?Sized> {
    tag: Tag,
    pub data: T,
}

impl<T:?Sized> Tagged<T> {
    pub fn tag(&self)->Tag { self.tag }
    pub fn check_tag(&self, tag:Tag)->Option<&T> {
        if tag == self.tag { Some(&self.data) }
        else { None }
    }
    pub fn check_tag_mut(&mut self, tag:Tag)->Option<&mut T> {
        if tag == self.tag { Some(&mut self.data) }
        else { None }
    }
    pub fn into_inner(self)->T where T:Sized { self.data }
}

impl<T> From<T> for Tagged<T> {
    fn from(data:T)->Self {
        Tagged { data, tag: Tag::new() }
    }
}

impl<T:?Sized> Eq for Tagged<T> {}

impl<T:?Sized, U:?Sized> PartialEq<Tagged<U>> for Tagged<T> {
    fn eq(&self, rhs: &Tagged<U>) -> bool {
        Tag::eq(&self.tag, &rhs.tag)
    }
}

impl<T:?Sized> Ord for Tagged<T> {
    fn cmp(&self, rhs: &Self) -> std::cmp::Ordering {
        Tag::cmp(&self.tag, &rhs.tag)
    }
}

impl<T:?Sized, U:?Sized> PartialOrd<Tagged<U>> for Tagged<T> {
    fn partial_cmp(&self, rhs: &Tagged<U>) -> Option<std::cmp::Ordering> {
        Tag::partial_cmp(&self.tag, &rhs.tag)
    }
}

impl<T:?Sized> Hash for Tagged<T> {
    fn hash<H>(&self, hasher: &mut H) where H: Hasher {
        Tag::hash(&self.tag, hasher)
    }
}

impl<T:?Sized> std::borrow::Borrow<Tag> for Tagged<T> {
    fn borrow(&self)->&Tag {
        &self.tag
    }
}
3 Likes

Thank you guys for very informative answers!

Hm, I guess this is highlighted in the rust book and elsewhere, but I somehow missed it.

When it comes to a solution for my example case, I'm trying the following. As AtomicUsize doesn't implement the Hash trait, I'm casting it to a usize field of the TypeA type. Does this look as a sound solution?

use std::{collections::HashMap, sync::atomic::{AtomicUsize, Ordering}};

static ID_COUNTER: AtomicUsize = AtomicUsize::new(0);
fn get_unique_id() -> usize {
    ID_COUNTER.fetch_add(1, Ordering::SeqCst);
    ID_COUNTER.load(Ordering::SeqCst).into()
}

#[derive(Hash, Eq, PartialEq, Debug)]
struct TypeA {
    id: usize,
    a: usize,
}

impl TypeA {
    fn new(a: usize) -> Self {
        Self {
            id: get_unique_id(),
            a,
        }
    }
}

Thanks @2e71828 , will have a look into this!

fetch_add() returns a usize (the value before adding), so a separate load should not be necessary. load() also returns a usize, so the .into() is redundant.

4 Likes

The separate load is not only unnecessary, it introduces a bug into the algorithm. If this code is running on two threads concurrently, you can get this sequence:

  1. initial: ID_COUNTER = x
  2. thread A increments, ID_COUNTER = x+1
  3. thread B increments, ID_COUNTER = x+2
  4. thread A loads x+2 from the counter
  5. thread B loads x+2 from the counter

And now, you have two duplicate "unique" IDs.

6 Likes

Ah, thanks. That makes it even simpler:

static ID_COUNTER: AtomicUsize = AtomicUsize::new(0);

#[derive(Hash, Eq, PartialEq, Debug)]
struct TypeA {
    id: usize,
    a: usize,
}

impl TypeA {
    fn new(a: usize) -> Self {
        Self {
            id: ID_COUNTER.fetch_add(1, Ordering::Relaxed),
            a,
        }
    }
}

By the way: you don't need Ordering::SeqCst in your case.

Memory orderings stronger than Relaxed are for synchronization of accesses to other memory besides the atomic variable. You need them when implementing mutexes, semaphores, reference counters or similar data structures.

In your use case, you only need the guarantee that no ID_COUNTER value is used twice. This guarantee is already fulfilled by Ordering::Relaxed:

impl TypeA {
    fn new(a: usize) -> Self {
        Self {
            id: ID_COUNTER.fetch_add(1, Ordering::Relaxed),
            a,
        }
    }
}
5 Likes

Thanx, @cg909! I've updated my latest solution above.