HashSet<String> intersection HashSet<&str>

Hi,

sorry for a beginner’s question, but I’m currently struggling with the following code:

use std::collections::HashSet;

fn main() {
    let mut s1 = HashSet::<String>::new();
    s1.insert("foo".to_string());
    s1.insert("bar".to_string());
    
    let v = vec!["foo", "baz"];
    let mut s2 = v.iter().cloned().collect::<HashSet<&str>>();
    
    println!("{:?}", s1.get(v[0])); //works
    println!("{:?}", s1.intersection(&s2)); // doesn't work
}

Is it possible to get the intersection without converting references in s2 to owned values i.e. Strings or vice versa? Thanks a lot.

The obvious solution would be to conert the HashSet<String> to HashSet<&str> Otherwise this is not possible, because intersection expects both T to be equal.

println!(
        "{:?}",
        s1.iter()
            .map(|s| s.as_str())
            .collect::<HashSet<_>>()
            .intersection(&s2)
    );

Thanks @hellow that’s what I came up with too. There were two reasons why I asked this question:

  1. if there is some way of implicit handling of this case (as it seems in case of get)
  2. if I can use the same instance of HashSet for both owning the data (Strings) and peforming set operations (intersection, …)

but it seems that the answer for both questions is no.

You can filter with a .contains() closure. This is what the current implementation actually does:

use ::std::collections::HashSet;

fn main ()
{
    let s1: HashSet<String> =
        ["foo", "bar"]
            .iter().cloned()
            .map(ToString::to_string)
            .collect()
    ;
    let array = ["foo", "baz"];
    let intersection: HashSet<&str> =
        array
            .iter().cloned()
            .filter(|&s| s1.contains(s)) // intersection magic
            .collect()
    ;
    println!("{:?}", s1.get(array[0]));
    println!("{:?}", intersection);
}
3 Likes

Thanks @Yandros, this seems fine performance-wise, just from the ergonomic point of view one must manually reimplement all the set operations. One “natural” solution would be to convert collection of T to HashSet of &T once and use it later on, but that leads to self referential struct problem in trivial case.
I wonder if it is so uncommon usecase to work with both collections of T and &T?

If you plan on having both T and &'a T around (specially with 'a = 'static), I suggest you use Cow<'a, T> as the type of the keys of the set:

use ::std::{
    borrow::Cow,
    collections::HashSet,
};

fn main ()
{
    type Str = Cow<'static, str>;
    let s1: HashSet<Str> = 
        vec![String::from("foo"), String::from("bar")] // some Strings
            .into_iter()
            .map(Str::from)
            .collect()
    ;
    let array = ["foo", "baz"];
    let s2: HashSet<Str> =
        array
            .iter().cloned()
            .map(Str::from)
            .collect()
    ;
    println!("{:?}", s1.get(array[0])); // works
    println!("{:?}", s1.intersection(&s2)); // works
    println!("{:?}", &s1 & &s2); // works
}

This looks neat, thanks! I’ll try this in my usecase.

It works! The following is the intended logical workflow (might be perhaps useful for others). The only thing I’m not quite sure about is whether the Cow can’t cause unnecessary allocations.

use std::{
    borrow::Cow,
    collections::HashSet,
};

type Str<'a> = Cow<'a, str>;

fn read_long_lasting_hashset<'a>() -> HashSet<Str<'a>> {
        ["foo", "bar"]
            .iter().cloned()
            .map(Str::from)
            .collect()
}

fn read_short_lasting_data() -> Vec<String> {
        ["foo", "baz"]
            .iter().cloned()
            .map(str::to_owned)
            .collect()
}

fn make_short_lasting_hashset(v: &[String]) ->HashSet<Str> {
    v.iter().cloned()
    .map(Str::from)
    .collect()
}

fn main ()
{
    let s1 = read_long_lasting_hashset();
    let n = 2;
    for i in 0..n {
        let v = read_short_lasting_data(); 
        let s2 = make_short_lasting_hashset(&v);
        println!("{:?}: {:?}", i, &s1 & &s2); // works
    }
}

Nice :slight_smile:

Here come a few comments / tips about your code :wink:

In Rust, having a function <'a>() -> ...<'a> is not idiomatic (i.e., having a lifetime parameter missing from the inputs), since the lifetime parameters are chosen by the caller. This means that this kind of function signatures read like this: for any lifetime parameter 'a I am given, no matter how big it is (e.g., it can be 'static), my implementation can return something that lives at least as long as 'a. This, in practice, means that the implementation is producing items that live at least as long as 'static (which can usually subtype to shorter lifetimes if needed thanks to (co)variance).

So, this kind of function is actually () -> ...<'static>.

fn read_long_lasting_hashset () -> HashSet<Str<'static>>
{
    ["foo", "bar"]
        .iter().cloned() // .copied()
        .map(Str::from) // Cow does not allocate (it just wraps the given pointer)
        .collect()
}

Cow is defined as an enum, so it is just wrapping either a given reference, or a given owned value. So the data it wraps is only allocated if it was given an allocated String to begin with (or when attempting mutation, on the strings themselves, but that does not seem to be the case here).

So, for instance,

fn make_short_lasting_hashset<'a> (v: &'a [String]) -> HashSet<Str<'a>>
{
    v   .iter().cloned()
        .map(Str::from) // this is just taking ownership of the allocated String
        .collect()
}

but there is indeed a case of unnecessary cloning / allocation: when you .iter().cloned(), each item of the slice is cloned. This is fine for small Copy types, such as &str, but it incurs a non-negligeable overhead when doing it with Strings. Rust will allow very soon to use .copied() instead of .cloned() for the cheap case, which will help avoid confusion when dealing with this kind of situations.

The general solution is to just get rid of the .cloned() call:

fn make_short_lasting_hashset<'a> (v: &'a [String]) -> HashSet<Str<'a>>
{
    v   .iter()
        .map(Str::from)
        .collect()
}

but this requires carrying the lifetime 'a of the borrow always around: the code becomes more complicated than with 'a = 'static.

And given your example, you most probably intended to take ownership of the Strings created by read_short_lasting_data():

fn make_short_lasting_hashset (v: Vec<String>) -> HashSet< Str<'static> >
{
    v   .into_iter()
        .map(Str::from) // takes ownership of each `String`
        .collect()
}
  • or, for a more generic version:

    fn make_short_lasting_hashset (
        v: impl IntoIterator<Item = String>, // able to yield owned strings
    ) -> HashSet< Str<'static> >
    {
        v   .into_iter()
            .map(Str::from)
            .collect()
    }
    
  • Playground

2 Likes

Wov! Thanks a lot for such a detailed explanation!

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.