HashSet, union, expecting HashSet<String>, got HashSet<&String>

  1. Here is the code:
    let lines: Vec<&str> = contents.split_whitespace().collect();

    let mut ans = HashSet::<String>::new();

    for i in lines.iter() {
        let f: String = read_all(*i);
        let t: HashSet<String> = f.split_whitespace().map(|s| s.to_string()).collect::<HashSet<String>>();
        ans = ans.union(&t).collect();
    }
  1. "contents" is a string of filenames, separated by \n.

  2. For each file, we want to read in the entire file, split by whitespace, and put all the words into a hashset.

  3. Then we want to union all the hashsets into one hashset.

  4. In short: We have a collection of files. We want a HashSet of all words.

  5. The problem I am running into is the union line. I am getting error:

   |
66 |         ans = ans.union(&t).collect();
   |                             ^^^^^^^ a collection of type `std::collections::HashSet<std::string::String>` cannot be built from `std::iter::Iterator<Item=&std::string::String>`
   |
   = help: the trait `std::iter::FromIterator<&std::string::String>` is not implemented for `std::collections::HashSet<std::string::String>`

How do I fix this?

You can use the cloned adapter on Iterator to turn the iterator of &String into an iterator of String.
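A minimal sketch of the cloned suggestion, assuming the two sets are passed in by reference. The helper name union_owned is made up for illustration and is not from the original code:

```rust
use std::collections::HashSet;

/// Hypothetical helper: returns the union of `ans` and `t` as owned `String`s.
fn union_owned(ans: &HashSet<String>, t: &HashSet<String>) -> HashSet<String> {
    // `union` yields `&String`; `cloned()` copies each item into an owned
    // `String`, which is what a `HashSet<String>` can be collected from.
    ans.union(t).cloned().collect()
}
```

Inside the original loop this would correspond to ans = ans.union(&t).cloned().collect(); note that it clones every element of both sets on each iteration.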


Well, as the error suggests, the iterator is an Iterator<Item=&String>, so you have two options:

  • Consume the source
  • Copy the data

If you're going to consume the source, I'll need a bit more info.
Otherwise (untested):

ans = ans.union(&t).map(|a| a.to_string()).collect();

Or, more idiomatically, use cloned as @RustyYato recommends.


How about ans.extend(t) -- turns t into an iterator by value, then inserts all items into ans.

If your items have some significant property beyond equality, you might want to take more care whether the old or new value is preserved, but for String the only possible difference is capacity.

edit: in this case, you don't really even need the interim t at all. Just:

ans.extend(f.split_whitespace().map(str::to_string))
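Put together, the extend approach might look like the sketch below. It assumes the file contents have already been read into strings (the original read_all is not shown, so it is left out), and the function name collect_words is made up for illustration:

```rust
use std::collections::HashSet;

/// Hypothetical helper: gathers every whitespace-separated word from all texts.
fn collect_words(texts: &[&str]) -> HashSet<String> {
    let mut ans = HashSet::new();
    for text in texts {
        // `extend` inserts each owned `String` straight into `ans`;
        // no interim per-file set and no union are needed.
        ans.extend(text.split_whitespace().map(str::to_string));
    }
    ans
}
```

Calling collect_words(&["a b", "b c"]) would give a set holding a, b and c.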

@RustyYato, @OptimisticPeach, @cuviper:

  1. I am happy to consume one of the two HashSets. I don't care which one.

  2. Is there some "structure" to HashSet which makes merging faster than inserting the elements of one into the other?

===

  1. My original question was horribly messy. Let me rephrase it as follows:
// We have two hashsets

let a: HashSet<String> = panic!();
let b: HashSet<String> = panic!();

We can consume one (or both) of the HashSets and merge it into a new HashSet. What is the best way to do this?

Probably use what @cuviper mentioned with extend.


Not really. The hashers for each set may be seeded differently, so items have to be rehashed to move to a different set.

Use the larger one as the base, and extend with the smaller one.
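A sketch of that advice, assuming both sets can be consumed. The function name merge is made up for illustration:

```rust
use std::collections::HashSet;

/// Hypothetical helper: consumes both sets and returns their union.
fn merge(mut a: HashSet<String>, mut b: HashSet<String>) -> HashSet<String> {
    // Make `a` the larger set so that fewer items have to be rehashed and moved.
    if a.len() < b.len() {
        std::mem::swap(&mut a, &mut b);
    }
    // `extend` takes `b` by value and inserts each of its items into `a`.
    a.extend(b);
    a
}
```

The swap only exchanges the two set handles, so picking the larger base costs nothing beyond the length comparison.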


@OptimisticPeach, @cuviper: extend it is; thanks for explaining the option space to me.

Now, suppose we have the following:

  1. We have 100 HashSets that are generated in parallel via rayon's par_iter

  2. I know about rayon::iter::ParallelExtend

  3. par_extend is implemented for HashSet

  4. However, this does NOT help us, right? It seems like the use case is:

"every iterator returns a String", so we can extend a HashSet

but our case is "every par iter returns a HashSet" ... and there's no way to extend this

[I have tried it, and I'm getting a compile error, which is suggesting the above]

You can reduce the parallel-generated sets using the same extend method.
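To illustrate the shape of that reduction without pulling in rayon, here is a std-only sketch that builds one set per input on scoped threads and then folds them together with extend. It is a stand-in, not rayon code: the names merge and words_in_parallel are made up, and the inputs are assumed to be already-read strings. With rayon, the same merge function could be passed to the parallel iterator's reduce, along the lines of .reduce(HashSet::new, merge).

```rust
use std::collections::HashSet;
use std::thread;

/// Hypothetical helper: extends the larger set with the smaller one.
fn merge(mut a: HashSet<String>, mut b: HashSet<String>) -> HashSet<String> {
    if a.len() < b.len() {
        std::mem::swap(&mut a, &mut b);
    }
    a.extend(b);
    a
}

/// Hypothetical helper: one word set per text, built in parallel, then reduced.
fn words_in_parallel(texts: &[&str]) -> HashSet<String> {
    thread::scope(|s| {
        // "Map" step: spawn one scoped thread per text, each producing its own set.
        let handles: Vec<_> = texts
            .iter()
            .map(|text| {
                s.spawn(move || {
                    text.split_whitespace()
                        .map(str::to_string)
                        .collect::<HashSet<String>>()
                })
            })
            .collect();

        // "Reduce" step: join the workers and merge their sets pairwise.
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .fold(HashSet::new(), merge)
    })
}
```

One thread per input is only reasonable for a handful of inputs; a thread pool such as rayon's handles the scheduling for larger batches.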


maybe like this?

let i = hs1.drain().chain(hs2.drain()).chain(hs3.drain()).chain(...);

hs0.par_extend(i.par_bridge());

@dcarosone, @cuviper:

I can't believe I forgot about 'reduce'.

If we add a file system + running on multiple nodes, we've basically re-invented "map reduce" :)
