Find first duplicated value in an iterator

Is there an easy way to find the first duplicated value in an iterator consisting only of said values (maybe with itertools)? The way I am currently solving this is by building a HashSet until insertion fails due to duplication, which is fine tbh, but just wondering whether someone has another solution.

Since itertools also uses a hashset underneath the hood of all_unique, I don't think there's a better solution.

1 Like

This is what I have for now:

        // Find duplicated ports
        let mut unique_ports = HashSet::new();
        for camera in cameras.iter() {
            if !unique_ports.insert(&camera.port) {
                // throw error
            }
        }

Any improvement suggestions?

Yes, I'd use Itertoos::all_unique:

if !cameras.iter().all_unique() {
    // throw error
}

But I don't get the duplicated value with .all_unique, and I want to display that in the error (sorry, should have made that clear in my example).

Ah yes of course. Then I think your solution is fine. If you abstract it in a function or method check_duplicated_ports I think everybody that will read your code will understand what is going on.

1 Like

You can write your own iterator extension : Rust Playground

5 Likes

In this case this is the entire code I am working on:

impl TryFrom<CamerasConfigJSON> for CamerasConfig {
    type Error = anyhow::Error;

    fn try_from(cameras: CamerasConfigJSON) -> std::result::Result<Self, Self::Error> {
        // Find duplicated ports
        let mut unique_ports = HashSet::new();
        for camera in cameras.0.iter() {
            if !unique_ports.insert(&camera.port) {
                error!("Parsing failed because JSON contains duplicated port: {}", camera.port);
                bail!("Parsing failed because JSON contains duplicated port {}, even though ports are expected to be unique", camera.port);
            }
        }

        Ok(Self(
            cameras.0
                .into_iter()
                .map(|camera| (camera.port, camera.command))
                .collect(),
        ))
    }
}

Do you think it still makes sense to extract it to another method? If yes, where would I put it? Inside the TryFrom impl, or outside as a private function?

In that case I like @tguichaoua's solution. I just feel like hiding away implementation details (the hashset) is always a good idea to increase code quality.

Do you need to preserve the config order when storing it inside CamerasConfig?

If not, it might make sense to store a HashMap¹ inside CamerasConfig:

(untested, just a sketch)

impl TryFrom<CamerasConfigJSON> for CamerasConfig {
    type Error = anyhow::Error;

    fn try_from(cameras: CamerasConfigJSON) -> std::result::Result<Self, Self::Error> {
        let mut port_map = HashMap::new();
        for camera in cameras.0.into_iter() {
            match port_map.entry(camera.port) {
                Vacant(entry) => { entry.insert(camera.command); }
                Occupied(_) => { bail!(...); }
            }
        }

        Ok(Self(port_map))
    }
}

¹ or BTreeMap, if you want a predictable iteration order

4 Likes

This is probably the best solution for my exact use case. Storing a HashMap in CamerasConfig works perfectly. Didn't know about the entry method, but looks super cool. Just to test that I understand the .entry() method correctly (I think the documentation is a little lacking): If the key doesn't yet exist in the hashmap, it gets inserted and I receive an Entry back, into which I can insert the value. In case the key does already exist, I get the Occupied variant back.
Thanks for the great answer!

That's pretty much right, but you've got the order of operations a little bit wrong— If the key doesn't exist, you get a Vacant variant back. Only if you call insert (or similar) on that variant is the hash map written to. So, if you don't call insert, there isn't a dangling slot for the key you passed to entry.

This avoids a double-lookup because the VacantEntry keeps a reference to the empty slot where the item should be stored, which it had to find to discover that the key is missing in the first place.

4 Likes

Quick question: Why not use port_map.insert(key, val) directly? Is there some difference/disadvantage to this compared to the Entry version? The only difference I see is that with your Entry version the value doesn't actually get inserted if the key already exists. But I'm throwing an error in that case anyways, so I don't really care if that element gets added to the hashmap I no longer care about, right?

impl TryFrom<CamerasConfigJSON> for CamerasConfig {
    type Error = anyhow::Error;

    fn try_from(cameras: CamerasConfigJSON) -> std::result::Result<Self, Self::Error> {
        let mut port_map = HashMap::new();
        for camera in cameras.0.into_iter() {
            if port_map.insert(camera.port, camera.command).is_some() {
                bail!("Parsing failed because JSON contains duplicated port {}, even though ports are expected to be unique", camera.port);
            }
        }

        Ok(Self(port_map))
    }
}

Where entry really shines is when you want to do an upsert operation, where you insert a default value (if necessary) and then modify the value (whether you inserted something or not). For example:

fn count_items<T:Hash+Eq>(items: impl Iterator<Item=T>)->HashMap<T,usize> {
    let mut map = HashMap::new();
    for x in items {
        *map.entry(x).or_insert(0) += 1;
    }
    map
}

As this is usually the sort of operation I’m writing, I tend to reach for entry first. In this case, there’s no real advantage over a plain insert; it’s just a question of style.

5 Likes

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.