List of unique objects extracted from files

Hi everyone! :smiley:

Currently, I have a HashSet populated with objects initialized by parsing XML.

The problem is that the XML files I retrieve can contain the same object more than once, with some fields differing (everything except the unique ones).
The unique fields are: reference and creation date.

I have thought of 2 solutions:

  1. Extract everything into a Vec, then filter on the unique fields (checking the update date to keep the latest one)
  2. Order the files by date and keep the HashSet (if inserting into a HashSet keeps the most recent one)

The problem is that I am lost on a possible implementation.

Which solution do you think is best?
Could you help me, please? :sweat_smile:
(I am not necessarily asking you to write it for me, just to point me in the right direction.)

Current implementation (no solution implemented yet):
#[derive(Debug, Clone, Hash, Eq, PartialEq, Serialize, Deserialize)]
pub struct Ad {
    pub title: String,
    pub description: String,
    pub price: String,
    pub reference: String,
    pub photos: Vec<String>,
    pub creation_date: NaiveDateTime,
    pub update_date: NaiveDateTime,
}

let mut ads_list = HashSet::new();

read_dir
    .filter_map(Result::ok)
    .filter(|f| match f.metadata() {
        Ok(t) => t.is_file(),
        Err(_) => false,
    })
    .filter(|d| {
        d.path()
            .file_name()
            .and_then(|f| {
                f.to_str().and_then(|n| {
                    if let Some(id) = agency_id {
                        if !n.starts_with(format!("import-{}", id).as_str()) {
                            return None;
                        }
                    }

                    if n.ends_with(".xml") {
                        Some(d)
                    } else {
                        None
                    }
                })
            })
            .is_some()
    })
    .for_each(|f| {
        trace!("File {}", f.path().display());
        match self.parse_xml_for_path(f.path(), product_type) {
            Ok(r) => ads_list.extend(r),
            Err(e) => {
                warn!(
                    "An error occurred for file '{}': {}",
                    f.path().display(),
                    e
                )
            }
        }
    });

A HashSet seems unusual for this purpose. If you want a new item to replace an existing one based on some key, why not use a HashMap?

  • HashMap::insert always overwrites the value stored for a key. If your ads are already ordered by update date (oldest first), then you can just insert everything and the newest version wins.

  • HashMap::entry is an efficient way to check whether something is already present when inserting, without multiple lookups. You could use this to write a function for inserting an Ad only if it is newer:

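use std::collections::{hash_map, HashMap};

use chrono::NaiveDateTime;

// An ad is identified by its reference and creation date.
pub type Key = (String, NaiveDateTime);

impl Ad {
    pub fn key(&self) -> Key {
        // NaiveDateTime is Copy, so only the String needs cloning.
        (self.reference.clone(), self.creation_date)
    }
}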

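// Insert `ad` unless the map already holds a version that is at least as recent.
fn insert_if_newer(map: &mut HashMap<Key, Ad>, ad: Ad) {
    match map.entry(ad.key()) {
        hash_map::Entry::Occupied(mut entry) => {
            // Replace the stored ad only when the incoming one is strictly newer.
            if entry.get().update_date < ad.update_date {
                entry.insert(ad);
            }
        }
        hash_map::Entry::Vacant(entry) => {
            // First time this key is seen: store the ad as-is.
            entry.insert(ad);
        }
    }
}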
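
To make the two options concrete, here is a minimal, self-contained sketch of both. It is only an illustration under some assumptions: `Ad` is cut down to a few fields, the dates are plain `i64` timestamps instead of `NaiveDateTime` so that no external crate is needed, and the function names `dedup_with_entry` and `dedup_sorted` are made up for this example.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

// Simplified stand-in for the thread's `Ad`. The dates are plain integers
// (an assumption for this sketch) instead of chrono's NaiveDateTime.
#[derive(Debug, Clone, PartialEq)]
pub struct Ad {
    pub reference: String,
    pub price: String,
    pub creation_date: i64,
    pub update_date: i64,
}

pub type Key = (String, i64);

impl Ad {
    pub fn key(&self) -> Key {
        (self.reference.clone(), self.creation_date)
    }
}

// Option with HashMap::entry: works for any input order and keeps,
// for each key, the ad with the greatest update_date.
pub fn dedup_with_entry(ads: Vec<Ad>) -> HashMap<Key, Ad> {
    let mut map: HashMap<Key, Ad> = HashMap::new();
    for ad in ads {
        match map.entry(ad.key()) {
            Entry::Occupied(mut entry) => {
                if entry.get().update_date < ad.update_date {
                    entry.insert(ad);
                }
            }
            Entry::Vacant(entry) => {
                entry.insert(ad);
            }
        }
    }
    map
}

// Option with HashMap::insert: sort oldest-first, then insert everything,
// so that a later (newer) ad overwrites an earlier one with the same key.
pub fn dedup_sorted(mut ads: Vec<Ad>) -> HashMap<Key, Ad> {
    ads.sort_by_key(|ad| ad.update_date);
    let mut map = HashMap::new();
    for ad in ads {
        map.insert(ad.key(), ad);
    }
    map
}
```

Both functions should give the same map as long as no two versions of an ad share the same update date. On a tie, the entry-based version keeps the first ad it saw, while the sorted version keeps the last one in input order. Calling `.into_values().collect::<Vec<_>>()` on the result gives back a plain list of unique ads.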

P.S. I would remove the Hash and Eq derives from Ad. These traits don't really make sense for this type as it contains things like description and price that are clearly some sort of "payload," making this type unsuitable for use as a key. Just having PartialEq is enough to use things like assert_eq!, which is the primary reason why most people would want to compare a struct like this for equality.
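
As a rough illustration of that split, the sketch below keeps only `PartialEq` on the payload type and moves identity into a separate key type that is `Hash + Eq`. The `AdKey` struct and the `index_ads` function are hypothetical names for this example, the fields are reduced, and the date is an `i64` stand-in for `NaiveDateTime`.

```rust
use std::collections::HashMap;

// Payload type: `PartialEq` (plus `Debug`) is all that `assert_eq!` needs.
#[derive(Debug, Clone, PartialEq)]
pub struct Ad {
    pub reference: String,
    pub description: String,
    pub creation_date: i64, // stand-in for NaiveDateTime (assumption)
}

// Hypothetical key type: only the identifying fields, so deriving
// `Hash` and `Eq` here means exactly "same ad".
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct AdKey {
    pub reference: String,
    pub creation_date: i64,
}

impl Ad {
    pub fn key(&self) -> AdKey {
        AdKey {
            reference: self.reference.clone(),
            creation_date: self.creation_date,
        }
    }
}

// Index ads by key; a later ad with the same key replaces an earlier one.
pub fn index_ads(ads: Vec<Ad>) -> HashMap<AdKey, Ad> {
    let mut map = HashMap::new();
    for ad in ads {
        map.insert(ad.key(), ad);
    }
    map
}
```

With this layout, two versions of an ad compare as different values (`==` looks at every field) while still mapping to the same `AdKey`, which is the distinction the original `HashSet<Ad>` could not express.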

Hi, and thank you for your answer.

Excellent, that is what I need!
I don't know why I missed this simple solution :sweat_smile:

Thank you again! :smiley:
