This function removes every record whose docList is a subset of another record's docList, keeping only the supersets. My data is large, so the code takes a long time to run, and I'm using Polars to manipulate the data faster (using Polars is a requirement). To optimize, feel free to explore Polars itself; if there is a built-in library function that gives an optimized solution, please point it out.
fn remove_subset_records(df: DataFrame) -> Result<DataFrame, PolarsError> {
    // Materialize the `docList` column as one Vec<String> per row
    let doc_lists: Vec<Vec<String>> = df
        .column("docList")?
        .list()?
        .into_iter()
        .map(|opt_s| {
            opt_s
                .unwrap()
                .str()
                .unwrap()
                .into_iter()
                .map(|s| s.unwrap().to_string())
                .collect()
        })
        .collect();

    // Drop row i if its list is a subset of any other row's list
    let mut to_keep = vec![true; df.height()];
    for i in 0..doc_lists.len() {
        for j in 0..doc_lists.len() {
            if i != j && is_subset(&doc_lists[i], &doc_lists[j]) {
                to_keep[i] = false;
                break;
            }
        }
    }

    let mask = BooleanChunked::new("mask", to_keep);
    df.filter(&mask)
}
fn is_subset(list1: &[String], list2: &[String]) -> bool {
    list1.iter().all(|item| list2.contains(item))
}
Well, you do have to compare all hash sets with each other (as far as I can gather from your requirements). But you can optimize the inner loop by starting it at i + 1, so each unordered pair is visited once instead of twice; it's still O(n²), just with half the comparisons:
use std::collections::HashSet;

fn main() {
    let docs: Vec<HashSet<&str>> = vec![
        HashSet::from_iter(["foo", "bar", "baz"]),
        HashSet::from_iter(["foo", "bar"]),
        HashSet::from_iter(["baz"]),
        HashSet::from_iter(["baz", "qux"]),
    ];

    // mask[i] == true means docs[i] is a subset of some other set
    // and should be removed
    let mut mask = vec![false; docs.len()];
    for i in 0..docs.len() {
        if mask[i] {
            continue;
        }
        for j in i + 1..docs.len() {
            if docs[i].is_superset(&docs[j]) {
                mask[j] = true;
            } else if docs[i].is_subset(&docs[j]) {
                mask[i] = true;
            }
        }
    }
    println!("{:?}", mask);
}
No, I can't use i+1; I have to compare from the start, because there can be more than one possible superset for a doc list and I have to keep only one, so each doc list has to be compared against all the others from the start. (PS: in your answer the total number of records after filtering was 5124, while in mine it was 3176; mine is more accurate.) Here is my solution:
let doc_lists = df.column("docList")?;
let doc_lists: Vec<HashSet<String>> = doc_lists
    .list()?
    .into_iter()
    .map(|opt_set| {
        opt_set
            .unwrap()
            .str()
            .unwrap()
            .into_iter()
            .map(|s| s.unwrap().to_string())
            .collect()
    })
    .collect();
let mut to_keep = vec![true; df.height()];
for i in 0..doc_lists.len() {
    if !to_keep[i] {
        continue;
    }
    for j in 0..doc_lists.len() {
        if i != j && doc_lists[i].is_subset(&doc_lists[j]) {
            to_keep[i] = false;
            break;
        }
    }
}
These are also two nested loops (one for the all and one for the contains). Using a HashSet can reduce this to a single loop.
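For example, a sketch of that change against the is_subset helper from the question (assuming the doc lists are collected into HashSet<String> up front, as in the revised code above):

use std::collections::HashSet;

// One pass over `list1`; each `contains` is an O(1) hash lookup on
// average, rather than a linear scan over a Vec.
fn is_subset(list1: &HashSet<String>, list2: &HashSet<String>) -> bool {
    list1.iter().all(|item| list2.contains(item))
}

This is essentially what the standard library's HashSet::is_subset already does (plus a length shortcut), which is why the later snippets call it directly.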
In this loop you will visit each pair (i, j) twice (once as (i, j) and once as (j, i)). This is kinda wasteful, because you could visit each unordered pair once and decide whether doc_lists[i] is a subset of doc_lists[j] (or vice versa) in a single pass by counting how many elements they have in common.
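A sketch of that idea (subset_relation is just an illustrative name):

use std::collections::HashSet;

// Decide both directions for one unordered pair in a single pass:
// |a ∩ b| == |a| means a ⊆ b, and |a ∩ b| == |b| means b ⊆ a.
fn subset_relation(a: &HashSet<String>, b: &HashSet<String>) -> (bool, bool) {
    let common = a.intersection(b).count();
    (common == a.len(), common == b.len())
}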
You also need to be careful with equal sets: each will be considered a subset of the other, so both will be removed, even though there might not be any superset of them, and you'll end up losing both.
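A minimal illustration of that failure mode:

use std::collections::HashSet;

fn main() {
    let a: HashSet<&str> = HashSet::from_iter(["foo", "bar"]);
    let b: HashSet<&str> = HashSet::from_iter(["foo", "bar"]);
    // Both checks hold for equal sets, so a pass that removes every set
    // which has a superset elsewhere would drop both copies.
    assert!(a.is_subset(&b));
    assert!(b.is_subset(&a));
}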
Yes, an oversight of mine that I thought of after composing my previous answer. This might be a better solution than brute-forcing:
use std::collections::HashSet;

fn main() {
    let docs: Vec<HashSet<&str>> = vec![
        HashSet::from_iter(["foo", "bar", "baz"]),
        HashSet::from_iter(["foo", "bar", "baz", "qux"]),
        HashSet::from_iter(["foo", "bar"]),
        HashSet::from_iter(["baz"]),
        HashSet::from_iter(["baz", "qux"]),
    ];

    let mut supersets = Vec::new();
    supersets.push(&docs[0]);
    'outer: for i in 1..docs.len() {
        for s in 0..supersets.len() {
            // docs[i] has a superset -> continue
            if docs[i].is_subset(supersets[s]) {
                continue 'outer;
            }
        }
        // docs[i] does not have a superset; it may itself subsume one
        // or more of the already inserted supersets (not just one), so
        // drop all of those before inserting it as a new superset
        supersets.retain(|s| !s.is_subset(&docs[i]));
        supersets.push(&docs[i]);
    }
    println!("{:?}", supersets);
}
Yes, thanks for the equal-sets logic... I totally forgot about that. But it doesn't apply in my code, because the input only contains unique doc lists, so there's no need to worry about that case.
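For completeness, a sketch (untested; keep_supersets is an illustrative name, not part of any API) of feeding that superset pass back into a Polars filter by tracking kept row indices, in the same style as the code above:

use std::collections::HashSet;
use polars::prelude::*;

fn keep_supersets(
    df: DataFrame,
    doc_lists: &[HashSet<String>], // doc_lists[i] corresponds to row i of df
) -> Result<DataFrame, PolarsError> {
    // indices of the rows currently believed to be maximal (supersets)
    let mut kept: Vec<usize> = Vec::new();
    'outer: for i in 0..doc_lists.len() {
        for &k in &kept {
            // row i is covered by an already kept row -> skip it
            // (equal sets take this branch too, so only the first copy survives)
            if doc_lists[i].is_subset(&doc_lists[k]) {
                continue 'outer;
            }
        }
        // row i is maximal so far; drop any kept rows it subsumes
        kept.retain(|&k| !doc_lists[k].is_subset(&doc_lists[i]));
        kept.push(i);
    }

    // Turn the kept indices into a boolean mask and filter, as in the
    // original remove_subset_records
    let mut to_keep = vec![false; df.height()];
    for &i in &kept {
        to_keep[i] = true;
    }
    let mask = BooleanChunked::new("mask", to_keep);
    df.filter(&mask)
}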