Can anyone optimize this code to remove this double looping (if possible removing any explicit looping)

This function is removing all the subset of doclists in the records and only keeping the superset ones but my data is large so it is taking time to run the code and I am using polars to manipulate data more fastly (use of polars is necessary) to optmize you can even explore polars can give me some inbuilt library function to find out optimized solution.

fn remove_subset_records(df: DataFrame) -> Result<DataFrame, PolarsError> {
    let doc_lists: Vec<Vec<String>> = df
        .column("docList")?
        .list()?
        .into_iter()
        .map(|opt_s| {
            opt_s
                .unwrap()
                .str()
                .unwrap()
                .into_iter()
                .map(|s| s.unwrap().to_string())
                .collect()
        })
        .collect();

    let mut to_keep = vec![true; df.height()];

    for i in 0..doc_lists.len() {
        for j in 0..doc_lists.len() {
            if i != j && is_subset(&doc_lists[i], &doc_lists[j]) {
                to_keep[i] = false;
                break;
            }
        }
    }

    let mask = BooleanChunked::new("mask", to_keep);

    df.filter(&mask)
}

fn is_subset(list1: &[String], list2: &[String]) -> bool {
    list1.iter().all(|item| list2.contains(item))
}

I think you can collect the doc lists into HashSets instead and use HashSet::is_superset to filter the supersets.

1 Like

In my code I have done that but still I will have to keep those 2 for loops in code

Well, you do have to compare all hash sets with each other (as far as I can gather from your requirements). You can optimize the inner loop and make the two loops O(n log n) instead of O(n²):

use std::collections::HashSet;

fn main() {
    let docs: Vec<HashSet<&str>> = vec![
        HashSet::from_iter(["foo", "bar", "baz"]),
        HashSet::from_iter(["foo", "bar"]),
        HashSet::from_iter(["baz"]),
        HashSet::from_iter(["baz", "qux"]),
    ];
    
    let mut mask = vec![false; docs.len()];
    
    for i in 0..docs.len() {
        if mask[i] {
            continue;
        }
        
        for j in i+1..docs.len() {
            if docs[i].is_superset(&docs[j]) {
                mask[i] = true;
            } else if docs[i].is_subset(&docs[j]) {
                mask[j] = true;
            }
        }
    }
    
    println!("{:?}", mask);
}

Playground.

no i cant use i+1 i have to compare from the start because there can be more than one superset possible for one doclist and I have to keep only the one so have to compare each doclist from the start (Ps: In your answer total records after filtering were 5124 and in mine there were 3176) mine is more accurate here is my solution.

  let doc_lists = df.column("docList")?;
    let doc_lists: Vec<HashSet<String>> = doc_lists
        .list()?
        .into_iter()
        .map(|opt_set| {
            opt_set
            .unwrap()
            .str()
                .unwrap()
                .into_iter()
                .map(|s| s.unwrap().to_string())
                .collect()
            })
        .collect();
    let mut to_keep = vec![true; df.height()];

    for i in 0..doc_lists.len() {
        if !to_keep[i] {
            continue;
        }
        for j in 0..doc_lists.len() {
            if i != j && doc_lists[i].is_subset(&doc_lists[j]) {
                to_keep[i] = false;
                break;
            }
        }
    }

A couple of observations:

These are also two nested loops (one for all and one for contains). Using an HashSet can help making this only one loop.

In this loop you will visit each pair (i, j) twice (once as (i, j) and once as (j, i)). This is kinda wasteful because you could determine whether doc_lists[i] is a subset of doc_lists[j] in a single loop by counting how many elements they have in common.

You also need to be careful of equal sets, since they will both be considered subset of the other and both will be removed, even though there might not be any superset of them, so you'll end up losing them.

1 Like

Yes, an oversight of mine I thought of after composing my previous answer. This might be a better solution than brute forcing:

use std::collections::HashSet;

fn main() {
    let docs: Vec<HashSet<&str>> = vec![
        HashSet::from_iter(["foo", "bar", "baz"]),
        HashSet::from_iter(["foo", "bar", "baz", "qux"]),
        HashSet::from_iter(["foo", "bar"]),
        HashSet::from_iter(["baz"]),
        HashSet::from_iter(["baz", "qux"]),
    ];
    
    let mut supersets = Vec::new();
    supersets.push(&docs[0]);
    
    'outer: for i in 1..docs.len() {
        for s in 0..supersets.len() {
            // docs[i] has a superset -> continue
            if docs[i].is_subset(supersets[s]) {
                continue 'outer;
            }
            
            // docs[i] is the superset of the superset -> replace the original
            // superset
            if docs[i].is_superset(supersets[s]) {
                supersets[s] = &docs[i];
                continue 'outer;
            }
        }
        // docs[i] does not have a superset, nor is it a superset of an already
        // inserted superset -> insert it as a new superset
        supersets.push(&docs[i]);
    }
    
    println!("{:?}", supersets);
}

Playground.

Instead of pushing references to the the hash sets into supersets you can also push indices, which would make it easier to construct a mask vector.

1 Like

Yes thanks for equal sets logic ...totally forgot that ..but not in my code because the code will only contain the unique doclists so no need to worry for that

Anyway here's a commented version that handles equal sets too:

At this point if there's a complexity improvement I think it would be with some kind of trie.

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.