Find Unique Records Based on an Array of Strings in JSON Using Polars

```json
[
  {
    "efileId": "E-6db46a60-1736-4dd1-8964-fb36f237e487",
    "stateName": "federal",
    "returnType": "1040",
    "docList": [
      "dMainInfo",
      "dReturnInfo",
      "d1040",
      "dIRS1040Sch2",
      "dPriceList",
      "dCustomLetter",
      "d8867",
      "dConsentToDisclose",
      "dConsentToUse",
      "d8962",
      "dIncDeductionInterviewMode",
      "d8879",
      "dW2",
      "dSchC",
      "d1099NEC",
      "d1095A",
      "d6198"
    ],
    "ackBalanceDue": 0,
    "ackRefund": 35
  },
  {
    "efileId": "E-64a009af-708c-4895-891b-c647ae5c62cd",
    "stateName": "federal",
    "returnType": "1040",
    "docList": [
      "dMainInfo",
      "dReturnInfo",
      "d1040",
      "d2YrFormCompare",
      "dW2",
      "dCustomLetter",
      "dConsentToDisclose",
      "dConsentToUse",
      "d8879",
      "dFormPayment"
    ],
    "ackBalanceDue": 1,
    "ackRefund": 0
  },
  {
    "efileId": "E-62b5aa4a-9fee-4a65-a910-1496be16a2f1",
    "stateName": "ny",
    "returnType": "201",
    "docList": [
      "dNYIT201",
      "dNYWktMIscwkt",
      "dTR579IT",
      "dNY573",
      "dNYIT213",
      "dNYIT213Wkt",
      "dNYIT215"
    ],
    "ackBalanceDue": 0,
    "ackRefund": 0
  }
]
```

I have JSON data with 2 lakh records like this, and I want to find the records whose docList combination is unique (no other record has the same docList), using Polars. Can anyone help me with the code? I also have three variants of the problem:

  1. The first is the one above: find the records with a unique docList.
  2. The second is to find the records where a certain value does not exist in docList.
  3. The third is to count how many times the same docList occurs in the data.
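
For the second variant, a minimal sketch in plain Rust (standard library only, not Polars) might look like this. The `Item` struct and the `without_doc` function are hypothetical names for illustration, and the records are assumed to be already parsed into memory:

```rust
/// Hypothetical record type; only two of the JSON fields are modeled here.
struct Item {
    efile_id: String,
    doc_list: Vec<String>,
}

/// Return references to the records whose doc list does NOT contain `doc`.
fn without_doc<'a>(items: &'a [Item], doc: &str) -> Vec<&'a Item> {
    items
        .iter()
        .filter(|item| !item.doc_list.iter().any(|d| d.as_str() == doc))
        .collect()
}
```

Calling `without_doc(&items, "d8879")` on the sample data would be expected to keep only the `ny` record, since both federal records list `d8879`.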

Please post text in code blocks, not images:

```json
{
    "jsonData": "Goes here"
}
```

⬇️

{
   "jsonData": "Goes here"
}

Yep, done. Can you help me with the solution?

What do you mean by "using polars"? If you have JSON, you should be using a JSON parser, not polars (which is a flat data frame library).

If it fits in memory, use a counter data structure. Something like a map.
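
As a rough illustration of the counter idea, here is a standard-library-only sketch. It assumes the doc lists have already been extracted into a slice of `Vec<String>`; `count_doc_lists` is a made-up name, and keys are cloned purely for simplicity:

```rust
use std::collections::HashMap;

/// Count how many times each exact doc list occurs
/// (same entries in the same order). Keys are cloned for simplicity.
fn count_doc_lists(doc_lists: &[Vec<String>]) -> HashMap<Vec<String>, usize> {
    let mut counts: HashMap<Vec<String>, usize> = HashMap::new();
    for docs in doc_lists {
        // Insert 0 the first time a list is seen, then bump the count.
        *counts.entry(docs.clone()).or_insert(0) += 1;
    }
    counts
}
```

A doc list with a count of 1 is unique in the data. Note that this treats two lists with the same entries in a different order as different; if order should not matter, one option would be to sort a copy of each list before using it as a key.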

There's probably something to that effect in polars. But the stdlib would be sufficient.

I know very little about polars, but it could be something like `series.lazy().group_by([col("docList")])`

Note for myself and others: 2 lakh is 200000, 2×10⁵.

Polars is well suited to data analysis. It is faster than even Pandas and can be very effective for analyzing large datasets.

I'm not particularly interested in the marketing; I'm curious how you are planning to parse JSON without a JSON parser.

Polars does Serde serialization and deserialization in the background, don't worry.

If you want a solution without polars, assuming you can translate it into polars terms yourself, and assuming `Item` is your Rust struct representing one of your records:

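```rust
use std::collections::HashMap;
use std::hash::{DefaultHasher, Hash, Hasher};

// Hash any hashable value down to a single u64.
fn hash<T: Hash>(t: &T) -> u64 {
    let mut s = DefaultHasher::new();
    t.hash(&mut s);
    s.finish()
}

// Bucket the items by the hash of their doc list.
fn group(list: Vec<Item>) -> HashMap<u64, Vec<Item>> {
    // The annotation lets the compiler resolve `.push` on the entry value.
    let mut map: HashMap<u64, Vec<Item>> = HashMap::new();
    for item in list {
        map.entry(hash(&item.doc_list)).or_default().push(item);
    }
    map
}
```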

All the items in the same bucket should have the same `docList` (barring a 64-bit hash collision). You can then filter by bucket length, etc.
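
Filtering by bucket length could look like the sketch below. It assumes `buckets` is a map shaped like the one `group` returns; `Item` is a hypothetical two-field struct and `unique_records` is an illustrative name, neither taken from the original post:

```rust
use std::collections::HashMap;

/// Hypothetical record type; only two of the JSON fields are modeled here.
#[derive(Debug)]
struct Item {
    efile_id: String,
    doc_list: Vec<String>,
}

/// Keep only the records that are alone in their bucket,
/// i.e. whose doc list is shared with no other record.
fn unique_records(buckets: HashMap<u64, Vec<Item>>) -> Vec<Item> {
    buckets
        .into_values()
        .filter(|bucket| bucket.len() == 1)
        .flatten()
        .collect()
}
```

The same map also answers the counting variant: each bucket's `len()` is the number of records sharing that doc list.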

This is a great solution without Polars! Well done! If you don't mind, could you share how you arrived at it (for example, from some source, or just on your own)?

I don't know. To me this is a pretty standard "group by"/"aggregate" operation. The counter data structures are generally taught/learned very early.

The only "trick" here is that I didn't want you to clone a `Vec<String>` for the keys, so I stored a hash of those instead of using them directly as keys (which I could have done).
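
For comparison, using the doc lists directly as keys does not strictly require cloning if the map borrows from the records instead of owning them. The following is only a sketch of that alternative, with a hypothetical `Item` struct and a made-up `group_by_doc_list` name; it trades ownership of the items for references:

```rust
use std::collections::HashMap;

/// Hypothetical record type; only two of the JSON fields are modeled here.
#[derive(Debug)]
struct Item {
    efile_id: String,
    doc_list: Vec<String>,
}

/// Group records by the doc list itself, borrowing each list as the key.
/// No `Vec<String>` is cloned, and hash collisions cannot merge
/// two different lists because keys are compared for equality.
fn group_by_doc_list(items: &[Item]) -> HashMap<&[String], Vec<&Item>> {
    let mut map: HashMap<&[String], Vec<&Item>> = HashMap::new();
    for item in items {
        map.entry(item.doc_list.as_slice()).or_default().push(item);
    }
    map
}
```

The returned map cannot outlive `items`, which is the price of not cloning. Whether that is acceptable depends on how the records are used afterwards.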

Polars should be able to do the same very easily.

I have written my code in Polars on my own.
