The data is stored in two CSVs. Many of the columns would naturally be either a Categorical or an Enum type in Polars, but the values are already encoded as numbers in one CSV (the data CSV), with the lookups in the other (the schema CSV).
So, for example, the data CSV has a column called "sex" which contains either the number 0 or 1. The schema CSV then has the lookup: 0 maps to Male, 1 to Female. The lookup tables for other columns are rather more complex, but I have all the valid values up front.
I am struggling to see how to import this into Polars. I could just use the numbers, but I would like to be able to query and get results as "Male" and "Female", not "0" and "1".
I could convert the data CSV to strings by hand, then have Polars load them as Enum or Categorical datatypes. If I wanted to save the output, I'd have to do the reverse to keep the numbering consistent. But it all seems like a lot of effort, since Polars will just convert the strings back into a "number with schema lookup" table for me.
Thank you, that is useful. But that would result in two columns, one an integer and one a string. If I want to search for all mammals, I would still have to do a string-matching search on category_name, which would be slow; so I'd have to search on category_id for 0 instead.
My understanding of the documentation for both the Categorical and Enum datatypes is that they will do this for me: I get to search for mammal, but I am transparently searching for 0 instead. So, I'd still need to transform the results table to get that.
I didn't realize Polars had the Categorical/Enum types. They look interesting but don't appear to support storing your data in a normalized form (i.e. where the category-to-ID mapping is stored externally in another file).
I looked at the docs and the Polars source code and it appears there is no way to explicitly define the mapping of internal ID to category name. You can only pass an ordered list of category names, and the IDs are inferred internally.
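If the IDs really are inferred from list position, as it appears, then one thing you could try is building the category list in code order, so that position 0 is the label for code 0 and so on. Here is a rough sketch in plain Rust with no Polars calls at all; `ordered_categories` is a name I made up, and it assumes the codes in the schema CSV run 0, 1, 2, ... with no gaps. Whether Polars will then let you reinterpret the integer column against that list is a separate question I haven't verified.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: turn (code, label) rows from the schema CSV into
/// a list of labels ordered by code, so that list position == code.
/// Fails if the codes are not exactly 0, 1, 2, ... (gaps or duplicates).
fn ordered_categories(rows: &[(u32, &str)]) -> Result<Vec<String>, String> {
    // BTreeMap keeps the entries sorted by code.
    let mut by_code = BTreeMap::new();
    for &(code, label) in rows {
        if by_code.insert(code, label).is_some() {
            return Err(format!("duplicate code {code}"));
        }
    }

    let mut categories = Vec::with_capacity(by_code.len());
    for (expected, (&code, &label)) in by_code.iter().enumerate() {
        // A gap would shift every later label to the wrong position.
        if code as usize != expected {
            return Err(format!("expected code {expected}, found {code}"));
        }
        categories.push(label.to_string());
    }
    Ok(categories)
}
```

If the check fails for some column, positional IDs cannot match your existing numbers for that column, and you are back to an explicit mapping.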
The Enum support also appears to be early and still experimental, so it may not be mature yet.
What's the issue with storing a lookup table explicitly? It could even be just a Vec.
You then hold things as IDs and work from there. If keys clash and the ID by itself is not enough, you can always make your own IDs that map directly to unique strings.
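Roughly what I mean, as a minimal sketch in plain Rust. The `Lookup` type and `rows_matching` function are names invented for illustration, and it assumes the codes are dense so the Vec index can double as the ID. The label is resolved to its ID once, and the scan over the column only compares integers.

```rust
/// Lookup table where the index in the Vec is the numeric code.
struct Lookup {
    labels: Vec<String>,
}

impl Lookup {
    fn new(labels: &[&str]) -> Self {
        Lookup {
            labels: labels.iter().map(|s| s.to_string()).collect(),
        }
    }

    /// Resolve a label to its code (one string search, done up front).
    fn code_of(&self, label: &str) -> Option<u32> {
        self.labels
            .iter()
            .position(|l| l.as_str() == label)
            .map(|i| i as u32)
    }

    /// Turn a code back into its label for display.
    fn label_of(&self, code: u32) -> Option<&str> {
        self.labels.get(code as usize).map(|s| s.as_str())
    }
}

/// Indices of the rows whose code matches `label`; empty if the label is unknown.
fn rows_matching(column: &[u32], lookup: &Lookup, label: &str) -> Vec<usize> {
    let mut hits = Vec::new();
    if let Some(code) = lookup.code_of(label) {
        for (row, &value) in column.iter().enumerate() {
            if value == code {
                hits.push(row);
            }
        }
    }
    hits
}
```

With a table like `Lookup::new(&["mammal", "bird", "fish"])` (only mammal = 0 comes from your post, the other labels are placeholders), the calling code still says "mammal" while the comparison runs on 0.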
Thanks for looking at the source -- I only tried the documentation; I should have looked there as well. I am guessing it would be easy to add this to the Enum type, as you have to define the categories up front; but if it isn't there, then I am stuck with a two-step process.
Why do something myself that I can get polars to do for me? Besides, I'd rather have queries that search for mammal than 0. The code will be much easier to read.
Well, it's your choice, but do note that C-style enums can make it so that mammal is 0. It's a bit annoying to use from Rust, but if you wrap it properly it should be fine.
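Something like this, as a sketch only, using the sex column from the first post (0 is Male, 1 is Female); the same pattern would make mammal 0. The `Sex` type and its conversions are just an illustration of the wrapping, not anything Polars provides.

```rust
use std::convert::TryFrom;
use std::fmt;

/// C-style enum whose discriminants are the codes from the data CSV.
#[repr(u8)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Sex {
    Male = 0,
    Female = 1,
}

/// Checked conversion from a raw code; an unknown code is handed back as the error.
impl TryFrom<u8> for Sex {
    type Error = u8;

    fn try_from(code: u8) -> Result<Self, Self::Error> {
        match code {
            0 => Ok(Sex::Male),
            1 => Ok(Sex::Female),
            other => Err(other),
        }
    }
}

/// Human-readable label for query results.
impl fmt::Display for Sex {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let name = match self {
            Sex::Male => "Male",
            Sex::Female => "Female",
        };
        f.write_str(name)
    }
}
```

Going back to the number for saving is `Sex::Female as u8`, so the numbering stays consistent. The annoying part is that every column needs its own enum and match arms, which gets tedious for the bigger lookup tables.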
Good luck with this; it sounds like an annoying nightmare.