Hi all,
I've been thinking about a dataframe library, modeled after pandas 2.0. My design goal is to allow different physical types to back each logical type (e.g. a column of int8 can be backed by a Vec<i8>, but a categorical column can be backed by a custom data structure, similar to a hash map).
Questions:
- Traits or enums? Especially for the backing type, I've been leaning towards a trait, Dtype, which allows for flexibility. However, this leads to...
- If I have a trait Dtype, I will also want it to have an associated type, Item. However, I don't know which Dtype a Column will need until runtime, so the Dtype would need to be a trait object, but trait objects don't allow an associated type (I think). Any suggestions for getting around my dilemma?
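To make the dilemma in the second question concrete, here is a minimal sketch of what I understand the compiler to accept. As far as I can tell, a trait object can be built from a trait with an associated type, but only if that type is spelled out in the object type, which pins every column to the same Item. The get and len methods and the exact apply signature are my own assumptions for illustration, and the null mask is left out.

```rust
// Sketch only: names other than DataType, Item, apply, Int8Type and Column
// are assumptions made up for this example.
trait DataType {
    type Item;
    // Taking `&mut dyn FnMut` keeps the method callable through a trait object.
    fn apply(&mut self, f: &mut dyn FnMut(Self::Item) -> Self::Item);
    fn get(&self, index: usize) -> Option<Self::Item>;
    fn len(&self) -> usize;
}

struct Int8Type {
    values: Vec<i8>,
}

impl DataType for Int8Type {
    type Item = i8;

    fn apply(&mut self, f: &mut dyn FnMut(Self::Item) -> Self::Item) {
        for v in self.values.iter_mut() {
            *v = f(*v);
        }
    }

    fn get(&self, index: usize) -> Option<Self::Item> {
        self.values.get(index).copied()
    }

    fn len(&self) -> usize {
        self.values.len()
    }
}

struct Column {
    // The associated type has to be named here, so every Column is
    // effectively an i8 column. That is the limitation I keep hitting.
    dtype: Box<dyn DataType<Item = i8>>,
}

fn main() {
    let mut col = Column {
        dtype: Box::new(Int8Type { values: vec![1, 2, 3] }),
    };
    col.dtype.apply(&mut |v: i8| v * 2);
    println!("{:?} (len {})", col.dtype.get(1), col.dtype.len());
}
```

So the associated type itself is not forbidden; the problem is that a Vec of such columns could only ever hold one Item type.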
Any thoughts on how I might design dataframes, or on the usage of traits and enums, are welcome.
Also, I wanted to note that the pandas 2.0 design doc does something similar in C++ (but they use classes).
I'm trying to avoid having an enum wrapper for values in a Vec for each column, because I'm trying to reduce overhead and because I think it limits using different data structures for each type.
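To illustrate the overhead I mean, here is a rough sketch comparing a per-value enum with dense storage. Value and ColumnData are hypothetical names, not part of any existing design. The per-column enum is shown only as a contrast: it keeps the storage dense, but it also fixes the set of column types up front, which is part of what I'd like to avoid.

```rust
use std::mem::size_of;

// Hypothetical per-value wrapper: every element carries its own tag,
// and every element is as large as the largest variant.
#[allow(dead_code)]
enum Value {
    Int8(i8),
    Float64(f64),
}

// Hypothetical per-column enum: one tag per column, dense storage inside.
#[allow(dead_code)]
enum ColumnData {
    Int8(Vec<i8>),
    Float64(Vec<f64>),
}

impl ColumnData {
    fn len(&self) -> usize {
        match self {
            ColumnData::Int8(v) => v.len(),
            ColumnData::Float64(v) => v.len(),
        }
    }
}

fn main() {
    // An element of Vec<Value> is larger than an element of Vec<i8>.
    println!("Value element: {} bytes", size_of::<Value>());
    println!("i8 element: {} bytes", size_of::<i8>());

    let col = ColumnData::Int8(vec![1, 2, 3]);
    println!("column length: {}", col.len());
}
```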
Perhaps I can do this without using an associated type?
From my understanding, generics won't give me the flexibility I want in terms of runtime types.
Some code to illustrate what I'm thinking:
struct DataFrame {
    columns: Vec<Column>,
}
struct Column {
    // This doesn't work, both because you'd want a Box<DataType> to
    // make a trait object (and aren't allowed to because of the associated type),
    // and because it seems that even then you'd have to specify the associated type
    // value, which I think I might need ATC for in this situation?
    dtype: DataType,
}
trait DataType {
    type Item;
    fn apply(&mut self, f: Fn(Self::Item) -> Self::Item); // not full sig, but this is the idea
}
struct Int8Type {
    values: Vec<i8>,
    mask: BitVec, // for storing nulls
}
impl DataType for Int8Type {
    type Item = i8;
    fn apply()...
}
An alternative approach moves the trait to a different level to simplify things, but it has the same basic issue:
struct DataFrame {
    // This seems to be the basic issue: wanting trait objects at runtime, but with a trait that
    // has an associated type.
    columns: Vec<Box<Column>>,
}
trait Column {
    type Item;
    fn apply(&mut self, f: Fn(Self::Item) -> Self::Item); // not full sig, but this is the idea
}
struct Int8Column {
    values: Vec<i8>,
    mask: BitVec, // for storing nulls
}
impl Column for Int8Column {
    type Item = i8;
    fn apply()...
}
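One direction I can imagine, sketched here under my own assumptions rather than as a settled design, is to drop the associated type from the trait entirely and recover the concrete column type at runtime with std::any::Any. The as_any and as_any_mut methods are names I made up for this sketch, the typed apply lives on the concrete struct instead of the trait, and the null mask is omitted to keep the example free of external crates.

```rust
use std::any::Any;

// The trait has no associated type, so Box<dyn Column> can hold any column.
trait Column {
    fn len(&self) -> usize;
    // Hypothetical helpers that expose the concrete type for downcasting.
    fn as_any(&self) -> &dyn Any;
    fn as_any_mut(&mut self) -> &mut dyn Any;
}

struct Int8Column {
    values: Vec<i8>,
}

impl Int8Column {
    // Typed operations live on the concrete type, not on the trait.
    fn apply(&mut self, f: impl Fn(i8) -> i8) {
        for v in self.values.iter_mut() {
            *v = f(*v);
        }
    }
}

impl Column for Int8Column {
    fn len(&self) -> usize {
        self.values.len()
    }
    fn as_any(&self) -> &dyn Any {
        self
    }
    fn as_any_mut(&mut self) -> &mut dyn Any {
        self
    }
}

struct DataFrame {
    columns: Vec<Box<dyn Column>>,
}

fn main() {
    let mut df = DataFrame {
        columns: vec![Box::new(Int8Column { values: vec![1, 2, 3] })],
    };

    // The caller has to know (or check) which concrete column type to expect.
    if let Some(col) = df.columns[0].as_any_mut().downcast_mut::<Int8Column>() {
        col.apply(|v| v + 1);
    }

    if let Some(col) = df.columns[0].as_any().downcast_ref::<Int8Column>() {
        println!("{:?}", col.values);
    }
    println!("rows: {}", df.columns[0].len());
}
```

The cost is that every typed operation needs a downcast that can fail at runtime, so I'm not sure this is better than an enum.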