Traits and Dataframe



Hi all,

I’ve been thinking about a dataframe library modeled after pandas 2.0. My design goal is to allow a different physical type to back each logical type: for example, a column of int8 could be backed by a Vec<i8>, while a categorical column could be backed by a custom data structure (something like a hashmap).

Questions:

  1. Traits or enums? Especially for the backing type, I’ve been leaning towards a trait Dtype which allows for flexibility. However, this leads to…
  2. If I have a trait Dtype, I will also want it to have an associated type Item. However, I won’t know which Dtype a Column needs until runtime, so the Dtype would have to be a trait object, and trait objects don’t allow an associated type (I think). Any suggestions for getting around this dilemma?
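To make the dilemma in question 2 concrete, here is a minimal sketch, assuming a trait named DataType with an associated type Item (the `to_vec` helper is hypothetical and only there to make the result observable). As far as I can tell, a trait object is possible only when the associated type is spelled out, which pins the element type at compile time:

```rust
// Hypothetical trait for illustration; not an existing library API.
trait DataType {
    type Item;

    // Apply `f` to every stored value in place.
    fn apply(&mut self, f: &dyn Fn(Self::Item) -> Self::Item);

    // Hypothetical helper: copy the values out so they can be inspected.
    fn to_vec(&self) -> Vec<Self::Item>;
}

struct Int8Type {
    values: Vec<i8>,
}

impl DataType for Int8Type {
    type Item = i8;

    fn apply(&mut self, f: &dyn Fn(i8) -> i8) {
        for v in self.values.iter_mut() {
            *v = f(*v);
        }
    }

    fn to_vec(&self) -> Vec<i8> {
        self.values.clone()
    }
}

// Compiles: the associated type is named in the trait object type.
fn boxed_int8(values: Vec<i8>) -> Box<dyn DataType<Item = i8>> {
    Box::new(Int8Type { values })
}

// Does not compile (error E0191: the associated type `Item` must be specified):
// fn boxed_any(values: Vec<i8>) -> Box<dyn DataType> {
//     Box::new(Int8Type { values })
// }
```

So `Box<dyn DataType<Item = i8>>` works, but a single `Box<dyn DataType>` covering every element type does not, which is exactly the runtime flexibility I am after.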

Any thoughts on how I might design dataframes, or on the usage of traits and enums, are welcome.

Also, I wanted to note that the pandas 2.0 design doc does something similar in C++ (but they use classes).


I’m trying to avoid having an enum wrapper for values in a Vec for each column, because I’m trying to reduce overhead and because I think it limits using different data structures for each type.
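A rough sketch of the overhead I mean, assuming a hypothetical per-value wrapper called `Value` (the variants are only examples). Every cell has to be as large as the biggest variant, so an i8 column stored as a Vec of this enum uses far more than one byte per value:

```rust
use std::mem::size_of;

// Hypothetical per-value wrapper: each cell carries its own type tag and
// must be big enough to hold the largest variant.
#[allow(dead_code)]
enum Value {
    Int8(i8),
    Float64(f64),
    Text(String),
}

// Returns (bytes per cell in a Vec<i8>, bytes per cell in a Vec<Value>).
fn bytes_per_cell() -> (usize, usize) {
    (size_of::<i8>(), size_of::<Value>())
}
```

The exact size of `Value` depends on the target and on layout optimizations, but it can never be smaller than its largest variant (here, a String).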

Perhaps I can do this without using an associated type?

From my understanding of generics, they won’t give me the flexibility I want in terms of runtime types.
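Here is a small sketch of why I think generics fall short, using hypothetical `TypedColumn<T>` and `TypedFrame<T>` types. The element type is fixed at compile time, so one frame can only hold columns of a single type:

```rust
// Hypothetical generic column: the element type T is fixed at compile time.
struct TypedColumn<T> {
    values: Vec<T>,
}

// A frame that is generic over T can only hold columns of that one type.
struct TypedFrame<T> {
    columns: Vec<TypedColumn<T>>,
}

fn int8_frame() -> TypedFrame<i8> {
    TypedFrame {
        columns: vec![
            TypedColumn { values: vec![1, 2, 3] },
            TypedColumn { values: vec![4, 5, 6] },
            // Adding a float column here would not compile (mismatched types):
            // TypedColumn { values: vec![1.5_f64] },
        ],
    }
}
```

A dataframe loaded from a file at runtime needs mixed column types in one container, which this shape cannot express.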


Some code to illustrate what I’m thinking:

struct DataFrame {
  columns: Vec<Column>,
}

struct Column {
  // This doesn't compile as written. I'd want a Box<dyn DataType> to make a
  // trait object, but that seems to be disallowed because of the associated type.
  // Even then, it seems I'd have to specify the associated type's value, so I
  // think I might need ATC (associated type constructors) for this situation?
  dtype: DataType,
}

trait DataType {
  type Item;
  
  fn apply(&mut self, f: &dyn Fn(Self::Item) -> Self::Item); // rough signature, but this is the idea
}

struct Int8Type {
  values: Vec<i8>,
  mask: BitVec, // for storing nulls
}

impl DataType for Int8Type {
  type Item = i8;
  fn apply()...
}

An alternative approach moves the trait up a level to simplify things, but it runs into the same basic issue:

struct DataFrame {
  // This seems to be the basic issue: wanting trait objects at runtime, but with a trait that
  // has an associated type.
  columns: Vec<Box<dyn Column>>,
}

trait Column {
  type Item;
  fn apply(&mut self, f: &dyn Fn(Self::Item) -> Self::Item); // rough signature, but this is the idea
}

struct Int8Column {
  values: Vec<i8>,
  mask: BitVec, // for storing nulls
}

impl Column for Int8Column {
  type Item = i8;
  fn apply()...
}
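One direction that might sidestep the associated type is sketched below. It is only an assumption on my part, not a settled design: the Column trait keeps only type-independent methods, and typed operations are reached by downcasting through std::any::Any. The `as_any` and `as_any_mut` methods and the `add_one_to_int8` helper are hypothetical, and a Vec<bool> stands in for BitVec so the sketch needs no external crate.

```rust
use std::any::Any;

// Hypothetical trait with no associated type, so Box<dyn Column> is allowed.
trait Column {
    fn len(&self) -> usize;
    fn as_any(&self) -> &dyn Any;
    fn as_any_mut(&mut self) -> &mut dyn Any;
}

struct Int8Column {
    values: Vec<i8>,
    mask: Vec<bool>, // stand-in for BitVec; true means the value is present
}

impl Int8Column {
    // Typed operation lives on the concrete type, not on the trait.
    fn apply(&mut self, f: impl Fn(i8) -> i8) {
        for (v, present) in self.values.iter_mut().zip(&self.mask) {
            if *present {
                *v = f(*v);
            }
        }
    }
}

impl Column for Int8Column {
    fn len(&self) -> usize {
        self.values.len()
    }
    fn as_any(&self) -> &dyn Any {
        self
    }
    fn as_any_mut(&mut self) -> &mut dyn Any {
        self
    }
}

struct DataFrame {
    columns: Vec<Box<dyn Column>>,
}

// Adds 1 to every non-null value if the column at `index` is an Int8Column.
// Returns false when the column has a different concrete type.
fn add_one_to_int8(df: &mut DataFrame, index: usize) -> bool {
    match df.columns[index].as_any_mut().downcast_mut::<Int8Column>() {
        Some(col) => {
            col.apply(|x| x + 1);
            true
        }
        None => false,
    }
}
```

The trade-off is that type errors move to runtime: a failed downcast has to be handled wherever a typed operation is requested.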