Right way to use Box and Traits for data known only at runtime?

I'm working on a toy project to replicate pandas's read_csv function as a way to learn Rust. To start, I've implemented the DataFrame struct, but was wondering if this is the right approach, because it seems wasteful/wrong to create a Box pointer for each value in Vec.

trait Value {}
trait ColumnData {}

impl Value for u8 {}
impl Value for f32 {}
impl Value for String {}

struct Column {
    name: String,
    data: Vec<Box<dyn Value>>,
}

struct DataFrame {
    columns: Vec<Column>, 
}

impl DataFrame {}

pub fn test() -> DataFrame {
    DataFrame {
        columns: vec![
            Column {
                name: "col_ints".to_string(),
                data: vec![Box::new(1), Box::new(2), Box::new(3)],
            },
            Column {
                name: "col_str".to_string(),
                data: vec![
                    Box::new("hello".to_string()),
                    Box::new("world".to_string()),
                    Box::new("ok".to_string()),
                ],
            },
        ],
    }
}

I took this approach because I'd only know the data types at runtime when read_csv is called, and each column would store different data types, e.g. one column might store strings, and another floats.

How would you implement this?

Thanks.


EDIT: I came up with an alternative approach. This seems better to me?

trait Value {}

trait ColumnData {}

impl Value for u8 {}
impl Value for f32 {}
impl Value for String {}

#[derive(Debug)]
struct Column<T: Value> {
    name: String,
    data: Vec<T>,
}

impl<T: Value> ColumnData for Column<T> {}

struct DataFrame {
    columns: Vec<Box<dyn ColumnData>>
}

impl DataFrame {}

pub fn test() -> DataFrame {
    DataFrame {
        columns: vec![
            Box::new(Column {
                name: "col_ints".to_string(),
                data: vec![1, 2, 3, 4, 5],
            }),
            Box::new(Column {
                name: "col_strs".to_string(),
                data: vec![
                    "hello".to_string(),
                    "world".to_string(),
                    "ok".to_string(),
                    "123".to_string(),
                    "stuff".to_string(),
                ],
            }),
        ],
    }
}

Your method is feasible because you indeed need to store different types of values in a dynamic environment. However, your implementation may lead to inefficient memory usage because you use Box pointers for each type. Box is used for heap allocation, and should only be used when shared or moving ownership is required.

Part of the issue is also that my current implementation allows a Column to have String and u8 values in data, since both implement the Value trait.

Ideally, Column's data field should only contain a single type, e.g. if an element is String, then all other elements should be String within data.

Should I instead do something like this?

struct Column<T: Value> {
    name: String,
    data: Vec<T>,
}

But now the compiler complains about not passing in the generic type T here:

struct DataFrame {
    columns: Vec<Column>
}

EDIT: edited original post to include a more fully-formed alternative approach of the above

Indeed it is heavy on allocations and pointer chasing. But if you truly need to allow all elements to have a different type, then there's no way around this.

However, I don't think that is the case here. You only assert that different columns need to admit different types, and with a data frame, that is usually the case: columns represent different variables, so they can have different domains, but a given variable will be of a single type.

So instead of boxing each individual element, you could instead move the dynamic dispatch out one level, and implement traits for Vec<ConcreteType> representing columns. I think this is exactly what your second approach does, so you should just go with it.
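To make "moving the dynamic dispatch out one level" concrete, here is a minimal sketch that implements a trait directly for `Vec<ConcreteType>`. The methods `len` and `type_name` are hypothetical, chosen only for illustration. A real column trait would expose whatever operations the data frame needs.

```rust
// Hypothetical column trait; `len` and `type_name` are illustrative only.
trait ColumnData {
    fn len(&self) -> usize;
    fn type_name(&self) -> &'static str;
}

// One impl per concrete column type: the elements are stored inline in the
// Vec, and only the column as a whole sits behind a trait object.
impl ColumnData for Vec<u8> {
    fn len(&self) -> usize {
        self.as_slice().len()
    }
    fn type_name(&self) -> &'static str {
        "u8"
    }
}

impl ColumnData for Vec<f32> {
    fn len(&self) -> usize {
        self.as_slice().len()
    }
    fn type_name(&self) -> &'static str {
        "f32"
    }
}

impl ColumnData for Vec<String> {
    fn len(&self) -> usize {
        self.as_slice().len()
    }
    fn type_name(&self) -> &'static str {
        "String"
    }
}

struct Column {
    name: String,
    // One Box per column instead of one Box per value.
    data: Box<dyn ColumnData>,
}

struct DataFrame {
    columns: Vec<Column>,
}

fn sample_frame() -> DataFrame {
    DataFrame {
        columns: vec![
            Column {
                name: "col_ints".to_string(),
                data: Box::new(vec![1u8, 2, 3]),
            },
            Column {
                name: "col_strs".to_string(),
                data: Box::new(vec!["hello".to_string(), "world".to_string()]),
            },
        ],
    }
}
```

With this layout a column cannot mix types, because each `Vec<T>` has exactly one element type.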

No, box cannot be used for shared ownership.
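A small sketch of the difference, in case it helps: `Box<T>` has exactly one owner, so assigning it moves it, while `Rc<T>` is the standard-library type for shared ownership within a single thread. The function names here are made up for the example.

```rust
use std::rc::Rc;

// Box has a single owner: assigning it to another binding moves it.
fn box_is_unique() -> Box<String> {
    let a = Box::new("hello".to_string());
    let b = a; // `a` has been moved and can no longer be used
    b
}

// Rc is reference-counted: cloning the handle shares the same allocation.
fn rc_is_shared() -> usize {
    let a = Rc::new("hello".to_string());
    let _b = Rc::clone(&a); // second owner of the same String
    Rc::strong_count(&a) // both `a` and `_b` are still alive here
}
```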


Thank you very much for correcting my mistakes.

CSV has a rather limited set of contained types... Why not enum over them?

enum CsvVal {
    String(String),
    Uint8(u8),
    Float(f32),
    ...
}

struct Column {
    name: String,
    data: Vec<CsvVal>,
}

Hmm, wouldn't this mean the data vector can contain String, Uint8 and/or Float together? But this isn't desirable, right, since a column should only have a single type - i.e. it should contain all Float values, not a mix of types.

Building on @H2CO3's response on the second approach (using trait objects on the Column struct), I also think I'd need to have type-specific columns, e.g. a Uint8Column, StrColumn - partly to perform type-specific methods, e.g. .average().

You'd have instead:

enum CsvColumn {
    String(Vec<String>),
    Uint8(Vec<u8>),
    Float(Vec<f32>),
    ...
}

That's almost exactly the same as the trait object approach, except that the set of types must be closed and known at compile time.
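As a rough sketch of how type-specific methods such as `.average()` could sit on this enum, assuming only the three variants shown (the elided ones are left out) and a hypothetical `average` that returns `None` for non-numeric or empty columns:

```rust
enum CsvColumn {
    String(Vec<String>),
    Uint8(Vec<u8>),
    Float(Vec<f32>),
}

impl CsvColumn {
    // Number of rows, whatever the element type is.
    fn len(&self) -> usize {
        match self {
            CsvColumn::String(v) => v.len(),
            CsvColumn::Uint8(v) => v.len(),
            CsvColumn::Float(v) => v.len(),
        }
    }

    // Hypothetical helper: mean of a numeric column, computed in f64.
    // Returns None for string columns and for empty columns.
    fn average(&self) -> Option<f64> {
        match self {
            CsvColumn::Uint8(v) if !v.is_empty() => {
                let sum: f64 = v.iter().map(|&x| f64::from(x)).sum();
                Some(sum / v.len() as f64)
            }
            CsvColumn::Float(v) if !v.is_empty() => {
                let sum: f64 = v.iter().map(|&x| f64::from(x)).sum();
                Some(sum / v.len() as f64)
            }
            _ => None,
        }
    }
}
```

Each column still holds a single element type, and the `match` makes the compiler check that every variant is handled.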


@H2CO3 I tried to extend the trait ColumnData to be able to return a particular Column's data, but doing so required defining the generic type T on ColumnData. This meant the DataFrame struct would also have to define T at compile time, and all columns would then have to share that same type.

Am I missing something?

PS: An alternative I tried (which worked) was defining specific structs such as StrColumn and then downcasting, but it was very verbose.

use std::any::Any;

trait Value {}

trait ColumnData<T> {
    fn as_any(&self) -> &dyn Any;
    fn get_data(&self) -> &Vec<T>;
}

impl Value for f32 {}
impl Value for String {}

struct Column<T: Value> {
    name: String,
    data: Vec<T>,
}

impl<T: Value> ColumnData<T> for Column<T> {
    fn as_any(&self) -> &dyn Any {
        self
    }

    fn get_data(&self) -> &Vec<T> {
        &self.data
    }
}

struct DataFrame {
    columns: Vec<Box<dyn ColumnData>> // <<<<<<<<< COMPILER ERROR
}

impl DataFrame {}

You are not going to be able to use generics for dynamic typing. Generics are the opposite of dynamic typing.

You'll have to remove the generic type parameter from your trait and add a generic downcasting method on the dyn ColumnData type itself. Playground.
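For readers without the link, here is a minimal sketch of that idea. It is not necessarily the code from the linked playground, and the `'static` bound on `Value`, the `name` method, and the `sample_frame` helper are assumptions added to make the example self-contained.

```rust
use std::any::Any;

// `'static` is required so that Column<T> can be turned into `&dyn Any`.
trait Value: 'static {}
impl Value for f32 {}
impl Value for String {}

// No generic parameter on the trait, so `dyn ColumnData` is a single type.
trait ColumnData {
    fn name(&self) -> &str;
    fn as_any(&self) -> &dyn Any;
}

struct Column<T: Value> {
    name: String,
    data: Vec<T>,
}

impl<T: Value> ColumnData for Column<T> {
    fn name(&self) -> &str {
        &self.name
    }
    fn as_any(&self) -> &dyn Any {
        self
    }
}

// Inherent method on the trait object itself: the caller names the element
// type and gets None back if the column holds something else.
impl dyn ColumnData {
    fn get_data<T: Value>(&self) -> Option<&[T]> {
        self.as_any()
            .downcast_ref::<Column<T>>()
            .map(|column| column.data.as_slice())
    }
}

struct DataFrame {
    columns: Vec<Box<dyn ColumnData>>,
}

fn sample_frame() -> DataFrame {
    DataFrame {
        columns: vec![
            Box::new(Column {
                name: "col_floats".to_string(),
                data: vec![1.0f32, 2.0],
            }),
            Box::new(Column {
                name: "col_strs".to_string(),
                data: vec!["hello".to_string()],
            }),
        ],
    }
}
```

A call such as `df.columns[0].get_data::<f32>()` then returns `Some` slice for a float column and `None` if the requested type does not match, so the type check happens at runtime.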
