Macro to reduce code duplication in match

In the code base of arrow2 we use trait objects and I constantly have to write things like:

pub fn new_null_array(data_type: DataType, length: usize) -> Box<dyn Array> {
    match data_type {
        DataType::Null => Box::new(NullArray::new_null(length)),
        DataType::Boolean => Box::new(BooleanArray::new_null(length)),
        DataType::Int8 => Box::new(PrimitiveArray::<i8>::new_null(data_type, length)),
        DataType::Int16 => Box::new(PrimitiveArray::<i16>::new_null(data_type, length)),
        DataType::Int32
        | DataType::Date32
        | DataType::Time32(_)
        | DataType::Interval(IntervalUnit::YearMonth) => {
            Box::new(PrimitiveArray::<i32>::new_null(data_type, length))
        }
        DataType::Interval(IntervalUnit::DayTime) => {
            Box::new(PrimitiveArray::<days_ms>::new_null(data_type, length))
        }
        DataType::Int64
        | DataType::Date64
        | DataType::Time64(_)
        | DataType::Timestamp(_, _)
        | DataType::Duration(_) => Box::new(PrimitiveArray::<i64>::new_null(data_type, length)),
        DataType::Decimal(_, _) => Box::new(PrimitiveArray::<i128>::new_null(data_type, length)),
        DataType::UInt8 => Box::new(PrimitiveArray::<u8>::new_null(data_type, length)),
        DataType::UInt16 => Box::new(PrimitiveArray::<u16>::new_null(data_type, length)),
        DataType::UInt32 => Box::new(PrimitiveArray::<u32>::new_null(data_type, length)),
        DataType::UInt64 => Box::new(PrimitiveArray::<u64>::new_null(data_type, length)),
        DataType::Float16 => unreachable!(),
        DataType::Float32 => Box::new(PrimitiveArray::<f32>::new_null(data_type, length)),
        DataType::Float64 => Box::new(PrimitiveArray::<f64>::new_null(data_type, length)),
        DataType::Binary => Box::new(BinaryArray::<i32>::new_null(length)),
        DataType::LargeBinary => Box::new(BinaryArray::<i64>::new_null(length)),
        DataType::FixedSizeBinary(_) => Box::new(FixedSizeBinaryArray::new_null(data_type, length)),
        DataType::Utf8 => Box::new(Utf8Array::<i32>::new_null(length)),
        DataType::LargeUtf8 => Box::new(Utf8Array::<i64>::new_null(length)),
        DataType::List(_) => Box::new(ListArray::<i32>::new_null(data_type, length)),
        DataType::LargeList(_) => Box::new(ListArray::<i64>::new_null(data_type, length)),
        DataType::FixedSizeList(_, _) => Box::new(FixedSizeListArray::new_null(data_type, length)),
        DataType::Struct(fields) => Box::new(StructArray::new_null(&fields, length)),
        DataType::Union(_, _, _) => Box::new(UnionArray::new_null(data_type, length)),
        DataType::Dictionary(key_type, value_type) => match key_type.as_ref() {
            DataType::Int8 => Box::new(DictionaryArray::<i8>::new_null(*value_type, length)),
            DataType::Int16 => Box::new(DictionaryArray::<i16>::new_null(*value_type, length)),
            DataType::Int32 => Box::new(DictionaryArray::<i32>::new_null(*value_type, length)),
            DataType::Int64 => Box::new(DictionaryArray::<i64>::new_null(*value_type, length)),
            DataType::UInt8 => Box::new(DictionaryArray::<u8>::new_null(*value_type, length)),
            DataType::UInt16 => Box::new(DictionaryArray::<u16>::new_null(*value_type, length)),
            DataType::UInt32 => Box::new(DictionaryArray::<u32>::new_null(*value_type, length)),
            DataType::UInt64 => Box::new(DictionaryArray::<u64>::new_null(*value_type, length)),
            _ => unreachable!(),
        },
    }
}

You can think of DataType as representing a logical (semantic) type, and of the different structs as representing the physical (in-memory) representation.

I have to write these matches because the DataType enum determines which physical representation a Box<dyn Array> holds. There are probably more than 10 places where this match, or a small variation of it, has to be written.

Since this pattern shows up several times (with small variations), I wonder: is there a macro (or macro pattern) that can generate it without my having to write all of the above?


FWIW, "sprawling matches" is a problem that also plagues rust-analyzer (today there are probably hundreds of matches like that). Our problem seems to be slightly different -- our enums have fewer variants, but there are a lot of different enums. We didn't find a better solution than to write the matches ourselves.

One way to make it generic is to add a distinct zero-sized type (ZST) for each DataType variant; then you can write generic functions:

struct Boolean;
struct UInt32;

trait Repr {
  type Single;
  type Array;
}

impl Repr for Boolean {
  type Single = bool;
  type Array = Vec<u8>;
}

impl Repr for UInt32 {
  type Single = u32;
  type Array = Vec<u32>;
}

enum DataType { Boolean(Boolean), UInt32(UInt32) }

Then you'll be able to write fn new_null_array<T: Repr>(length: usize) -> Box<dyn Array>. With this, every match arm can be made exactly the same, so the match itself can be generated by a declarative macro.
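To make the idea concrete, here is a minimal sketch of how the marker types, a generic constructor and a declarative macro could fit together. The Array trait, ToyArray and the dispatch_new_null! macro are hypothetical stand-ins invented for this illustration; they are not arrow2's API, and Repr is trimmed to a single associated type.

```rust
// Hypothetical stand-in for arrow2's `Array` trait (not the real API).
trait Array {
    fn len(&self) -> usize;
    fn null_count(&self) -> usize;
}

// Toy physical representation: one optional slot per element.
struct ToyArray<T> {
    values: Vec<Option<T>>,
}

impl<T: 'static> Array for ToyArray<T> {
    fn len(&self) -> usize {
        self.values.len()
    }
    fn null_count(&self) -> usize {
        self.values.iter().filter(|v| v.is_none()).count()
    }
}

// Zero-sized marker types, one per logical type.
struct Boolean;
struct UInt32;

// Maps a marker type to the Rust type of a single element.
trait Repr {
    type Single: 'static;
}
impl Repr for Boolean {
    type Single = bool;
}
impl Repr for UInt32 {
    type Single = u32;
}

enum DataType {
    Boolean(Boolean),
    UInt32(UInt32),
}

// The generic function: written once, works for every marker type.
fn new_null_array<T: Repr>(length: usize) -> Box<dyn Array> {
    let values: Vec<Option<T::Single>> = (0..length).map(|_| None).collect();
    Box::new(ToyArray { values })
}

// Every arm has the same shape, so a declarative macro can stamp them out.
// This relies on each variant sharing its name with its marker type.
macro_rules! dispatch_new_null {
    ($data_type:expr, $length:expr; $($variant:ident),+ $(,)?) => {
        match $data_type {
            $( DataType::$variant(_) => new_null_array::<$variant>($length), )+
        }
    };
}

fn new_null(data_type: &DataType, length: usize) -> Box<dyn Array> {
    dispatch_new_null!(data_type, length; Boolean, UInt32)
}
```

With this, new_null(&DataType::UInt32(UInt32), 3) yields an array of length 3 whose slots are all null. The cost is that every logical type needs its own marker type and Repr impl, which is the boilerplate trade-off mentioned above.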

In rust-analyzer, we experimented a bit with such a visitor-style API, but in the end decided that a simple match plus multiple cursors in the editor is the more maintainable solution.


So, I've been staring at that match of yours, and it's still not consistent enough for everything to be handled by a macro (or some other API trick) in a simple way.

One thing you can already do is indeed to factor out the boxes:

macro_rules! boxed_rhs {(
    match $scrutinee:expr, {
        $(
            // `pat_param` (Rust 1.53+) may be followed by `|`, also in edition 2021
            $(|)? $($pat:pat_param)|+ $(if $guard:expr)? => $rhs:expr
        ),+ $(,)?
    }
) => (
    match $scrutinee {
        $(
            $(| $pat)+ $(if $guard)? => ::std::boxed::Box::new($rhs),
        )+
    }
)}

That already saves you the Box::new() repetition:

boxed_rhs!(match data_type, {
    DataType::Null => NullArray::new_null(length),
    …
})
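For anyone who wants to try this in isolation, here is a self-contained sketch. The Array trait, the three array structs and the four-variant DataType are hypothetical toy stand-ins, not arrow2's types, and the macro is restated with pat_param so that it also compiles under edition 2021.

```rust
// Wraps the right-hand side of every match arm in `Box::new(..)`.
macro_rules! boxed_rhs {(
    match $scrutinee:expr, {
        $(
            $(|)? $($pat:pat_param)|+ $(if $guard:expr)? => $rhs:expr
        ),+ $(,)?
    }
) => (
    match $scrutinee {
        $(
            $(| $pat)+ $(if $guard)? => ::std::boxed::Box::new($rhs),
        )+
    }
)}

// Hypothetical toy trait and structs, just enough to exercise the macro.
trait Array {
    fn len(&self) -> usize;
    fn name(&self) -> &'static str;
}

struct NullArray(usize);
struct BooleanArray(usize);
struct Int32Array(usize);

impl Array for NullArray {
    fn len(&self) -> usize { self.0 }
    fn name(&self) -> &'static str { "null" }
}
impl Array for BooleanArray {
    fn len(&self) -> usize { self.0 }
    fn name(&self) -> &'static str { "boolean" }
}
impl Array for Int32Array {
    fn len(&self) -> usize { self.0 }
    fn name(&self) -> &'static str { "int32" }
}

enum DataType {
    Null,
    Boolean,
    Int32,
    Date32,
}

// The function's return type makes each `Box<Concrete>` arm coerce to
// `Box<dyn Array>`, exactly as in the hand-written match.
fn new_null_array(data_type: DataType, length: usize) -> Box<dyn Array> {
    boxed_rhs!(match data_type, {
        DataType::Null => NullArray(length),
        DataType::Boolean => BooleanArray(length),
        DataType::Int32 | DataType::Date32 => Int32Array(length),
    })
}
```

Or-patterns and guards pass through unchanged; only the boxing is factored out, so the arms themselves still have to be listed by hand.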

To go further than that, you would need to partition your variants into "kinds", probably by nesting inner enums inside the enum, so as to have, based on my observation:

  • null,
  • boolean,
  • primitives (nested enum!),
  • binary arrays,
  • a bunch of specific one-offs, such as the fixed-size binary array,
  • dictionary arrays (this one already features a sub-enum, with easy-to-factor repetition, so it's perfect for a macro).
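As a rough illustration of the nesting idea for the primitives group, the sketch below moves the primitive variants into an inner enum so that the outer match needs a single arm for all of them. PrimitiveType, match_primitive! and element_width are hypothetical names made up for this example, and the three-variant DataType is a toy, not arrow2's enum.

```rust
use std::mem::size_of;

// Inner enum: one variant per physical primitive type.
#[derive(Clone, Copy, Debug, PartialEq)]
enum PrimitiveType {
    Int8,
    Int16,
    Int32,
    Int64,
    UInt8,
    UInt16,
    UInt32,
    UInt64,
    Float32,
    Float64,
}

// Outer enum: all primitives collapse into one variant.
#[derive(Debug, PartialEq)]
enum DataType {
    Null,
    Boolean,
    Primitive(PrimitiveType),
}

// Calls the generic, zero-argument function `$f` with the Rust type
// that corresponds to the given `PrimitiveType`.
macro_rules! match_primitive {
    ($primitive:expr, $f:ident) => {
        match $primitive {
            PrimitiveType::Int8 => $f::<i8>(),
            PrimitiveType::Int16 => $f::<i16>(),
            PrimitiveType::Int32 => $f::<i32>(),
            PrimitiveType::Int64 => $f::<i64>(),
            PrimitiveType::UInt8 => $f::<u8>(),
            PrimitiveType::UInt16 => $f::<u16>(),
            PrimitiveType::UInt32 => $f::<u32>(),
            PrimitiveType::UInt64 => $f::<u64>(),
            PrimitiveType::Float32 => $f::<f32>(),
            PrimitiveType::Float64 => $f::<f64>(),
        }
    };
}

// The outer match stays short: one arm covers every primitive.
// Returning `None` for Null and Boolean is a toy choice for this sketch.
fn element_width(data_type: &DataType) -> Option<usize> {
    match data_type {
        DataType::Null | DataType::Boolean => None,
        DataType::Primitive(p) => Some(match_primitive!(*p, size_of)),
    }
}
```

The long list of primitives is then written once, inside match_primitive!, and every function that needs it reuses the macro instead of repeating the arms.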

An example for the dictionary array:

macro_rules! with_match_dictionary_key_type {(
    $key_type:expr, | $_:tt $T:ident | $($body:tt)*
) => ({
    macro_rules! __with_ty__ {( $_ $T:ident ) => ( $($body)* )}
    match $key_type {
        DataType::Int8 => __with_ty__! { i8 },
        DataType::Int16 => __with_ty__! { i16 },
        DataType::Int32 => __with_ty__! { i32 },
        …
        DataType::UInt64 => __with_ty__! { u64 },
        _ => ::core::unreachable!("A dictionary key type ought to feature only integer types"),
    }
})}

to be used as:

DataType::Dictionary(key_type, &value_type) => {
    with_match_dictionary_key_type!(key_type.as_ref(), |$T| {
        Box::new(DictionaryArray::<$T>::new_null(value_type, length))
    })
},
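Here is a self-contained variant of that dictionary macro for experimenting. The Array trait, the DictionaryArray struct and the flat DataType enum are hypothetical toys rather than arrow2's definitions; the dollar-sign capture is named $dollar instead of $_, and each arm is cast to Box<dyn Array> explicitly so the example does not depend on inference through the macro.

```rust
use std::marker::PhantomData;

// Hypothetical toy trait, standing in for arrow2's `Array`.
trait Array {
    fn len(&self) -> usize;
    fn key_bytes(&self) -> usize;
}

// Toy dictionary array that only remembers its length and key type.
struct DictionaryArray<K> {
    length: usize,
    _keys: PhantomData<K>,
}

impl<K> DictionaryArray<K> {
    fn new_null(length: usize) -> Self {
        Self { length, _keys: PhantomData }
    }
}

impl<K: 'static> Array for DictionaryArray<K> {
    fn len(&self) -> usize { self.length }
    fn key_bytes(&self) -> usize { std::mem::size_of::<K>() }
}

#[allow(dead_code)]
enum DataType {
    Int8, Int16, Int32, Int64,
    UInt8, UInt16, UInt32, UInt64,
    Utf8, // not a valid dictionary key type
}

// The caller writes `|$T| body`. The macro captures the `$` token as
// `$dollar`, defines a helper macro whose parameter is `$T`, and expands
// `body` once per integer key type.
macro_rules! with_match_dictionary_key_type {(
    $key_type:expr, | $dollar:tt $T:ident | $($body:tt)*
) => ({
    macro_rules! __with_ty__ {( $dollar $T:ident ) => ( $($body)* )}
    match $key_type {
        DataType::Int8 => __with_ty__! { i8 },
        DataType::Int16 => __with_ty__! { i16 },
        DataType::Int32 => __with_ty__! { i32 },
        DataType::Int64 => __with_ty__! { i64 },
        DataType::UInt8 => __with_ty__! { u8 },
        DataType::UInt16 => __with_ty__! { u16 },
        DataType::UInt32 => __with_ty__! { u32 },
        DataType::UInt64 => __with_ty__! { u64 },
        _ => ::core::unreachable!("A dictionary key type ought to feature only integer types"),
    }
})}

fn new_null_dictionary(key_type: &DataType, length: usize) -> Box<dyn Array> {
    with_match_dictionary_key_type!(key_type, |$T| {
        Box::new(DictionaryArray::<$T>::new_null(length)) as Box<dyn Array>
    })
}
```

The closure-like |$T| syntax is only a convention of this macro: the body is ordinary tokens in which $T is replaced by the concrete key type of each arm.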

This was already useful (+138 −446). Thank you, @Yandros.

@matklad ahaha, good to know that we are not alone here!

