How to use RecordBatch in Arrow when using datafusion?

Hi All,

I am playing with DataFusion to read a CSV file, so I tried the csv_sql.rs example from the apache/arrow-datafusion repository on GitHub.

I am able to run the example and see the output in the console. Now I want to transform the results into a custom struct that looks like this:
pub struct MyStruct {
    pub c1: String,
    pub min: f32,
    pub max: f32,
}

I tried let results: Vec<RecordBatch> = df.collect().await?; but I don't know how to get the individual cell values out of the Vec<RecordBatch>. I searched the documentation, but the examples always end with .show() and never mention RecordBatch. Does anyone know how I can get a Vec<MyStruct>?

I had the same difficulty when first starting with Arrow, as there are a bunch of layers to unwrap:

  1. A RecordBatch is a collection of columns, each an ArrayRef (basically a type-erased array, which can then be downcast to a specific array type such as an array of f32s or an array of Utf8 strings). The RecordBatch also contains a schema describing the column names and their specific types.
  2. There is a Vec<RecordBatch> because the data might be split into separate chunks for parallelism, so just treat each batch the same way. There might be a single RecordBatch or there might be many, each with a different number of rows, depending on how the reader and DataFusion decide to split things up, but they should all have the same columns.

So you would need to:

  1. Loop over each RecordBatch, asserting that its schema is the same as the first one's to be safe.
  2. For each RecordBatch, loop over the columns you are interested in (e.g. by iterating over schema.fields if you want all of them), giving you an &ArrayRef for each.
  3. Then call something like .as_any().downcast_ref::<array::Float32Array>() on the ref to get the concrete array type, which you can then loop over to get the actual values.

The last part can be tricky if you want to handle input types more flexibly (e.g. what if the values are f64? or i32? or ...), but if you expect only f32s then you just downcast to Float32Array and, if the conversion fails, treat it as an error.

I found the pretty-printer code useful as a reference for this; most of the logic is in util/display.rs.
