Hi everyone,
I'm working with a Rust-based data processing pipeline using the polars and arrow2 crates. I have a flow where I batch-read CSVs and write them to an Arrow IPC file using IpcWriter with compression enabled:
use std::fs::File;
use polars::prelude::*;

let file = File::create(&arrow_file_path).unwrap();
let mut writer = IpcWriter::new(file)
    .with_compression(Some(IpcCompression::ZSTD))
    .with_parallel(true)
    .batched(polars_schema);

// Pull the CSV in batches of 10 DataFrames and append each one to the IPC file.
while let Some(batches) = csv_batched_reader.next_batches(10).unwrap() {
    for df in batches {
        let transformed_batch = apply_transformation(df, tikv_schema.transformations.to_owned())?;
        let mut updated_batch = rename_columns(transformed_batch, tikv_schema.rename_fields.to_owned())?;
        writer.batch(&mut updated_batch).unwrap();
    }
}
writer.finish().unwrap();
This part works great and creates a compressed Arrow IPC file (~400MB in size). However, the issue arises when I try to read the file back for further processing.
If I use LazyFrame::scan_ipc, or even try IpcStreamReader, the entire file is loaded into memory and the process crashes due to high RAM usage (despite the machine having 24 GB of RAM). I believe this is because the file was written in batched mode using IpcWriter::batched, but I can't find a way to read it back in batches.
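For reference, the read path currently looks roughly like this (a minimal sketch; read_back and the full-column select are just stand-ins for my real processing):

use polars::prelude::*;

fn read_back(arrow_file_path: &str) -> PolarsResult<DataFrame> {
    // Lazily scan the IPC file, then materialize it -- collect() is where memory blows up.
    let lf = LazyFrame::scan_ipc(arrow_file_path, ScanArgsIpc::default())?;
    lf.select([col("*")]).collect()
}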
My questions are:
- Is there a recommended way to read an Arrow IPC file (written using IpcWriter::batched) in batches, without loading the entire file into memory?
- Can IpcStreamReader or any other reader in arrow2 or polars read such files incrementally? (The sketch after this list shows the kind of chunk-at-a-time loop I mean.)
- Would it be better to use the Arrow V1 file format (FileWriter/FileReader) instead of the stream format for this use case?
- Any recommended approach for working with large Arrow files efficiently in Rust when both writing and reading in chunks?
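To make the incremental-reading question concrete, this is the kind of loop I'm hoping is possible (just a sketch, not working code: it assumes arrow2's io::ipc::read module, that the file polars wrote is in the Arrow IPC file format, and process_in_chunks is a made-up name; the FileReader::new signature also differs between arrow2 versions):

use std::fs::File;
use arrow2::io::ipc::read;

fn process_in_chunks(path: &str) -> Result<(), Box<dyn std::error::Error>> {
    let mut file = File::open(path)?;

    // Read only the footer/metadata up front, not the record batches.
    let metadata = read::read_file_metadata(&mut file)?;

    // FileReader is an iterator over record-batch chunks, so only one batch
    // should need to be in memory at a time. (Recent arrow2 versions take
    // (reader, metadata, projection, limit); older ones have no limit argument.)
    let reader = read::FileReader::new(file, metadata, None, None);

    for maybe_chunk in reader {
        let chunk = maybe_chunk?;
        // Placeholder for the real per-batch processing.
        println!("read a chunk with {} rows", chunk.len());
    }
    Ok(())
}

Is something like this (or a polars equivalent) the intended way to consume a file written with IpcWriter::batched?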
Any suggestions or best practices would be greatly appreciated!