Issue Writing Polars DataFrame in Chunks to Arrow/Parquet Without Corruption
What I Am Trying to Do
I'm trying to write a Polars DataFrame in chunks to either an Arrow IPC file or a Parquet file without loading the entire dataset into memory. My goal is to process and write batches iteratively, but I keep encountering file corruption issues when reading the final file.
What I Have Tried
I am using IpcWriter (for Arrow) and ParquetWriter (for Parquet), writing in a streaming fashion like this:
// Inside a function returning PolarsResult<()>; `csv_batched_reader`,
// `tikv_schema`, `arrow_file_path` and `first_batch` are set up earlier.
while let Some(batches) = csv_batched_reader.next_batches(10).unwrap() {
    for df in batches {
        let transformed_batch =
            apply_transformation(df, tikv_schema.transformations.to_owned())?;
        let mut updated_batch =
            rename_columns(transformed_batch, tikv_schema.rename_fields.to_owned())?;

        if first_batch {
            // First batch: create the file and write it, schema included.
            let file = File::create(&arrow_file_path).unwrap();
            let mut writer = IpcWriter::new(file)
                .with_compression(Some(IpcCompression::ZSTD))
                .with_parallel(true);
            writer.finish(&mut updated_batch).unwrap();
            first_batch = false;
        } else {
            // Subsequent batches: reopen the file in append mode.
            // This is the part I suspect is corrupting the file.
            let file = File::options().append(true).open(&arrow_file_path).unwrap();
            IpcWriter::new(file)
                .with_compression(Some(IpcCompression::ZSTD))
                .with_parallel(true)
                .finish(&mut updated_batch)
                .unwrap();
        }
    }
}
The problems I am running into:
Parquet file is unreadable: when I write Parquet the same way and read it back, I get: parquet: File out of specification: The page header reported the wrong page size
Arrow IPC appends incorrectly: I suspect File::options().append(true) is causing the corruption.
Parquet format doesn't support direct appends: I need a correct approach for writing multiple row groups in a single file (my best guess at the pattern is sketched below).
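From reading the polars-io docs, what I think I should be doing for Parquet is keeping one ParquetWriter open for the whole file and pushing each chunk through a batched writer, so that each chunk becomes a row group and the footer is written exactly once at the end. This is only a sketch of the pattern I have in mind, not something I have verified: I'm assuming ParquetWriter::batched and write_batch exist with roughly these signatures in my Polars version, and write_parquet_in_chunks is just my own placeholder name.

use std::fs::File;
use polars::prelude::*;

fn write_parquet_in_chunks(
    batches: impl Iterator<Item = DataFrame>,
    schema: &Schema,
    path: &str,
) -> PolarsResult<()> {
    let file = File::create(path)?;
    // Keep ONE writer open for the whole file; each write_batch should
    // become a row group, and finish() should write the footer once.
    // NOTE: batched()/write_batch() are assumed from the docs, not verified.
    let mut writer = ParquetWriter::new(file)
        .with_compression(ParquetCompression::Zstd(None))
        .batched(schema)?;
    for df in batches {
        writer.write_batch(&df)?;
    }
    writer.finish()?;
    Ok(())
}

Is this the intended pattern, or is there a better way?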
What I Need Help With:
Best way to write Polars DataFrames in chunks to Arrow or Parquet without loading everything into memory.
Proper way to append new data in Arrow IPC format without corrupting the file (a rough sketch of what I'm imagining is below this list).
Correct Parquet approach for handling multiple row groups efficiently.
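For the Arrow IPC side, my guess is the same idea as above: never reopen the file in append mode, keep a single writer open so the header and footer are each written once and every chunk goes in as a record batch. Again this is only a sketch under assumptions: I'm assuming IpcWriter exposes a batched()/write_batch() API similar to the Parquet one (I haven't confirmed the exact signatures), and write_ipc_in_chunks is my own placeholder name.

use std::fs::File;
use polars::prelude::*;

fn write_ipc_in_chunks(
    batches: impl Iterator<Item = DataFrame>,
    schema: &Schema,
    path: &str,
) -> PolarsResult<()> {
    let file = File::create(path)?;
    // One writer for the whole file: header once, one record batch per
    // chunk, footer written by finish(). batched() is assumed, not verified.
    let mut writer = IpcWriter::new(file)
        .with_compression(Some(IpcCompression::ZSTD))
        .batched(schema)?;
    for df in batches {
        writer.write_batch(&df)?;
    }
    writer.finish()?;
    Ok(())
}

I'm also aware that LazyFrame has sink_parquet / sink_ipc for streaming writes, but since I apply per-batch transformations and column renames, I'd like to understand the correct eager, batch-by-batch approach as well.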