Issue Writing Polars DataFrame in Chunks to Arrow/Parquet Without Corruption

What I Am Trying to Do

I'm trying to write a Polars DataFrame in chunks to either an Arrow IPC file or a Parquet file without loading the entire dataset into memory. My goal is to process and write batches iteratively, but I keep encountering file corruption issues when reading the final file.

What I Have Tried

I am using IpcWriter (for Arrow) and ParquetWriter (for Parquet), writing in a streaming fashion like this:

while let Some(batches) = csv_batched_reader.next_batches(10).unwrap() {
    for df in batches {
        let transformed_batch = apply_transformation(df, tikv_schema.transformations.to_owned())?;
        let mut updated_batch = rename_columns(transformed_batch, tikv_schema.rename_fields.to_owned())?;

        if first_batch {
            let file = File::create(&arrow_file_path).unwrap();
            let mut writer = IpcWriter::new(file)
                .with_compression(Some(IpcCompression::ZSTD))
                .with_parallel(true);
            writer.finish(&mut updated_batch).unwrap(); // Write first batch with schema
            first_batch = false;
        } else {
            let file = File::options().append(true).open(&arrow_file_path).unwrap();
            IpcWriter::new(file)
                .with_compression(Some(IpcCompression::ZSTD))
                .with_parallel(true)
                .finish(&mut updated_batch)
                .unwrap();
        }
    }
}

When I read back the Parquet output (written with ParquetWriter in the same chunked way), I get:

parquet: File out of specification: The page header reported the wrong page size

Arrow IPC appends incorrectly: I suspect File::options().append(true) is causing corruption.

Parquet format doesn't support direct appends: I need a correct approach for writing multiple row groups in a single file.

What I Need Help With:
- Best way to write Polars DataFrames in chunks to Arrow or Parquet without loading everything into memory.
- Proper way to append new data in Arrow IPC format without corrupting the file.
- Correct Parquet approach for handling multiple row groups efficiently.

How to correctly append to a file depends entirely on the specific file format — it may be as simple as .append(true), or not be possible without explicitly re-writing internal metadata/indexes.

As a general strategy that applies across formats, keep the file open and write all of your batches through a single writer, only closing it at the end. In the case of IpcWriter, it looks like you will want to use batched() instead of finish() to achieve this:

// Create the file once and keep a single batched writer open for the whole run.
let file = File::create(&arrow_file_path).unwrap();
let mut writer = IpcWriter::new(file)
    .with_compression(Some(IpcCompression::ZSTD))
    .with_parallel(true)
    .batched(polars_schema)
    .unwrap();

while let Some(batches) = csv_batched_reader.next_batches(10).unwrap() {
    for df in batches {
        let transformed_batch = apply_transformation(df, tikv_schema.transformations.to_owned())?;
        let updated_batch = rename_columns(transformed_batch, tikv_schema.rename_fields.to_owned())?;

        // Each call appends a record batch to the same open IPC file.
        writer.write_batch(&updated_batch).unwrap();
    }
}
// Writes the IPC footer; the file is only valid after this.
writer.finish().unwrap();
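
For Parquet, the same keep-it-open pattern should cover the multiple-row-groups question: instead of trying to append to an already-closed file, ParquetWriter also exposes a batched() writer, where each write_batch call adds the batch as row group(s) of the same file and finish() writes the footer/metadata once at the end. Here is a minimal sketch under the same assumptions as above (csv_batched_reader, apply_transformation, rename_columns, tikv_schema and polars_schema are your existing values; parquet_file_path and the Zstd compression choice are placeholders), so treat it as a starting point rather than tested code:

use polars::prelude::*;
use std::fs::File;

// Open the Parquet file once and keep a single batched writer for the whole run.
let file = File::create(&parquet_file_path).unwrap();
let mut writer = ParquetWriter::new(file)
    .with_compression(ParquetCompression::Zstd(None))
    .batched(polars_schema)
    .unwrap();

while let Some(batches) = csv_batched_reader.next_batches(10).unwrap() {
    for df in batches {
        let transformed_batch = apply_transformation(df, tikv_schema.transformations.to_owned())?;
        let updated_batch = rename_columns(transformed_batch, tikv_schema.rename_fields.to_owned())?;

        // Each call writes the batch into the same open file as new row group(s);
        // nothing is appended to an already-finished file.
        writer.write_batch(&updated_batch).unwrap();
    }
}
// Writes the Parquet footer and metadata; the file is only readable after this.
writer.finish().unwrap();

If your batches are small, you may want to accumulate a few of them (e.g. with vstack) before each write_batch call so the resulting row groups don't end up tiny.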