Issue Writing Polars DataFrame in Chunks to Arrow/Parquet Without Corruption
What I Am Trying to Do
I'm trying to write a Polars DataFrame in chunks to either an Arrow IPC file or a Parquet file without loading the entire dataset into memory. My goal is to process and write batches iteratively, but I keep encountering file corruption issues when reading the final file.
What I Have Tried
I am using IpcWriter (for Arrow) and ParquetWriter (for Parquet), writing in a streaming fashion like this:
// Inside a function returning PolarsResult<()>; `csv_batched_reader`,
// `tikv_schema`, `arrow_file_path` and `first_batch` are set up earlier.
while let Some(batches) = csv_batched_reader.next_batches(10).unwrap() {
    for df in batches {
        let transformed_batch =
            apply_transformation(df, tikv_schema.transformations.to_owned())?;
        let mut updated_batch =
            rename_columns(transformed_batch, tikv_schema.rename_fields.to_owned())?;

        if first_batch {
            // First batch: create the file and write it, schema included.
            let file = File::create(&arrow_file_path).unwrap();
            let mut writer = IpcWriter::new(file)
                .with_compression(Some(IpcCompression::ZSTD))
                .with_parallel(true);
            writer.finish(&mut updated_batch).unwrap();
            first_batch = false;
        } else {
            // Subsequent batches: reopen the file in append mode.
            // This is the part I suspect is corrupting the file.
            let file = File::options().append(true).open(&arrow_file_path).unwrap();
            IpcWriter::new(file)
                .with_compression(Some(IpcCompression::ZSTD))
                .with_parallel(true)
                .finish(&mut updated_batch)
                .unwrap();
        }
    }
}
The problems I am running into:
Parquet file is unreadable: when I write Parquet the same way and read it back, I get: parquet: File out of specification: The page header reported the wrong page size
Arrow IPC appends incorrectly: I suspect File::options().append(true) is causing the corruption.
Parquet format doesn't support direct appends: I need a correct approach for writing multiple row groups in a single file (my best guess at the pattern is sketched below).
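From reading the polars-io docs, what I think I should be doing for Parquet is keeping one ParquetWriter open for the whole file and pushing each chunk through a batched writer, so that each chunk becomes a row group and the footer is written exactly once at the end. This is only a sketch of the pattern I have in mind, not something I have verified: I'm assuming ParquetWriter::batched and write_batch exist with roughly these signatures in my Polars version, and write_parquet_in_chunks is just my own placeholder name.

use std::fs::File;
use polars::prelude::*;

fn write_parquet_in_chunks(
    batches: impl Iterator<Item = DataFrame>,
    schema: &Schema,
    path: &str,
) -> PolarsResult<()> {
    let file = File::create(path)?;
    // Keep ONE writer open for the whole file; each write_batch should
    // become a row group, and finish() should write the footer once.
    // NOTE: batched()/write_batch() are assumed from the docs, not verified.
    let mut writer = ParquetWriter::new(file)
        .with_compression(ParquetCompression::Zstd(None))
        .batched(schema)?;
    for df in batches {
        writer.write_batch(&df)?;
    }
    writer.finish()?;
    Ok(())
}

Is this the intended pattern, or is there a better way?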
What I Need Help With:
Best way to write Polars DataFrames in chunks to Arrow or Parquet without loading everything into memory.
Proper way to append new data in Arrow IPC format without corrupting the file (a rough sketch of what I'm imagining is below this list).
Correct Parquet approach for handling multiple row groups efficiently.
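For the Arrow IPC side, my guess is the same idea as above: never reopen the file in append mode, keep a single writer open so the header and footer are each written once and every chunk goes in as a record batch. Again this is only a sketch under assumptions: I'm assuming IpcWriter exposes a batched()/write_batch() API similar to the Parquet one (I haven't confirmed the exact signatures), and write_ipc_in_chunks is my own placeholder name.

use std::fs::File;
use polars::prelude::*;

fn write_ipc_in_chunks(
    batches: impl Iterator<Item = DataFrame>,
    schema: &Schema,
    path: &str,
) -> PolarsResult<()> {
    let file = File::create(path)?;
    // One writer for the whole file: header once, one record batch per
    // chunk, footer written by finish(). batched() is assumed, not verified.
    let mut writer = IpcWriter::new(file)
        .with_compression(Some(IpcCompression::ZSTD))
        .batched(schema)?;
    for df in batches {
        writer.write_batch(&df)?;
    }
    writer.finish()?;
    Ok(())
}

I'm also aware that LazyFrame has sink_parquet / sink_ipc for streaming writes, but since I apply per-batch transformations and column renames, I'd like to understand the correct eager, batch-by-batch approach as well.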