Parquet Reading

The parquet_derive docs suggest this approach for iterating through a file:

use parquet::record::RecordReader;
use parquet::file::{serialized_reader::SerializedFileReader, reader::FileReader};
use parquet_derive::{ParquetRecordReader};
use std::fs::File;

#[derive(ParquetRecordReader)]
struct ACompleteRecord {
    pub a_bool: bool,
    pub a_string: String,
}

pub fn read_some_records() -> Vec<ACompleteRecord> {
  let mut samples: Vec<ACompleteRecord> = Vec::new();
  let file = File::open("some_file.parquet").unwrap();

  let reader = SerializedFileReader::new(file).unwrap();
  let mut row_group = reader.get_row_group(0).unwrap();
  samples.read_from_row_group(&mut *row_group, 1).unwrap();
  samples
}

I'm a bit confused by this, because the code seems to read a single row from the row group, but a Parquet row group generally contains many rows.

The docs are sparse - can anyone enlighten me?

https://arrow.apache.org/rust/parquet/record/trait.RecordReader.html#tymethod.read_from_row_group

fn read_from_row_group(
    &mut self,
    row_group_reader: &mut dyn RowGroupReader,
    num_records: usize,
) -> Result<(), ParquetError>

Read up to num_records records from row_group_reader into self.

You're passing 1 for num_records, so it will read at most one record.
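
To read more per call, bump the bound, e.g. (1024 is an arbitrary illustrative batch size, and this assumes the row group actually holds that many rows):

    samples.read_from_row_group(&mut *row_group, 1024).unwrap();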

I'm not doing that; their sample code is. It's verbatim from the linked docs. If that's the sample they offer for processing a file, it's a bit weak.

And I don't see a way to get the number of rows in the group from a RowGroupReader.

If your inference is correct, it would seem something like this is what's needed:

        row_group.get_row_iter(None).unwrap().for_each(|record| {
            let row: parquet::record::Row = record.unwrap();
            // Assumes a From<Row> impl for ACompleteRecord; parquet_derive
            // doesn't generate one (sketch below).
            let record = ACompleteRecord::from(row);
            samples.push(record);
        });
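
(parquet_derive doesn't generate that From<Row> impl, so it would have to be hand-written with the RowAccessor getters. A rough sketch, assuming the field order above:)

use parquet::record::{Row, RowAccessor};

// Hypothetical hand-rolled conversion; column indices follow the
// declared field order: a_bool, a_string.
impl From<Row> for ACompleteRecord {
    fn from(row: Row) -> Self {
        ACompleteRecord {
            a_bool: row.get_bool(0).unwrap(),
            a_string: row.get_string(1).unwrap().clone(),
        }
    }
}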

Just pass a large number for num_records: more than the maximum number of records you'd expect, but not so large that you'd use too much memory. It's just an upper bound.

Ok... that's sensible.

(This code will handle files of around 5 TB, and we don't control the row group size.)

It turns out this doesn't work, because if you pass a number greater than the number of rows in the file, you get this:

thread 'parquet_reader::test_read_records' panicked at services/src/parquet_reader.rs:10:10:
index out of bounds: the len is 66945 but the index is 66945

And, of course, I couldn't find a way to get the row count without iterating first (though see the metadata sketch below).
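
(Edit: digging into the traits a bit more, the footer metadata does appear to expose row counts, both per file and per row group, via FileReader::metadata() and RowGroupReader::metadata(). A minimal sketch, untested:)

use parquet::file::{serialized_reader::SerializedFileReader, reader::FileReader};
use std::fs::File;

fn print_row_counts(path: &str) {
    let reader = SerializedFileReader::new(File::open(path).unwrap()).unwrap();

    // Total row count comes straight from the file footer; no data pages are read.
    println!("file rows: {}", reader.metadata().file_metadata().num_rows());

    // Per-row-group counts, also from metadata.
    for i in 0..reader.num_row_groups() {
        let row_group = reader.get_row_group(i).unwrap();
        println!("row group {i}: {} rows", row_group.metadata().num_rows());
    }
}

In principle that per-group count could be passed as num_records to read_from_row_group, which should sidestep the overshoot panic.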

I think I'm just going to go back to the plain RowIter API:

    // RowIter over the whole file; format_row and delimiter are our own helpers.
    let mut row_iter = reader.get_row_iter(None).unwrap();
    while let Some(record) = row_iter.next() {
        println!("{}", format_row(&record.unwrap(), &delimiter));
    }

Not as handy, but it works.

Seems like a bug, since the docs say "up to num_records". Their tests only ever pass 1, and I don't see an issue for it in their repo.
