parquet_derive docs suggest this method for iterating through a file:
use parquet::record::RecordReader;
use parquet::file::{serialized_reader::SerializedFileReader, reader::FileReader};
use parquet_derive::ParquetRecordReader;
use std::fs::File;

#[derive(ParquetRecordReader)]
struct ACompleteRecord {
    pub a_bool: bool,
    pub a_string: String,
}

pub fn read_some_records() -> Vec<ACompleteRecord> {
    let mut samples: Vec<ACompleteRecord> = Vec::new();
    let file = File::open("some_file.parquet").unwrap();
    let reader = SerializedFileReader::new(file).unwrap();
    let mut row_group = reader.get_row_group(0).unwrap();
    samples.read_from_row_group(&mut *row_group, 1).unwrap();
    samples
}
I'm a bit confused by this, because the code seems to read a single row from the row group, but a Parquet row group contains n rows.
The docs are sparse - can anyone enlighten me?
https://arrow.apache.org/rust/parquet/record/trait.RecordReader.html#tymethod.read_from_row_group
fn read_from_row_group(
    &mut self,
    row_group_reader: &mut dyn RowGroupReader,
    num_records: usize,
) -> Result<(), ParquetError>

"Read up to num_records records from row_group_reader into self."
You're passing 1 for num_records, so it will return at most one record.
I'm not doing that, their sample code is. It's verbatim from the linked docs. If that's the sample they offer for processing a file, it's a bit weak.
And there is no way to get the number of rows in the group from a RowGroupReader.
If your inference is correct, it would seem something like this is what's needed:
row_group.get_row_iter(None).unwrap().for_each(|record| {
    let row: parquet::record::Row = record.unwrap();
    let record = ACompleteRecord::from(row);
    samples.push(record);
});
Just pass a large number for num_records: more than the maximum number of records you would expect, but not so large that you would use too much memory. It's just a safeguard.
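The capped-batch pattern that advice implies can be sketched without the parquet crate. MockRowGroup and read_into below are illustrative stand-ins, not the real API; the point is the loop that keeps calling with a fixed cap until a call returns fewer records than the cap:

```rust
// Illustrative stand-in for a row group: `read_into` reads up to
// `num_records` rows into `out`, like RecordReader::read_from_row_group.
struct MockRowGroup {
    rows: Vec<i32>,
    pos: usize,
}

impl MockRowGroup {
    fn read_into(&mut self, out: &mut Vec<i32>, num_records: usize) -> usize {
        let end = (self.pos + num_records).min(self.rows.len());
        out.extend_from_slice(&self.rows[self.pos..end]);
        let read = end - self.pos;
        self.pos = end;
        read
    }
}

fn main() {
    let mut group = MockRowGroup { rows: (0..10).collect(), pos: 0 };
    let mut samples = Vec::new();
    // Cap each call at 4 records; a short read means the group is drained.
    loop {
        let read = group.read_into(&mut samples, 4);
        if read < 4 {
            break;
        }
    }
    assert_eq!(samples.len(), 10);
    println!("read {} records", samples.len());
}
```

With a loop like this, the cap bounds memory per call rather than needing to exceed the (unknown) total row count.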
Ok... that's sensible.
(This code will handle files around 5 TB, and we don't control the row group size.)
It turns out this doesn't work, because if you pass a number greater than the number of rows in the file, you get this:
thread 'parquet_reader::test_read_records' panicked at services/src/parquet_reader.rs:10:10:
index out of bounds: the len is 66945 but the index is 66945
And, of course, there is no way to get the row count without iterating first.
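That panic message is the classic index == len off-by-one: the reader apparently tries to index one slot past the end when asked for more records than remain. A minimal self-contained illustration of the same message shape:

```rust
fn main() {
    let v = vec![0u8; 5];
    let i = v.len(); // one past the last valid index (valid indexes are 0..=4)
    // Indexing `v[i]` here would panic with:
    // "index out of bounds: the len is 5 but the index is 5"
    // A non-panicking bounds check instead:
    assert!(v.get(i).is_none());
    println!("get({}) on a len-{} vec returns None", i, v.len());
}
```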
I think I'm just going to go back to the plain RowIter API:
while let Some(record) = row_iter.next() {
    println!("{}", format_row(&record.unwrap(), &delimiter));
}
Not as handy, but it works.
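The while-let loop above is the usual shape for draining any iterator of Results. A self-contained illustration of that shape, where the rows, format_row, and the delimiter are stand-ins rather than the parquet types:

```rust
use std::fmt::Write;

// Stand-in for format_row: join a row's column values with a delimiter.
fn format_row(row: &[String], delimiter: &str) -> String {
    row.join(delimiter)
}

fn main() {
    // Stand-in for RowIter: an iterator of fallible records.
    let rows: Vec<Result<Vec<String>, String>> = vec![
        Ok(vec!["true".into(), "hello".into()]),
        Ok(vec!["false".into(), "world".into()]),
    ];
    let mut row_iter = rows.into_iter();

    let mut out = String::new();
    // Pull records until the iterator is exhausted, failing fast on errors.
    while let Some(record) = row_iter.next() {
        writeln!(out, "{}", format_row(&record.unwrap(), ",")).unwrap();
    }
    assert_eq!(out, "true,hello\nfalse,world\n");
    print!("{}", out);
}
```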
Seems like a bug, since the doc says "up to num_records". Their tests only pass 1. I don't see an issue for it in their repo.