How to read HDF5 tables

I need to read some HDF5 data that were originally written in Python using pandas.HDFStore.put(..., format='table'). I assume this corresponds to the HDF5 table format.

The (somewhat spartan) docs of the hdf5 crate appear not to mention 'table' at all.

I'm no HDF5 expert.

Can you share any wisdom on how to get this done?

So far I have confirmed that the size of the dtype() of the Dataset matches what I would expect one record of the table to occupy. I'm guessing I'd get there eventually if I implemented, by hand, something that describes the structure of the table:

#[derive(hdf5::H5Type, Clone, Debug)]
#[repr(C)]
struct MyTableRow {
    column_1: T1,
    column_2: T2,
    ...
}

but I'd hate to do that if there is a way of using the HDF5 metadata to generate this for me.

Also, the main data file I have weighs in at 1.6G, and it's not immediately obvious how to access portions of it with the hdf5 crate ... hints welcome.

The idiomatic usage of serialization formats in Rust is exactly that you make user-defined types and you access all data through strongly-typed accessors. Inferring dynamically-typed values is usually a last resort solution. Some serialization formats (e.g. bincode) don't even support that, because the types aren't even encoded in the serialized data.

So, the "I'd hate to do that" part is probably how the author of the hdf5 crate intended others to use the crate.

I appreciate that strongly-typed accessors are a valuable feature of Rust's philosophy, but being self-describing is a valuable feature of HDF5. While dynamically typed languages will naturally use the information dynamically, I would hope that (a mature implementation in) Rust would be able to use the metadata statically (à la bindgen: show it the metadata, and it will spit out the code) to generate a type that describes the contents of some HDF5 component.

In my specific case that motivated this question, the table contains 21 columns, many of them with very similar names, making it very tedious and very easy for the human to make some mistake which will end up being a pain to notice and debug at runtime: this is a job for a computer!

As an added bonus it would be convenient if the crate provided some means of using the metadata dynamically, for quick-n-dirty exploration, but I would certainly prefer to use the static version for anything I would want to rely on.

Dynamic or static, obliging a human to go through the mechanical process of extracting the metadata and translating it into Rust is a waste of human brain cycles.

I appreciate that the crate is still young, so this is not a criticism: I just wanted to make sure that I wasn't missing anything.

[Now I need to work out how to get a portion of the data out, without having to load it all into memory.]

Here is an outline of the procedure:

  1. A struct (OneRow) describes the table layout.
  2. ndarray's s! macro is the most convenient way to tell read_slice_1d which rows of the table should be retrieved.
use std::error::Error;
use std::path::PathBuf;

// The s! macro specifies which slice of the table should be read
use ndarray::{s, Array1};

// Describe the columns in the table
#[derive(hdf5::H5Type, Clone, PartialEq, Debug)]
#[repr(C)]
pub struct OneRow {
    column1: T1,
    column2: T2,
    // ...
}

pub fn read_table(filename: PathBuf, dataset: String) -> Result<Array1<OneRow>, Box<dyn Error>> {
    let file = ::hdf5::File::open(filename)?;
    let table = file.dataset(&dataset)?;
    let start = 10000;
    let stop  = 20000;
    Ok(table.read_slice_1d::<OneRow,_>(s![start..stop])?)
}

@jacg You should open an issue on this in the repository. As far as I am aware, none of the high-level (HL) functionality of HDF5 is supported yet. Having a macro to derive an H5Type from an existing file would be awesome (Diesel-style).

Using hdf5 efficiently would involve splitting your dataset, using a struct of arrays instead of an array of structs. You could then set attributes (e.g. units) on your datasets to get a self-described format. The hdf5 crate is capable of iterating through datasets, which is what you need when working with dynamic data.

I would further recommend looking at netCDF / the netcdf crate if you want a further abstraction on top of HDF5, where the dimensionality of multiple datasets can be kept in sync automatically.