Arrow/parquet: official vs *2

The crates arrow and parquet are the official Rust implementations, which are part of the Apache Arrow project. There are also arrow2 and parquet2.

Have you any experience relevant to helping decide which implementations to use when trying to explore arrow-based technologies in Rust? (I have no previous Arrow experience in any other language.)

When I used it, the API for the official Rust implementation didn't feel like idiomatic Rust, while arrow2 was a lot nicer to use.

It's been about 7 months since then though, so I can't tell you exactly which bits felt unidiomatic.

We chose arrow2 and it definitely feels a lot more like I'm using a Rust library. No idea if the official one is getting better, but it was quite bad at the time we chose to switch (directly calling malloc and not handling malloc returning a null pointer, abusing transmute, etc.).
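
For readers unfamiliar with the failure mode mentioned here, this is a minimal sketch of what handling a failed raw allocation looks like using only the standard library. It is not code from either crate; `alloc_and_sum` is a hypothetical helper made up for illustration.

```rust
use std::alloc::{alloc, dealloc, handle_alloc_error, Layout};

// Hypothetical helper: allocates room for `len` i32 values by hand,
// fills them with 0, 1, 2, ... and returns their sum.
fn alloc_and_sum(len: usize) -> i32 {
    // `alloc` must not be called with a zero-sized layout.
    assert!(len > 0, "zero-sized allocations are not supported here");
    let layout = Layout::array::<i32>(len).expect("layout overflow");

    unsafe {
        let ptr = alloc(layout) as *mut i32;
        // The allocator signals failure by returning a null pointer.
        // Writing through it without this check would be UB.
        if ptr.is_null() {
            handle_alloc_error(layout);
        }

        // Initialise every element before reading any of them.
        for i in 0..len {
            ptr.add(i).write(i as i32);
        }

        let total: i32 = std::slice::from_raw_parts(ptr, len).iter().sum();

        // Release the memory with the same layout it was allocated with.
        dealloc(ptr as *mut u8, layout);
        total
    }
}

fn main() {
    println!("sum = {}", alloc_and_sum(4));
}
```

In most code a `Vec<i32>` would do all of this safely; the sketch only shows the check that is easy to forget when calling the allocator directly.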

There has been a conscious effort to port many of the great ideas from arrow2/parquet2 across over the last year, so I would encourage you to try both out and make your own judgement.

Some key features of both that may help you reach a decision:

  • Strongly typed APIs - both arrow and arrow2 provide strongly typed array and builder abstractions, along with corresponding kernels
  • Type-erased APIs - arrow provides type-erased kernels and array abstractions for types not known at compile time, arrow2 does not
  • Unchecked APIs - arrow provides unchecked, unsafe APIs as an escape valve (much like the standard library) for invariants that are not easily expressed, arrow2 has limited support
  • Sound - neither crate allows undefined behavior (UB) through safe APIs
  • Transmute-free - both crates only perform checked transmutation
  • Dependency size - arrow is distributed as multiple crates, arrow2 uses features
  • Nested parquet support - arrow2 has limited support for structured types in parquet, parquet has full support
  • Parquet Predicate Pushdown - arrow supports both indexed and lazy materialization of parquet predicates, parquet2 does not
  • Parquet performance - performance is comparable in the absence of predicates, and can drastically favor parquet if predicates are provided
  • Async support - both crates support async reading of parquet, arrow2 supports async writing
  • ObjectStore Support - parquet integrates easily with object storage which only supports range requests, arrow2's support for this is fairly primitive
  • SQL support - SQL support for arrow is provided by DataFusion, not sure about arrow2
  • Dataframe support - arrow has DataFusion, arrow2 has polars
  • Governance - arrow is an Apache project with a group of maintainers, arrow2 is largely maintained by Jorge (who is also an Arrow PMC member)
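
Since the strongly typed vs type-erased distinction may be unfamiliar if you have not used Arrow before, here is a rough sketch of the idea using only the standard library. The `Array` trait, `Int32Array`, `StringArray`, `sum_i32` and `sum_dyn` below are simplified stand-ins written for this post, not the real definitions from either crate. The arrow crate does pass arrays around as `Arc<dyn Array>` and downcasts via `as_any()`, but its actual types are considerably richer than this.

```rust
use std::any::Any;
use std::sync::Arc;

// Simplified stand-in for a type-erased array trait (not the real arrow trait).
trait Array: Any {
    fn len(&self) -> usize;
    fn as_any(&self) -> &dyn Any;
}

// Simplified stand-ins for strongly typed arrays.
struct Int32Array(Vec<i32>);
struct StringArray(Vec<String>);

impl Array for Int32Array {
    fn len(&self) -> usize {
        self.0.len()
    }
    fn as_any(&self) -> &dyn Any {
        self
    }
}

impl Array for StringArray {
    fn len(&self) -> usize {
        self.0.len()
    }
    fn as_any(&self) -> &dyn Any {
        self
    }
}

// Strongly typed kernel: the element type is known at compile time.
fn sum_i32(array: &Int32Array) -> i64 {
    array.0.iter().map(|&v| i64::from(v)).sum()
}

// Type-erased kernel: the concrete type is only discovered at runtime,
// e.g. because the schema came from a file. Returns None for other types.
fn sum_dyn(array: &dyn Array) -> Option<i64> {
    array.as_any().downcast_ref::<Int32Array>().map(sum_i32)
}

fn main() {
    let ints: Arc<dyn Array> = Arc::new(Int32Array(vec![1, 2, 3]));
    let strings: Arc<dyn Array> = Arc::new(StringArray(vec!["a".to_string()]));

    for column in [ints, strings] {
        match sum_dyn(column.as_ref()) {
            Some(total) => println!("len {}: sum = {}", column.len(), total),
            None => println!("len {}: not an Int32Array", column.len()),
        }
    }
}
```

The typed path gives you compile-time guarantees when you know the schema up front; the type-erased path is what you end up needing when column types are only known after reading a file or parsing a query.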

Disclaimer: I maintain the arrow and parquet crates
