October 19, 2022, 12:13pm
are the official Rust implementations which are part of the Apache Arrow project. There are also these
Have you any experience relevant to helping decide which implementations to use when trying to explore arrow-based technologies in Rust? (I have no previous Arrow experience in any other language.)
When I used it, the API for the official Rust implementation didn't feel like idiomatic Rust, while
arrow2 was a lot nicer to use.
It's been about 7 months since then though, so I can't tell you exactly which bits felt unidiomatic.
arrow2 and it definitely feels a lot more like I'm using a Rust library. No idea if the official one is getting better, but it was quite bad at the time we chose to switch (like directly calling malloc
and not handling malloc returning nullptr, abusing
Over the last year there has been a conscious effort to port many of the great ideas from arrow2/parquet2 across, so I would encourage you to try both out and make your own judgement.
Some key features of both that may help you reach a decision:
Strongly typed APIs - both arrow and arrow2 provide strongly typed array and builder abstractions, along with corresponding kernels
Type-erased APIs - arrow provides type-erased kernels and array abstractions for types not known at compile time, arrow2 does not
Unchecked APIs - arrow provides unchecked, unsafe APIs as an escape valve (much like the standard library) for invariants that are not easily expressed, arrow2 has limited support
Sound - neither crate allows UB through safe APIs
Transmute-free - both crates only perform checked transmutation
Dependency size - arrow is distributed as multiple crates, arrow2 uses features
Nested parquet support - arrow2 has limited support for structured types in parquet, parquet has full support
Parquet Predicate Pushdown - arrow supports both indexed and lazy materialization of parquet predicates, parquet2 does not
Parquet performance - performance is comparable in the absence of predicates, and can drastically favor parquet if predicates are provided
Async support - both crates support async reading of parquet, arrow2 supports async writing
ObjectStore Support - parquet integrates easily with object storage which only supports range requests, arrow2's support for this is fairly primitive
SQL support - SQL support for arrow is provided by DataFusion, not sure about arrow2
Dataframe support - arrow has DataFusion, arrow2 has polars
Governance - arrow is an Apache project with a group of maintainers, arrow2 is largely maintained by Jorge (who is also an Arrow PMC member)
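To make the first three items in the list a bit more concrete, here is a rough std-only sketch of what "strongly typed", "type-erased" and "unchecked" mean in practice. Everything in it (`Int32Column`, `Utf8Column`, `Column`, `sum_typed`, `sum_erased`, `value_unchecked`) is a made-up name for illustration, not the API of arrow or arrow2; it only assumes the general pattern of downcasting through `std::any::Any` and the standard library's `get_unchecked`.

```rust
use std::any::Any;

/// Hypothetical strongly typed column: the element type is part of the
/// Rust type, so reading values needs no runtime type check.
#[derive(Debug)]
pub struct Int32Column {
    values: Vec<i32>,
}

/// A second hypothetical column type, used to show a failed downcast.
#[derive(Debug)]
pub struct Utf8Column {
    values: Vec<String>,
}

/// Hypothetical type-erased interface for columns whose concrete type
/// is not known at compile time.
pub trait Column {
    fn len(&self) -> usize;
    fn as_any(&self) -> &dyn Any;
}

impl Column for Int32Column {
    fn len(&self) -> usize {
        self.values.len()
    }
    fn as_any(&self) -> &dyn Any {
        self
    }
}

impl Column for Utf8Column {
    fn len(&self) -> usize {
        self.values.len()
    }
    fn as_any(&self) -> &dyn Any {
        self
    }
}

impl Utf8Column {
    pub fn new(values: Vec<String>) -> Self {
        Self { values }
    }
}

impl Int32Column {
    pub fn new(values: Vec<i32>) -> Self {
        Self { values }
    }

    /// Checked access: returns `None` when `i` is out of bounds.
    pub fn value(&self, i: usize) -> Option<i32> {
        self.values.get(i).copied()
    }

    /// Unchecked access, the "escape valve" style of API.
    ///
    /// # Safety
    /// The caller must guarantee `i < self.len()`.
    pub unsafe fn value_unchecked(&self, i: usize) -> i32 {
        // The bounds check is skipped; the caller upholds the invariant.
        unsafe { *self.values.get_unchecked(i) }
    }
}

/// Strongly typed kernel: only accepts `Int32Column`, checked by the compiler.
pub fn sum_typed(col: &Int32Column) -> i64 {
    col.values.iter().map(|&v| i64::from(v)).sum()
}

/// Type-erased kernel: accepts any column and checks the type at runtime,
/// returning `None` if the column is not an `Int32Column`.
pub fn sum_erased(col: &dyn Column) -> Option<i64> {
    col.as_any().downcast_ref::<Int32Column>().map(sum_typed)
}
```

With this sketch, `sum_typed` refuses a `Utf8Column` at compile time, while `sum_erased` accepts it and returns `None` at runtime. That is roughly the trade-off between the two API styles: the typed one catches mistakes earlier, the erased one works when the schema is only known once the data has been read.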
Disclaimer: I maintain the arrow and parquet crates