Status of Avro in Rust

I need help dealing with Avro data in Rust: I can't figure out which crate to use.

Serde's documentation redirects to avro-rs, which carries a big warning: "avro-rs is no longer maintained here." That, in turn, redirects you to a sub-directory of the Apache Avro repository, but the crates.io badge on the README there just points to v0.0.1 of the crate. Searching crates.io for the apache-avro crate likewise only turns up v0.0.1, the docs.rs page for the crate is blank, and the source is just the starter library code cargo generates, plus a test module.

But if you descend into the Cargo.toml of the Rust sub-directory of the Apache Avro repository, it says the version is 0.14.0, and you can actually see real code.
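That in-repo version can at least be pulled straight from git while it remains unpublished - a sketch, assuming the package keeps the apache-avro name and the repository URL hasn't moved:

    [dependencies]
    # Unpublished crate fetched from git; name and URL are assumptions.
    apache-avro = { git = "https://github.com/apache/avro.git" }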

So I'm really confused about how an end user is supposed to consume the library. The overall Avro documentation and wiki don't mention Rust anywhere, so they're no help either.

The mailing list of https://avro.apache.org/ is maybe the best place to ask.

Have you tried arrow2? We have quite extensive support for reading from (example with sync and async), and writing to (example) Apache Avro.

We support essentially everything except schema evolution and default values.

That's almost perfect! I have already experimented with arrow2 but totally missed that it supports Avro IO (I've only used it with Parquet, JSON, and CSV).

Is there a way to read Avro "files" without the header with arrow2? Specifically, I'm trying to process messages received on Kafka, where the schema is provided by a schema registry. I can get the schema separately, but I don't understand what the "marker" is.

Do you have a file or byte stream example (i.e. the actual bytes) that I could take a look at to understand your question better?

An Avro file is usually composed of

  • a schema (in the header)
  • some metadata (in the header)
  • a file marker (16 bytes) (in the header)
  • a sequence of blocks

Each block should contain, at its end, the same "marker" as the file (it is basically a mechanism to ensure that the stream has not been corrupted). When no header exists, I imagine there is a similar mechanism. However, we would need to know the spec Kafka uses when writing Avro to the stream - it could be that it does not use this mechanism (and we can of course address this in arrow2 - we just need to know what the spec is ^^)
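For concreteness, here is a minimal sketch of that header layout as a Rust type (the field names are mine, purely illustrative):

    /// Sketch of the Avro object container header; field names are illustrative.
    #[allow(dead_code)]
    struct AvroFileHeader {
        /// Always b"Obj" followed by the format version byte 0x01.
        magic: [u8; 4],
        /// Avro-encoded map with entries such as "avro.schema" and "avro.codec".
        metadata: Vec<(String, Vec<u8>)>,
        /// 16 random bytes, repeated after every data block as a corruption check.
        sync_marker: [u8; 16],
    }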

From what I can gather from the MessageSerializer in Confluent's Kafka Python client, this is how they write the message: the first 5 bytes are special, with the first byte always 0 (MAGIC_BYTE), followed by the schema ID encoded as a big-endian 4-byte integer. The rest is the data as written by fastavro's schemaless_writer.

For example, the record

{"firstname": "sci", "lastname": "mas"}

with schema

{"name": "test_schema", "type": "record", "fields": [{"name": "firstname", "type": "string"}, {"name": "lastname", "type": "string"}]}

produces these bytes as the Kafka payload:

    [0, 0, 0, 0, 1, 6, 115, 99, 105, 6, 109, 97, 115]

The consumer does the opposite (sketched in code below):

  1. Ensure the first byte is 0.
  2. Use bytes 1..=4 (the big-endian schema ID) to retrieve the schema from the schema registry.
  3. Use fastavro's schemaless_reader to decode the rest of the payload.
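To make that concrete, here is a minimal, dependency-free Rust sketch that hand-decodes the example payload above (the read_long/read_string helpers are mine; real code would hand the post-header bytes to a proper Avro decoder along with the fetched schema). Avro zigzag-encodes string lengths, which is why each 3-character string is prefixed with the byte 6:

    /// Decodes an Avro zigzag-varint long at `pos`, advancing `pos` (sketch helper).
    fn read_long(buf: &[u8], pos: &mut usize) -> i64 {
        let mut n: u64 = 0;
        let mut shift = 0;
        loop {
            let b = buf[*pos];
            *pos += 1;
            n |= ((b & 0x7f) as u64) << shift;
            if b & 0x80 == 0 {
                break;
            }
            shift += 7;
        }
        // Undo zigzag encoding: 0 -> 0, 1 -> -1, 2 -> 1, 3 -> -2, ...
        ((n >> 1) as i64) ^ -((n & 1) as i64)
    }

    /// Reads an Avro string: zigzag-encoded length, then that many UTF-8 bytes.
    fn read_string(buf: &[u8], pos: &mut usize) -> String {
        let len = read_long(buf, pos) as usize;
        let s = String::from_utf8(buf[*pos..*pos + len].to_vec()).unwrap();
        *pos += len;
        s
    }

    fn main() {
        let payload: &[u8] = &[0, 0, 0, 0, 1, 6, 115, 99, 105, 6, 109, 97, 115];

        assert_eq!(payload[0], 0, "not Confluent wire format"); // MAGIC_BYTE
        let schema_id = u32::from_be_bytes(payload[1..5].try_into().unwrap());
        let rest = &payload[5..]; // the schemaless Avro body

        // With the example schema, the body is just two strings back to back.
        let mut pos = 0;
        let firstname = read_string(rest, &mut pos);
        let lastname = read_string(rest, &mut pos);
        println!("schema id {schema_id}: {firstname} {lastname}"); // 1: sci mas
    }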

This was pretty helpful! I wrote a PR with an example here, with what I think are the main ideas: Added example reading Avro produced by Kafka by jorgecarleitao · Pull Request #1151 · jorgecarleitao/arrow2. Would you mind reviewing it?

I'll review it and leave comments on GitHub.
