Status of Avro in Rust

I need help dealing with Avro data in Rust: I can't figure out which crate to use.

Serde's documentation redirects to avro-rs, which carries a big warning: "avro-rs is no longer maintained here." That, in turn, redirects you to a sub-directory of the Apache Avro repository, but the crates.io badge on the README there just points to v0.0.1 of the crate. Searching crates.io for the apache-avro crate likewise only turns up v0.0.1, the docs.rs page for the crate is blank, and the source is just the starter library code cargo generates, plus a test module.

But if you descend into the Cargo.toml of the Rust sub-directory of the Apache Avro repository, it says the version is 0.14.0, and you can actually see real code.
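That in-repo version can at least be pulled straight from git while it remains unpublished - a sketch, assuming the package keeps the apache-avro name and the repository URL hasn't moved:

    [dependencies]
    # Unpublished crate fetched from git; name and URL are assumptions.
    apache-avro = { git = "https://github.com/apache/avro.git" }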

So I'm really confused about how an end user is supposed to consume the library. The overall Avro documentation and wiki don't mention Rust anywhere, so they're no help either.

The mailing list of https://avro.apache.org/ is maybe the best place to ask.

Have you tried arrow2? We have quite extensive support for reading from (example with sync and async), and writing to (example) Apache Avro.

We support essentially everything except schema evolution and default values.

That's almost perfect! I have already experimented with arrow2 but totally missed that it supports Avro IO (I've only used it with Parquet, JSON, and CSV).

Is there a way to read Avro "files" without the header with arrow2? Specifically, I'm trying to process messages received on Kafka, where the schema is provided by a schema registry. I can get the schema separately, but I don't understand what the "marker" is.

Do you have a file or byte stream example (i.e. the actual bytes) that I could take a look at to understand your question better?

An Avro file is usually composed of

  • a schema (in the header)
  • some metadata (in the header)
  • a file marker (16 bytes) (in the header)
  • a sequence of blocks

Each block should contain, at its end, the same "marker" as the file (it is basically a mechanism to ensure that the stream has not been corrupted). When no header exists, I imagine there is a similar mechanism. However, we would need to know the spec Kafka uses when writing Avro to the stream - it could be that it does not use this mechanism (and we can of course address this in arrow2 - we just need to know what the spec is ^^)
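For concreteness, here is a minimal sketch of that header layout as a Rust type (the field names are mine, purely illustrative):

    /// Sketch of the Avro object container header; field names are illustrative.
    #[allow(dead_code)]
    struct AvroFileHeader {
        /// Always b"Obj" followed by the format version byte 0x01.
        magic: [u8; 4],
        /// Avro-encoded map with entries such as "avro.schema" and "avro.codec".
        metadata: Vec<(String, Vec<u8>)>,
        /// 16 random bytes, repeated after every data block as a corruption check.
        sync_marker: [u8; 16],
    }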

From what I can gather from the MessageSerializer in Confluent's Kafka Python client, this is how they write the message: the first 5 bytes are special, with the first byte always 0 (MAGIC_BYTE), followed by the schema ID encoded as a big-endian 4-byte integer. The rest is the data as written by fastavro's schemaless_writer.

For example, the record

{"firstname": "sci", "lastname": "mas"}

with schema

{"name": "test_schema", "type": "record", "fields": [{"name": "firstname", "type": "string"}, {"name": "lastname", "type": "string"}]}

produces these bytes as the Kafka payload:

    [0, 0, 0, 0, 1, 6, 115, 99, 105, 6, 109, 97, 115]

The consumer does the opposite (sketched in code below):

  1. Ensure the first byte is 0.
  2. Use bytes 1..=4 (the big-endian schema ID) to retrieve the schema from the schema registry.
  3. Use fastavro's schemaless_reader to decode the rest of the payload.
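To make that concrete, here is a minimal, dependency-free Rust sketch that hand-decodes the example payload above (the read_long/read_string helpers are mine; real code would hand the post-header bytes to a proper Avro decoder along with the fetched schema). Avro zigzag-encodes string lengths, which is why each 3-character string is prefixed with the byte 6:

    /// Decodes an Avro zigzag-varint long at `pos`, advancing `pos` (sketch helper).
    fn read_long(buf: &[u8], pos: &mut usize) -> i64 {
        let mut n: u64 = 0;
        let mut shift = 0;
        loop {
            let b = buf[*pos];
            *pos += 1;
            n |= ((b & 0x7f) as u64) << shift;
            if b & 0x80 == 0 {
                break;
            }
            shift += 7;
        }
        // Undo zigzag encoding: 0 -> 0, 1 -> -1, 2 -> 1, 3 -> -2, ...
        ((n >> 1) as i64) ^ -((n & 1) as i64)
    }

    /// Reads an Avro string: zigzag-encoded length, then that many UTF-8 bytes.
    fn read_string(buf: &[u8], pos: &mut usize) -> String {
        let len = read_long(buf, pos) as usize;
        let s = String::from_utf8(buf[*pos..*pos + len].to_vec()).unwrap();
        *pos += len;
        s
    }

    fn main() {
        let payload: &[u8] = &[0, 0, 0, 0, 1, 6, 115, 99, 105, 6, 109, 97, 115];

        assert_eq!(payload[0], 0, "not Confluent wire format"); // MAGIC_BYTE
        let schema_id = u32::from_be_bytes(payload[1..5].try_into().unwrap());
        let rest = &payload[5..]; // the schemaless Avro body

        // With the example schema, the body is just two strings back to back.
        let mut pos = 0;
        let firstname = read_string(rest, &mut pos);
        let lastname = read_string(rest, &mut pos);
        println!("schema id {schema_id}: {firstname} {lastname}"); // 1: sci mas
    }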

This was pretty helpful! I wrote a PR with an example here, with what I think are the main ideas: Added example reading Avro produced by Kafka by jorgecarleitao · Pull Request #1151 · jorgecarleitao/arrow2. Would you mind reviewing it?

I'll review it and leave comments on GitHub.
