Apache Avro: repeated save to file

Hi,
I am trying to use apache-avro v0.18.0, but I don't understand how to use it. To write data, the crate provides a Writer<'a, W: Write> whose lifetime is bound to the schema:

#[derive(bon::Builder)]
pub struct Writer<'a, W: Write> {
    schema: &'a Schema,
    writer: W,
    #[builder(skip)]
    resolved_schema: Option<ResolvedSchema<'a>>,
    #[builder(default = Codec::Null)]
    codec: Codec,
    #[builder(default = DEFAULT_BLOCK_SIZE)]
    block_size: usize,
    #[builder(skip = Vec::with_capacity(block_size))]
    buffer: Vec<u8>,
    #[builder(skip)]
    num_values: usize,
    #[builder(default = generate_sync_marker())]
    marker: [u8; 16],
    #[builder(default = false)]
    has_header: bool,
    #[builder(default)]
    user_metadata: HashMap<String, Value>,
}

My problem is that I want to write repeatedly to a single file. The options I see:

  • Keep the writer: how do I store it in a struct together with the schema? That would lead to a self-referential struct. I know there is a crate that helps with that, but it's kind of annoying.
  • Create a writer on every write: I tried that, but either it overwrites the whole file or, if the file is opened in append mode, it corrupts the file, probably because the header is written multiple times.
  • There is also pub fn append_to(schema: &'a Schema, writer: W, marker: [u8; 16]) -> Self, which does not write the header but takes this marker argument, which seems to be initialized with random values.

Does anyone know how to do this?

Thanks

I think the marker argument you need to pass is the sync marker that is stored in the file header. It looks like you could use apache_avro::read_marker to get it from a byte slice containing the previously written data. Maybe you could first create a Writer into a Vec, immediately call .into_inner() on it, and read the marker out of the resulting bytes. Then write those bytes to your file and keep the marker and the File. On the next write you could use Writer::append_to with the Schema, File and marker. See also this test: avro-rs/avro/tests/append_to_existing.rs at bab254d546d6f17d65ae4d3259dbe79efaa6f456 · apache/avro-rs · GitHub

Disclaimer: I've never used Apache Avro. I'm just going off the docs and this test I found.

Thanks, that helped. It seems that the marker is just the last 16 bytes written. I read those from the file and it seems to work.
I still do not understand the crate's design choice there. Why is Schema not simply Clone, with the Writer as its owner?
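
For reference, reading those trailing 16 bytes can be sketched with the standard library alone. This is only a minimal sketch, assuming the file is a well-formed Avro object container file, where the header and every data block end with the 16-byte sync marker. The name read_trailing_marker is a hypothetical helper for illustration and is not part of apache-avro.

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};
use std::path::Path;

/// Length of an Avro sync marker in bytes.
const SYNC_MARKER_LEN: usize = 16;

/// Reads the last 16 bytes of the file at `path`.
///
/// Hypothetical helper (not part of apache-avro). It assumes the file is a
/// well-formed Avro container file, so the trailing bytes are the sync marker.
fn read_trailing_marker(path: &Path) -> io::Result<[u8; SYNC_MARKER_LEN]> {
    let mut file = File::open(path)?;

    // A file shorter than the marker cannot end with one.
    let len = file.metadata()?.len();
    if len < SYNC_MARKER_LEN as u64 {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "file is too short to contain a sync marker",
        ));
    }

    // Jump to 16 bytes before the end and read exactly that many bytes.
    file.seek(SeekFrom::End(-(SYNC_MARKER_LEN as i64)))?;
    let mut marker = [0u8; SYNC_MARKER_LEN];
    file.read_exact(&mut marker)?;
    Ok(marker)
}
```

The returned array has the same type as the marker argument of append_to, so it could presumably be passed there together with the schema and the file reopened in append mode. That part is untested here.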

I believe cloning the schema would be relatively expensive due to all the memory allocations it needs.