Streaming zip archive


I'm trying to implement Zip archive creation with incremental streaming, to avoid holding an entire copy of the archive in memory.
Basically this means:

  1. write some data in the zip archive
  2. slice the buffered content of the archive
  3. stream it to Amazon s3
  4. repeat until all data sources have been written

Here is an example of that implemented in Java.

In Rust there is the zip crate for building zip archives.
Here is a simple example of what I'm trying to do:

use std::error::Error;
use std::fs::File;
use std::io::{Cursor, Read, Write};
use zip::write::FileOptions;

fn main() -> Result<(), Box<dyn Error>> {
    let cursor: Cursor<Vec<u8>> = Cursor::new(vec![]);
    let mut zip = zip::ZipWriter::new(cursor);
    let options = FileOptions::default().unix_permissions(0o755);

    let mut f1 = File::open("source_file/dark_ocean.png")?;
    let mut i_buf = [0; 256];
    let mut o_buf = [0; 256];
    zip.start_file("dark_ocean.png", options)?;
    loop {
        let n = f1.read(&mut i_buf[..])?;
        if n == 0 {
            break;
        }
        zip.write_all(&i_buf[..n])?;
        // here I would like to read the freshly written archive bytes
        // into o_buf, and then send out o_buf content to s3 or whatever
    }
    zip.finish()?;
    Ok(())
}

The problem with the zip crate is that I can't read and write the content of the archive buffer (Cursor::new(vec![])) at the same time to stream it, because the cursor is moved into the writer when it is created with zip::ZipWriter::new(cursor). I can only access the archive buffer after I call zip.finish(). But then I have to wait until all the data has been put into the archive, and this is not what I want.
I want to stream the content incrementally while it's being written.

So I'm not sure if this is possible to do with the zip crate. And I don't really have a choice, since I'm only interested in the zip format, and the zip crate seems to be the only solution available in Rust for zip archive creation.

Note: this is also possible in JS with the archiver library on npm.
I share this to better explain what I'm looking for.

That's not a problem with the crate, that's a problem with your mental model :upside_down_face: Probably the single biggest invention of Rust is disallowing simultaneous readers and writers, which prevents data races and memory management issues.

If you are 100% sure you are never going to read while writing (i.e. your code ensures that readers and writers take turns), then you have two possibilities:

  1. the more type-safe but slightly more cumbersome way, which I would prefer, is to rearrange your code in such a way that the Cursor is only created temporarily with a mutable reference to the vector (there's a blanket impl Write for &mut W where W: Write so that ought to work), and its position is remembered (and reset again and again) in a separate variable.
  2. You can use interior mutability and wrap the Cursor in a RefCell so that reading XOR writing is enforced at runtime. (This will allow you to mutate the cursor through shared references, and once again relies on the blanket impl mentioned above.)
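A minimal sketch of option 1, using plain std (no zip crate) — the helper name append_chunk is made up for illustration. The Cursor only exists inside the function, so between calls the Vec is a plain vector you can slice and ship:

```rust
use std::io::{Cursor, Seek, SeekFrom, Write};

/// Append `data` to the archive buffer through a short-lived Cursor,
/// keeping the write position in a separate variable so the Vec is
/// free to be read between writes. (Illustrative helper, not a real API.)
fn append_chunk(buf: &mut Vec<u8>, pos: &mut u64, data: &[u8]) -> std::io::Result<()> {
    // The Cursor borrows the Vec only for the duration of this call...
    let mut cursor = Cursor::new(&mut *buf);
    cursor.seek(SeekFrom::Start(*pos))?;
    cursor.write_all(data)?;
    *pos = cursor.position();
    Ok(())
} // ...and the borrow ends here, so the caller can read `buf` again.

fn main() -> std::io::Result<()> {
    let mut buf: Vec<u8> = Vec::new();
    let mut pos = 0u64;
    append_chunk(&mut buf, &mut pos, b"PK\x03\x04")?; // stand-in for zip bytes
    // Between writes the buffer is plainly readable: slice it and ship it.
    println!("{} bytes ready to upload", buf[..pos as usize].len());
    append_chunk(&mut buf, &mut pos, b"more data")?;
    assert_eq!(pos as usize, buf.len());
    Ok(())
}
```

Applying this pattern to ZipWriter directly is less straightforward, because ZipWriter owns its writer for its whole lifetime — which is where option 2's RefCell, or a custom writer type, comes in.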

I don't see anything wrong with your use case, but clearly passing a Cursor to ZipWriter::new won't work. You'll have to write your own Write + Seek implementation that performs the streaming.
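A sketch of what such a type could look like (illustrative, not part of the zip crate): it implements Write + Seek over an internal buffer and lets the caller drain the bytes that are ready to ship, refusing seeks back into already-shipped data.

```rust
use std::io::{self, Seek, SeekFrom, Write};

/// Illustrative streaming sink: buffers written bytes and lets the
/// caller take out everything written so far, e.g. as one S3 part.
struct StreamingBuf {
    buf: Vec<u8>, // bytes written but not yet shipped
    shipped: u64, // bytes already drained (no longer seekable)
    pos: u64,     // current absolute position in the stream
}

impl StreamingBuf {
    fn new() -> Self {
        StreamingBuf { buf: Vec::new(), shipped: 0, pos: 0 }
    }

    /// Move the buffered bytes out for uploading. Afterwards,
    /// seeking back before this point is an error.
    fn drain(&mut self) -> Vec<u8> {
        self.shipped += self.buf.len() as u64;
        self.pos = self.pos.max(self.shipped);
        std::mem::take(&mut self.buf)
    }
}

impl Write for StreamingBuf {
    fn write(&mut self, data: &[u8]) -> io::Result<usize> {
        let off = (self.pos - self.shipped) as usize;
        if off > self.buf.len() {
            self.buf.resize(off, 0); // zero-fill a gap left by a forward seek
        }
        // Overwrite the retained window first, then append the rest.
        let overlap = data.len().min(self.buf.len() - off);
        self.buf[off..off + overlap].copy_from_slice(&data[..overlap]);
        self.buf.extend_from_slice(&data[overlap..]);
        self.pos += data.len() as u64;
        Ok(data.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

impl Seek for StreamingBuf {
    fn seek(&mut self, from: SeekFrom) -> io::Result<u64> {
        let end = self.shipped + self.buf.len() as u64;
        let target = match from {
            SeekFrom::Start(n) => n as i64,
            SeekFrom::Current(d) => self.pos as i64 + d,
            SeekFrom::End(d) => end as i64 + d,
        };
        if target < self.shipped as i64 {
            return Err(io::Error::new(
                io::ErrorKind::InvalidInput,
                "cannot seek into already-shipped data",
            ));
        }
        self.pos = target as u64;
        Ok(self.pos)
    }
}
```

One caveat: a zip writer typically seeks back into the current entry's local header to patch sizes and CRC after the data is written, so with this kind of sink it would only be safe to drain between entries, not in the middle of one.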

The Java example you link to seems a bit faulty, as it buffers the entire zip file in memory (even though it overlaps sending already-compressed parts with compressing later parts).

Aside from that, even if you allowed for buffering the entire zipped content in memory, I think in Rust this will require unsafe code (or a crate) since you have no way to express "read from the first 5MB of this Vec in a separate thread" while the second 5MB of the same Vec are being written to in safe Rust.
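For completeness, one safe way to sidestep that limitation (a sketch, not tied to any zip crate) is to never share the Vec at all: move ownership of each finished chunk out of the buffer and hand it to an uploader thread over a channel.

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    let (tx, rx) = mpsc::channel::<Vec<u8>>();

    // Uploader thread: it owns each chunk outright, so no single Vec
    // is ever read and written simultaneously.
    let uploader = thread::spawn(move || {
        let mut total = 0;
        for chunk in rx {
            // pretend this is an S3 multipart upload
            total += chunk.len();
        }
        total
    });

    let mut buf: Vec<u8> = Vec::new();
    for part in [b"compressed " as &[u8], b"zip ", b"data"] {
        buf.extend_from_slice(part); // stand-in for ZipWriter output
        // Move the finished bytes out; `buf` is left empty for reuse.
        tx.send(std::mem::take(&mut buf)).unwrap();
    }
    drop(tx); // close the channel so the uploader loop ends

    assert_eq!(uploader.join().unwrap(), "compressed zip data".len());
}
```

This overlaps uploading with compression, like the Java example, but at the cost of an extra copy of each chunk rather than of the whole archive.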

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.