Hi,
What would be the best way to read log files that are inside nested zip files without extracting them first? I was hoping this would work with zip::read::ZipArchive, but I haven't had any luck so far. While I can get the names of the contained files using archive.file_names(), I don't see a way to use these names as handles to open a nested zip as a new ZipArchive.
The structure looks similar to this (can be deeper than 2 levels):
z1.zip
    z10.zip
        text01.txt
        text0n.txt
    z11.zip
        text11.txt
        text1n.txt
Each zip can contain a large number of files, so I want to avoid extracting the first level to disk, as that would be very costly. I'd also like to delete some of the .txt files from within the zips to reduce their size on disk. I think that should be straightforward once I have the handles I need to traverse the zipped files. I'm still new to Rust, so I might be missing an obvious solution here?
Unfortunately, it looks like zip::read::ZipFile doesn't implement std::io::Seek. If it did, you'd be able to pass a ZipFile as the Read+Seek reader to ZipArchive::new() and search the nested archive just like any other.
You don't need to extract the nested zip archive to disk in order to inspect its contents, though. As long as it fits in RAM, you could read z10.zip into a buffer, wrap the buffer in a std::io::Cursor, pass that to ZipArchive::new(), then get text01.txt using by_name().
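A minimal sketch of the buffering step, using only the standard library so it does not depend on any particular version of the zip crate. The helper names buffer_entry and read_tail are hypothetical; read_tail only stands in for a consumer such as ZipArchive::new() that needs Read + Seek.

```rust
use std::io::{self, Cursor, Read, Seek, SeekFrom};

/// Reads a forward-only reader fully into memory and returns a `Cursor`,
/// which implements both `Read` and `Seek`.
fn buffer_entry<R: Read>(mut entry: R) -> io::Result<Cursor<Vec<u8>>> {
    let mut bytes = Vec::new();
    entry.read_to_end(&mut bytes)?;
    Ok(Cursor::new(bytes))
}

/// Hypothetical stand-in for a consumer that requires `Read + Seek`:
/// seeks relative to the end and returns the last `n` bytes.
fn read_tail<R: Read + Seek>(mut reader: R, n: i64) -> io::Result<Vec<u8>> {
    reader.seek(SeekFrom::End(-n))?;
    let mut tail = Vec::new();
    reader.read_to_end(&mut tail)?;
    Ok(tail)
}
```

With the zip crate, the reader passed to buffer_entry would be the entry for z10.zip, and the returned Cursor would be what goes into ZipArchive::new(). The cost is that the whole nested archive is decompressed into memory once, even if only a few of its files are read afterwards.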
Thanks for the feedback. Yes, the missing std::io::Seek was the error message I was getting; I forgot to mention that above.
I wanted to avoid reading it into memory because I'm actually interested in only a few text files within the archives (maybe ~1%), so I was worried about throughput. I'll try to implement that anyway and see how good or bad the performance really is.
If memory use is a big concern, you can also write a wrapper that implements Seek by reopening the ZipFile for backwards jumps. It would look something like this (untested):
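use std::io;
use std::io::prelude::*;

/// Adds `Seek` to a forward-only reader by reopening it for backward jumps.
struct SeekWrapper<F: Read, OpenFn: Fn() -> F> {
    open: OpenFn, // returns a fresh reader positioned at offset 0
    reader: F,
    pos: u64, // current offset into the stream
}

impl<F: Read, OpenFn: Fn() -> F> SeekWrapper<F, OpenFn> {
    pub fn new(open: OpenFn) -> Self {
        let reader = open();
        SeekWrapper {
            open,
            reader,
            pos: 0,
        }
    }

    /// Moves forward by reading `count` bytes and throwing them away.
    fn discard(&mut self, mut count: u64) -> io::Result<()> {
        let mut buf = [0u8; 1024]; // MaybeUninit could be useful here
        while count != 0 {
            let batchsize = (buf.len() as u64).min(count) as usize;
            self.reader.read_exact(&mut buf[..batchsize])?;
            self.pos += batchsize as u64;
            count -= batchsize as u64;
        }
        Ok(())
    }
}

impl<F: Read, OpenFn: Fn() -> F> Read for SeekWrapper<F, OpenFn> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let bytes = self.reader.read(buf)?;
        self.pos += bytes as u64;
        Ok(bytes)
    }
}

impl<F: Read, OpenFn: Fn() -> F> Seek for SeekWrapper<F, OpenFn> {
    fn seek(&mut self, pos: io::SeekFrom) -> io::Result<u64> {
        use io::SeekFrom::*;
        match pos {
            Start(target) => {
                if target < self.pos {
                    // Backward jump: reopen and start over from offset 0.
                    self.reader = (self.open)();
                    self.pos = 0;
                }
                // Forward jumps only need to skip the difference.
                self.discard(target - self.pos)?;
            }
            End(_) => {
                // The stream length is unknown here. A zip's central directory
                // sits at the end of the archive, so a real implementation
                // would likely need the length to support this.
                return Err(io::Error::new(
                    io::ErrorKind::Unsupported,
                    "SeekFrom::End is not supported",
                ));
            }
            Current(offset) => {
                if offset >= 0 {
                    self.discard(offset as u64)?;
                } else {
                    // checked_sub avoids underflow when seeking before the start.
                    let target = self
                        .pos
                        .checked_sub(offset.unsigned_abs())
                        .ok_or_else(|| {
                            io::Error::new(io::ErrorKind::InvalidInput, "seek before start")
                        })?;
                    self.seek(Start(target))?;
                }
            }
        }
        Ok(self.pos)
    }
}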
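The wrapper above does not support SeekFrom::End, which a zip reader will probably need because the central directory is stored at the end of the archive. A rough sketch of one way around that, assuming the total length is known up front (for a zip entry, presumably its uncompressed size). LenSeek, resolve and the len field are hypothetical names, and the demo in the test uses an in-memory Cursor rather than a real zip entry.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Hypothetical variant of the wrapper that is told the stream length,
/// so every `SeekFrom` variant can be resolved to an absolute offset.
struct LenSeek<F: Read, OpenFn: Fn() -> F> {
    open: OpenFn,
    reader: F,
    pos: u64,
    len: u64, // assumed to be supplied by the caller
}

impl<F: Read, OpenFn: Fn() -> F> LenSeek<F, OpenFn> {
    fn new(open: OpenFn, len: u64) -> Self {
        let reader = open();
        LenSeek { open, reader, pos: 0, len }
    }
}

/// Applies a signed offset to a base position, returning None on overflow
/// or when the result would be negative.
fn resolve(base: u64, off: i64) -> Option<u64> {
    if off >= 0 {
        base.checked_add(off as u64)
    } else {
        base.checked_sub(off.unsigned_abs())
    }
}

impl<F: Read, OpenFn: Fn() -> F> Read for LenSeek<F, OpenFn> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.reader.read(buf)?;
        self.pos += n as u64;
        Ok(n)
    }
}

impl<F: Read, OpenFn: Fn() -> F> Seek for LenSeek<F, OpenFn> {
    fn seek(&mut self, from: SeekFrom) -> io::Result<u64> {
        // Resolve the request to an absolute offset first.
        let target = match from {
            SeekFrom::Start(n) => Some(n),
            SeekFrom::End(off) => resolve(self.len, off),
            SeekFrom::Current(off) => resolve(self.pos, off),
        }
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "seek out of range"))?;

        if target < self.pos {
            // Backward jump: start over with a fresh reader.
            self.reader = (self.open)();
            self.pos = 0;
        }
        // Skip forward by copying into a sink; this stops early if the
        // underlying stream ends before the target offset is reached.
        let to_skip = target - self.pos;
        let skipped = io::copy(&mut self.reader.by_ref().take(to_skip), &mut io::sink())?;
        self.pos += skipped;
        Ok(self.pos)
    }
}
```

Every backward seek still costs a reopen plus a re-read from the start, so this trades CPU time for memory. Whether that beats buffering the nested archive depends on how often the consumer jumps backwards.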