Flate2 is returning a larger bytes array after compression

I am using flate2 with features = ["zlib-ng"] to compress and decompress Vec<u8>. I wrote a test to verify decompression, and the assertion assert!(result.len() > compressed.len()) fails: the compressed byte vector is longer than the original byte vector.

I am new to this and I suppose I am missing something trivial. Here are the functions:

use std::io::prelude::*;
use std::io::Error;
use flate2::write::ZlibDecoder;

pub fn decompress(data: &[u8]) -> Result<Vec<u8>, Error> {
    // The decoder decompresses everything written to it into the inner Vec.
    let mut decoder = ZlibDecoder::new(Vec::new());
    decoder.write_all(data)?;
    // finish() flushes the stream and returns the inner Vec.
    Ok(decoder.finish()?)
}

and

use std::io::prelude::*;
use std::io::Error;
use flate2::Compression;
use flate2::write::ZlibEncoder;

pub fn compress(data: &[u8]) -> Result<Vec<u8>, Error> {
    // The encoder compresses everything written to it into the inner Vec.
    let mut encoder = ZlibEncoder::new(Vec::new(), Compression::fast());
    encoder.write_all(data)?;
    // finish() writes the zlib trailer and returns the inner Vec.
    Ok(encoder.finish()?)
}

Can someone help me here?

Here is the test I wrote for decompress:

#[cfg(test)]
mod tests {
    use crate::compression::compress;
    use std::fmt;
    use super::*;

    struct TestStruct {
        field1: i32,
        field2: i32,
    }

    impl fmt::Display for TestStruct {
        fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
            write!(f, "field1: {}, field2: {}", self.field1, self.field2)
        }
    }

    #[test]
    fn test_decompress_success() {
        let test_data = TestStruct {
            field1: 10,
            field2: 10,
        };

        let bytes = test_data.to_string().as_bytes().to_vec();
        let original_size = bytes.len();
        let compressed = compress(&bytes).unwrap();

        let result = decompress(&compressed).unwrap();
        
        println!("original {:?}", bytes);
        println!("compressed {:?}", compressed);

        assert!(result.len() > compressed.len()); // this assertion is failing
        assert_eq!(original_size, result.len());
        assert_eq!(bytes, result);
    }
}

What data are you trying to compress? Is it actually compressible?

Any compression algorithm that decreases the length of some inputs has to also increase the length of some other inputs. Proof: pigeonhole principle (there are fewer short outputs available than there are inputs, so no lossless scheme can shrink every input).
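Here is a quick standalone sketch (using flate2 directly, the same way your functions do) that makes the overhead visible on a short input:

use std::io::prelude::*;
use flate2::Compression;
use flate2::write::ZlibEncoder;

fn main() -> std::io::Result<()> {
    // The 22-byte payload your test produces.
    let input = b"field1: 10, field2: 10";
    let mut encoder = ZlibEncoder::new(Vec::new(), Compression::fast());
    encoder.write_all(input)?;
    let compressed = encoder.finish()?;
    // The zlib header (2 bytes), Adler-32 checksum (4 bytes), and DEFLATE
    // block framing outweigh any savings on an input this short.
    println!("input: {} bytes, compressed: {} bytes", input.len(), compressed.len());
    Ok(())
}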

I am trying to serialize structs in Rust: create a string representation of a struct, then use these functions to compress and decompress that data. I am just experimenting; is it a futile thing to do? I am not sure.

Is there a better way to do it?

Show us your struct definitions. Also, are they in a vector or slice? That is, are you trying to compress many structs or just one? And how are you performing the conversion to Vec<u8>?

@kpreid I have updated the post with the unit test I have written

Okay, so you are compressing only 22 bytes. It is not surprising that the result is longer. Compressing short strings usually does not help, because:

  • Compression algorithms perform best on repetitive data. The only repeated substrings in your data are "field" and "10", each occurring only twice, so, loosely speaking, compression can remove at most 7 bytes.
  • Compressed data has to, in some way, tell the decompressor how to expand it. Even if there is nothing better to do than repeat the original data verbatim, you still need at least one bit somewhere to express "the following is literal data". And to handle repeated substrings, you need to be able to point into a table of them, or back at previous output, to say "emit that again".

Basically, compression always has some space overhead, and you are probably in a case where the overhead is bigger than the savings from compressing the data. Try compressing a much larger data set in order to see the benefits.
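For instance, a quick sketch reusing the compress function from your post, this time with a larger, repetitive input:

let big = "field1: 10, field2: 10\n".repeat(1000); // ~23 KB of repetitive text
let compressed = compress(big.as_bytes()).unwrap();
// With this much repetition, DEFLATE's back-references more than pay for
// the fixed zlib overhead, so the output is much smaller than the input.
assert!(compressed.len() < big.len());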

Recently I wrote a zip crate in which I decide whether to apply compression based on the size of the compressed output. I think the threshold was around 256 bytes, but you may well find a better value; do not hesitate to share it with us.
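In code, the idea is something like this sketch (reusing the compress function from above; the 256-byte threshold and the Stored type here are only illustrative):

enum Stored {
    Raw(Vec<u8>),
    Deflated(Vec<u8>),
}

fn maybe_compress(data: &[u8]) -> std::io::Result<Stored> {
    // Below the threshold, the fixed overhead is unlikely to pay off.
    if data.len() < 256 {
        return Ok(Stored::Raw(data.to_vec()));
    }
    let compressed = compress(data)?;
    // Keep the compressed form only if it is actually smaller.
    if compressed.len() < data.len() {
        Ok(Stored::Deflated(compressed))
    } else {
        Ok(Stored::Raw(data.to_vec()))
    }
}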

Note that DEFLATE includes an entropy coding step, which for small strings might take more space for the decoding table than the string itself.

If you do want to compress small things like this, you could consider using something else that doesn't do that, such as LZ4 (compression algorithm) - Wikipedia -- IIRC it expands incompressible data by only about 0.4%, so it might manage not to be a net loss even if all it saves is the repeat of "field".

There are also a bunch of options you can tweak. If you know what the data usually looks like, the frame format with a custom dictionary might let you save a fair amount of space. If not, you can use the block format directly to avoid the frame metadata overheads.
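As a sketch, assuming the lz4_flex crate (any LZ4 binding would do), the block format with a prepended size looks like this:

fn main() {
    let input = b"field1: 10, field2: 10";
    // Block format: no frame header; the uncompressed length is prepended
    // so the decompressor knows how much to allocate.
    let compressed = lz4_flex::compress_prepend_size(input);
    let restored = lz4_flex::decompress_size_prepended(&compressed).unwrap();
    assert_eq!(&restored[..], &input[..]);
    println!("{} -> {} bytes", input.len(), compressed.len());
}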

For serializing structs in Rust, look up the "serde" crate.
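A minimal sketch, assuming serde (with its derive feature) plus the bincode crate, which gives you a compact binary encoding instead of a Display string:

use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize, Debug, PartialEq)]
struct TestStruct {
    field1: i32,
    field2: i32,
}

fn main() {
    let value = TestStruct { field1: 10, field2: 10 };
    // With bincode 1's default fixed-width encoding, two i32 fields should
    // serialize to 8 bytes, already far smaller than the 22-byte string.
    let bytes = bincode::serialize(&value).unwrap();
    let back: TestStruct = bincode::deserialize(&bytes).unwrap();
    assert_eq!(value, back);
}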