Serde De/Serializer impl with compression support for repeated vector elements

Does anyone know of a crate that provides all of the following:

  • A serde De/Serializer (i.e. data format) impl that
  • is roughly as succinct as JSON, and allows for de/serializing arbitrary data, but
  • when serializing a Vec<T> that has value repetitions (e.g. vec![2, 4, 4, 0, 0, 0, 0, 0, 1]) would serialize it in a compressed manner (e.g. [2, 4; 2, 0; 5, 1]), and also
  • can deserialize that compressed data back to Vec<T> or &[T].

Does anyone know if something like this already exists? The vectors I need to serialize are pretty huge and sparse, and using this form of compression for serialization can get a 1.3GB file down to perhaps 200-300MB. So it's definitely something I want to utilize.
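The run-length pairing described above can be sketched with just the standard library; the `rle_encode`/`rle_decode` names here are hypothetical helpers, not from any existing crate:

```rust
// Hypothetical sketch: collapse adjacent repeated values into
// (value, count) pairs, and expand the pairs back into a Vec.
fn rle_encode<T: PartialEq + Clone>(values: &[T]) -> Vec<(T, usize)> {
    let mut runs: Vec<(T, usize)> = Vec::new();
    for v in values {
        match runs.last_mut() {
            // Extend the current run if the value repeats...
            Some((last, count)) if last == v => *count += 1,
            // ...otherwise start a new run of length 1.
            _ => runs.push((v.clone(), 1)),
        }
    }
    runs
}

fn rle_decode<T: Clone>(runs: &[(T, usize)]) -> Vec<T> {
    runs.iter()
        .flat_map(|(v, n)| std::iter::repeat(v.clone()).take(*n))
        .collect()
}

fn main() {
    let data = vec![2, 4, 4, 0, 0, 0, 0, 0, 1];
    let runs = rle_encode(&data);
    assert_eq!(runs, vec![(2, 1), (4, 2), (0, 5), (1, 1)]);
    assert_eq!(rle_decode(&runs), data); // round trip is lossless
    println!("{runs:?}");
}
```

A real serde data format would emit these pairs inside its serializer rather than materializing an intermediate Vec, but the size win comes from the same transformation.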


I wonder if using a binary serialization format like bson (a superset of JSON, so it fits your second requirement) together with a compression algorithm like gzip would be good enough to reach your goal of 200-300MB?


Gzip can definitely do 5x on moderately repetitive data, and maybe even 10x on strongly, highly, deeply, repetitively redundant data.


I just tried plain gzipping a 3.2GB file. It got compressed down to 12MB, a compression ratio (266.67 : 1) I've only ever seen before with procedurally generated content.
So in that sense it's a pretty great solution.

However, getting there takes a grand total of 10s to compress and about 7.6s to decompress, the latter of which is pretty undesirable. I tried fiddling with the compression level, but at -1 (the lowest valid value) the file only compresses down to 34MB, with a compression time of about 6.6s and a decompression time of about 6.7s.

Those decompression times are too long for my liking, as ideally I'm looking for < 1s in that department.

That said, I don't know what the performance characteristics of a De/Serializer for the serde data format I mused about above would be; perhaps it would perform similarly.

If no such crate exists at this time then it looks like I'll have to write it.

Does rle_vec provide what you need? At a quick glance, it looks like it might, because it serializes as a sequence of runs, but it might be too verbose in its description of runs (since they'll be objects in the JSON, not pairs).

Try with Zstd instead.


Thank you for the suggestion. It definitely looks like it would be a nice representation for runs of values (as the lib calls it). And it does have a serialization feature flag. But I'll need to check how that actually serializes.

I tried this immediately, since it's pretty much s/gzip/zstd/. The results were striking enough that I repeated the experiment a couple of times, just to convince myself.

After compression the *.zst file is 1.8MB to the original's 3.2GB.
That's a compression ratio of 1777.78 : 1 ...
Compression time: 0.5s.
Decompression time of the *.zst file: 1.6s.

The efficiency of Zstd is ridiculous.


That's astounding! For my website (https://firebits.io) I got compression ratios of about 20:1 when I ran tests around a year ago, but that's because I'm compressing JSON without much repetitive data.

Ultimately what killed the viability of this is the fact that indexing goes from O(1) to O(log N).

What I'm doing is part of lowering runtime of an algorithm from O(log N) to O(1), so anything that asymptotically bumps that back up is a no-go.

Still, cool crate.
