Does anyone know of a crate that provides all of the following:
A serde De/Serializer (i.e. data format) impl that
- is roughly as succinct as JSON, and allows for de/serializing arbitrary data, but
- when serializing a Vec<T> that has value repetitions (e.g. vec![2, 4, 4, 0, 0, 0, 0, 0, 1]) would serialize it in a compressed manner (e.g. [2, 4; 2, 0; 5, 1]), and also
- can deserialize that compressed data back to a Vec<T> or &[T].
Does anyone know if something like this already exists? The vectors I need to serialize are pretty huge and sparse, and using this form of compression for serialization can get a 1.3GB file down to perhaps 200-300MB. So it's definitely something I want to utilize.
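To make the idea concrete, here's a minimal sketch of the kind of run-length representation I have in mind, written as a hypothetical wrapper type (the name and the (count, value) pair shape are just placeholders; the real thing would be its own data format rather than a wrapper):

```rust
use serde::{Deserialize, Deserializer, Serialize, Serializer};

/// Hypothetical wrapper type: a Vec<T> that serializes as a
/// sequence of (count, value) runs instead of element-by-element.
struct Runs<T>(Vec<T>);

impl<T: Serialize + PartialEq> Serialize for Runs<T> {
    fn serialize<S: Serializer>(&self, serializer: S) -> Result<S::Ok, S::Error> {
        // Fold consecutive equal elements into (count, value) pairs.
        let mut runs: Vec<(u64, &T)> = Vec::new();
        for v in &self.0 {
            match runs.last_mut() {
                Some((n, last)) if *last == v => *n += 1,
                _ => runs.push((1, v)),
            }
        }
        runs.serialize(serializer)
    }
}

impl<'de, T: Deserialize<'de> + Clone> Deserialize<'de> for Runs<T> {
    fn deserialize<D: Deserializer<'de>>(deserializer: D) -> Result<Self, D::Error> {
        // Expand each (count, value) pair back into `count` copies of `value`.
        let runs: Vec<(u64, T)> = Vec::deserialize(deserializer)?;
        let mut out = Vec::new();
        for (n, v) in runs {
            out.extend(std::iter::repeat(v).take(n as usize));
        }
        Ok(Runs(out))
    }
}
```

With serde_json as a stand-in backend, the example vector above would come out as [[1,2],[2,4],[5,0],[1,1]] rather than nine separate elements.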
I wonder if using a binary serialization format like BSON (a superset of JSON, so it fits your second requirement) with a compression algorithm like gzip would be good enough to reach your goal of 200-300MB?
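Something along these lines, sketched with the flate2 crate and with serde_json standing in for the `bson` crate just to keep the example short (crate choice and the stand-in data are my assumptions):

```rust
use flate2::{read::GzDecoder, write::GzEncoder, Compression};
use std::io::{Read, Write};

fn main() -> std::io::Result<()> {
    // Stand-in for the real sparse data.
    let data: Vec<u32> = vec![0; 1_000_000];

    // Serialize (the `bson` crate would slot in here instead of serde_json),
    // then gzip the bytes. Compression::new(1..=9) trades ratio for speed.
    let bytes = serde_json::to_vec(&data)?;
    let mut enc = GzEncoder::new(Vec::new(), Compression::default());
    enc.write_all(&bytes)?;
    let compressed = enc.finish()?;

    // Round-trip back to the original to check nothing was lost.
    let mut dec = GzDecoder::new(&compressed[..]);
    let mut bytes = Vec::new();
    dec.read_to_end(&mut bytes)?;
    let roundtrip: Vec<u32> = serde_json::from_slice(&bytes)?;
    assert_eq!(data, roundtrip);
    Ok(())
}
```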
I just tried plain gzipping a 3.2GB file. It got compressed down to 12MB, which is a compression ratio (266.67 : 1) I've only ever seen before with procedurally generated content.
So in that sense it's a pretty great solution.
However, getting there takes a grand total of 10s to compress, and about 7.6s to decompress, the latter of which is pretty undesirable. I tried fiddling with the compression level, but at -1 (the lowest valid value) the file compresses down to 34MB, with a compression time of about 6.6s and a roughly matching decompression time of about 6.7s.
Those decompression times are too long for my liking, as ideally I'm looking for < 1s in that department.
That said, I don't know what the performance characteristics of a De/Serializer for the serde data format I mused about above would be; perhaps it would perform similarly.
If no such crate exists at this time then it looks like I'll have to write it.
Does rle_vec provide what you need? At a quick glance, it looks like it might, because it serializes as a sequence of runs, but it might be too verbose in its description of runs (since they'll be objects in the JSON, not pairs).
Thank you for the suggestion. It definitely looks like it would be a nice representation for runs of values (as the lib calls them). And it does have a serialization feature flag, but I'll need to check how that actually serializes.
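Something like this quick check is what I have in mind. I'm assuming here that RleVec implements FromIterator and, with the serde feature flag enabled in Cargo.toml, Serialize; I haven't verified the exact output shape yet:

```rust
use rle_vec::RleVec; // assumes the crate's serde feature is enabled

fn main() {
    // Assumption: RleVec<T> implements FromIterator<T> and Serialize.
    let rle: RleVec<i32> = vec![2, 4, 4, 0, 0, 0, 0, 0, 1].into_iter().collect();
    // If runs serialize as objects rather than (count, value) pairs,
    // this will show it.
    println!("{}", serde_json::to_string(&rle).unwrap());
}
```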
I tried this immediately, because it's pretty much s/gzip/zstd/ on my earlier experiment.
So... It's a good thing I performed the experiment myself, and repeated it a couple of times, just to convince myself.
After compression the *.zst file is 1.8MB, compared to the original's 3.2GB.
That's a compression ratio of 1777.78 : 1 ...
Compression time: 0.5s.
Decompression time of the *.zst file: 1.6s.
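For reference, the in-process equivalent of that round trip with the zstd crate is about as short as it gets (file names are placeholders, and 3 is zstd's default compression level):

```rust
use std::fs::File;
use std::io::Write;

fn main() -> std::io::Result<()> {
    // Placeholder file names; level 3 is the zstd default.
    let compressed = zstd::encode_all(File::open("data.bin")?, 3)?;
    File::create("data.bin.zst")?.write_all(&compressed)?;

    // Decompress and check we got the original bytes back.
    let original = std::fs::read("data.bin")?;
    let decompressed = zstd::decode_all(File::open("data.bin.zst")?)?;
    assert_eq!(original, decompressed);
    Ok(())
}
```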
That's astounding! For my website (https://firebits.io) I got compression ratios of about 20:1 when I ran tests around a year ago, but that's because I was compressing JSON without much repetitive data.