Which compression crate should I use?

I'm revamping the include_dir crate and one feature I'd like to implement is automatic compression of all files that are embedded in your binary.

Are there any pure Rust compression crates you would recommend?

There are a bunch of crates on crates.io, but I'm not sure which one would best suit my goals:

  • Compress and decompress - quite a few crates only implement compression
  • Trivial to cross-compile - I don't want my users to mess around with setting up a suitable cross-compiling C toolchain
  • Low impact on build times - ideally it wouldn't have many (or any) dependencies

In terms of performance, I'll be using this compression library to compress files inside a procedural macro and lazily decompress the data once at runtime. That means it's okay for decompression to take a bit longer, but compressing should be pretty quick so builds don't take forever.


Is there a way to hack the procedural macro so that it checks the directory's modification time against a cached .bz2, so that:

  1. on changes, it runs tar ... | bzip2 ...
  2. when the .bz2 is newer than the dir, it just reduces to an include_bytes! of the existing cached .bz2?

It depends how hacky we want to be. Procedural macros don't get an $OUT_DIR like crates with a build script, so you would need to stash the tarball in /tmp or something... I'd prefer to avoid things like that, though, because it would add a lot of extra complexity. The entire crate, including the runtime code and the procedural macro, is maybe 600 lines in total :sweat_smile:
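For what it's worth, the freshness check itself is straightforward with just the standard library. Here's a hedged sketch (newest_mtime and cache_is_fresh are hypothetical helpers, not part of include_dir; finding a stable place to stash the cache is the hard part):

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::time::SystemTime;

/// Return the newest modification time found anywhere under `dir`.
/// (Illustrative only; real code would also want to handle symlinks.)
fn newest_mtime(dir: &Path) -> io::Result<SystemTime> {
    let mut newest = fs::metadata(dir)?.modified()?;
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let meta = entry.metadata()?;
        let mtime = if meta.is_dir() {
            newest_mtime(&entry.path())?
        } else {
            meta.modified()?
        };
        newest = newest.max(mtime);
    }
    Ok(newest)
}

/// True when the cached archive is at least as new as everything in `dir`.
fn cache_is_fresh(dir: &Path, cache: &Path) -> io::Result<bool> {
    match fs::metadata(cache) {
        Ok(meta) => Ok(meta.modified()? >= newest_mtime(dir)?),
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(false),
        Err(e) => Err(e),
    }
}
```

Note the >= rather than >: filesystems with coarse timestamp resolution can give the archive and the sources identical mtimes even when the archive was written afterwards.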

I'll probably steer away from libraries which store everything in a single blob (e.g. tar balls and zip archives) because that would require switching a lot of the runtime crate's internals depending on whether a hypothetical compression feature flag is enabled.

For more context, you can think of the include_dir!() macro as generating a literal like this:

Dir {
  path: "",
  children: &[
    DirEntry::Dir(Dir {
      path: "src",
      children: &[
        DirEntry::File(File {
          path: "src/lib.rs",
          contents: include_bytes!("src/lib.rs"),
        }),
      ],
    }),
    DirEntry::File(File {
      path: "README.md",
      contents: include_bytes!("README.md"),
    }),
  ],
}
When the compression feature is enabled, I want File to store a once_cell::sync::Lazy in the contents field instead of a &[u8]. That way the data is lazily decompressed the first time you call File's contents() getter.
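A minimal sketch of that shape, using std::sync::OnceLock from the standard library as a stand-in for once_cell::sync::Lazy, and a placeholder decompress function (both are illustrative assumptions, not the crate's actual API):

```rust
use std::sync::OnceLock;

/// Hypothetical stand-in for the real decompression routine.
/// Here it's just the identity function, for illustration only.
fn decompress(compressed: &[u8]) -> Vec<u8> {
    compressed.to_vec()
}

struct File {
    path: &'static str,
    compressed: &'static [u8],
    contents: OnceLock<Vec<u8>>,
}

impl File {
    const fn new(path: &'static str, compressed: &'static [u8]) -> Self {
        File { path, compressed, contents: OnceLock::new() }
    }

    /// Decompress on first access, then hand out the cached bytes.
    fn contents(&self) -> &[u8] {
        self.contents.get_or_init(|| decompress(self.compressed))
    }
}
```

Because OnceLock::new() (like Lazy::new()) is a const fn, the macro can still expand to a plain static literal; the decompression cost only gets paid on the first call to contents().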


I don't have any experience with the Rust libraries, but I would probably go with snap myself, based on who the crate author is: it's a native Rust implementation and it's a fast algorithm.


Have you heard of zstd?

Only problem is that it isn't pure Rust :confused:

Yeah, that's a bit of an issue for me because it makes cross-compiling a lot harder.


Not sure if I would recommend it, but you may like to consider my (one and only published so far) crate, flate3.
It implements RFC 1951 (DEFLATE) compression and decompression. I think it works pretty well; I believe it outperforms flate2, and it is pure Rust.


Another possible option: The brotli crate is implemented in pure safe Rust with very few dependencies. It is developed and used by Dropbox; some development details here.


What is going on with the float64 feature?

Also, if I am reading src/enc correctly, there is more than 2 MB worth of hard-coded data?

My understanding is that the brotli crate is an almost direct port of their C++ library to Rust, which would explain why the code looks weird.

I've got no idea what's up with that float64 feature - it feels a lot like the typedefs every C library uses to name its own number types because people don't know about stdint.h. Switching primitives with a feature flag also sounds like a great way to silently break downstream users...

They probably calculated lookup tables ahead of time. You see that stuff all the time where the author is trading space for performance.


Brotli uses a pre-defined dictionary to aid in text compression. (Most similar compression algorithms build such a dictionary on the fly based on the input stream. Using a pre-populated dictionary achieves greater compression in many common cases.)


Quoting that wikipedia article:

Unlike most general purpose compression algorithms, Brotli uses a pre-defined dictionary, roughly 120 KiB in size, in addition to the dynamically populated ("sliding window") dictionary. The pre-defined dictionary contains over 13000 common words, phrases and other substrings derived from a large corpus of text and HTML documents.[7][3] Using a pre-defined dictionary has been shown to increase compression where a file mostly contains commonly used words.[8]

So they took a crawl of the internet, figured out the optimal dictionary for that crawl, then hard-coded it into the encoder/decoder, so that compressed files can use the dictionary without shipping it as part of the compressed file? Interesting decision.


Well, Brotli was primarily designed for Google to compress its own traffic, so... it makes sense for them.

