Unmanageable compile time with large amounts of generated code

I'm building a system that relies heavily on compile-time code generation, via the syn and quote crates. Compile time has become unmanageable when the amount of generated code is very large (I've seen up to ~2-3 hours).

The bottleneck appears to be release mode optimizations.

  • All logic in our macro crate (i.e., all code that I wrote) finishes relatively quickly, so whatever is taking a long time is occurring after this.
  • In debug mode, compile time for the same program is generally much lower.
  • When the code generation step is removed (i.e., pasting the output of cargo expand and compiling from there), compilation is sometimes much faster, though this does appear to be inconsistent.

I considered the impact of "long" functions, branching, and/or function calls (for the latter -- explicitly inlining more things did seem to help). I did some research on build customizations and attributes, but parsing through these went beyond my baseline understanding of the Rust compiler.

I would appreciate any thoughts on what could be going on here, recommended resources, and/or what I should look into. Thank you so much!

You might be interested in tools like dtolnay/cargo-llvm-lines on GitHub, which counts lines of LLVM IR per generic function -- maybe you're instantiating more than you really need.

And in order to reduce the amount of such monomorphized code, you can try to move code out of generic functions into non-generic functions when the code doesn't actually depend on the generic parameter. This won't help much if the function in question gets inlined, but when it isn't, it can prevent the same function from going through the codegen phase (including optimization) more than once unnecessarily.
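
A minimal sketch of that pattern (the function names here are made up for illustration); the standard library uses the same trick internally for path-taking functions like std::fs::read:

```rust
// Generic shell: monomorphized once per `S`, but it only does the conversion.
pub fn count_nonempty_lines<S: Into<String>>(input: S) -> usize {
    // Non-generic worker: goes through codegen and optimization once,
    // no matter how many types `S` is instantiated with.
    fn inner(input: String) -> usize {
        input.lines().filter(|line| !line.trim().is_empty()).count()
    }
    inner(input.into())
}
```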


But I think the first thing you should check, to decide what to pursue, is cargo build --timings (see the invocation after the list below). Provided that the relevant crate is a library, not a binary, this will report the time the compiler spends in two different phases:

  1. parsing, macro expansion, type checking, trait solving, borrow checking, and constant expression evaluation
  2. optimization and machine code generation
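
For reference, the invocation; recent versions of Cargo write the report as an HTML file under target/:

```sh
cargo build --release --timings
# report lands at target/cargo-timings/cargo-timing.html
```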

Since you say that optimization seems to make a big difference, most likely phase 2 is much longer, but it can't hurt to check.

A finer-grained breakdown can be obtained from cargo +nightly rustc -- -Ztime-passes, but I don't know much about how to usefully interpret that.


Some other things to try:

  • Change the release profile to use opt-level = 2 instead of the default opt-level = 3. I have heard that this may be faster to compile without producing much worse results. (You can do this on a per-package basis if you like; see the Cargo.toml sketch after this list.)
  • If you've been using #[inline], try not doing that. Maybe even try adding some #[inline(never)].
  • Break up your longest function into multiple functions.
  • Break up the crate into multiple crates. This can help even if one depends on the other, if the codegen phase is taking up most of the time.
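
For the first bullet, a minimal Cargo.toml sketch ("my-generated-crate" is a placeholder for whichever package holds the generated code):

```toml
# Option A: trade a little runtime performance for compile time, build-wide.
[profile.release]
opt-level = 2

# Option B: override only the package with the generated code; every other
# package keeps the default opt-level = 3.
[profile.release.package.my-generated-crate]
opt-level = 2
```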

Some of the things that the compiler has to do require worse-than-O(n log n) algorithms, so if you can split the work into separate parts, it may complete faster overall despite the overhead of having more parts.


Okay, a few months ago I had a problem with excessive cross-compilation time. It was a lot less complicated than your compiler macros, but I still ended up at 2h+ per binary, and the only real problem was that, well, I was going to write a lot more code than just that one binary over the next few months...

Also, I cross-compile to two different platforms, so in total each binary is built three times: once for the CI host for testing and once for each target platform. The previous setup, relying on Cargo and GitHub, was a non-starter.

After some back and forth, here is what I did:

  1. Migrated my repo from Cargo to Bazel
  2. Separated large crates into smaller crates, since Bazel hashes per crate (see the BUILD sketch after this list)
  3. Added a remote cache, that is, compiled artifacts are stored in a remote cache and simply loaded from the cache when nothing has changed.
  4. Added remote build execution, which compiles the entire project on a cluster with 80 CPU cores. Thanks to BuildBuddy.com, this was surprisingly easy.
  5. Redesigned the dependency graph so that each binary follows a path through the graph that is disjoint from all other pathways
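
To make steps 1-2 concrete, here is a minimal sketch of one crate's BUILD file using rules_rust (all names are hypothetical):

```python
# codegen_core/BUILD.bazel -- each crate gets its own target, so Bazel
# can hash and cache it independently of the rest of the workspace.
load("@rules_rust//rust:defs.bzl", "rust_library")

rust_library(
    name = "codegen_core",
    srcs = glob(["src/**/*.rs"]),
    edition = "2021",
)
```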

I cannot stress enough how much of a lifesaver BuildBuddy has become over time because, while Bazel is already quite fast, with proper remote build and remote cache it just flies at the speed of light.

My mono-repo vendors all dependencies, so in total we're talking about 400 MB of source code, with my own code coming to close to a hundred crates. It's a lot; not the biggest in the world, but already enough that you don't even want to do full rebuilds with Cargo anymore.

These days, with Bazel and BuildBuddy, a full rebuild (i.e., when a new Rust version comes out) takes about 8 minutes, and an incremental build is usually done within two minutes. In both cases, the build protocol is compile, unit test, integration test, container build, and container publish. And all binaries are built with opt-level 3 for two different platforms...
The speedup comes largely from parallel execution, which comes from the disjoint dependency pathways mentioned earlier. And Bazel / BuildBuddy reliably caches everything, so that is a big one too.

In my experience, when you chunk the large crates that take the longest to compile into a larger number of small crates, and design your dependency graph so that each output artifact runs through a disjoint path, you may see a substantial speedup even with Cargo. The build tool matters a lot less for total compile time than a poorly designed dependency graph does.

This idea goes both ways: a slow dependency graph is basically a linearly ordered sequence of gigantic crates where the first one blocks all the others. Obviously, that build takes hours.

A fast dependency graph is one where each output target has its own path through the graph, with little to no intersection with the other pathways. This is critical, because when you change something in pathway A, no other pathway and no other output target gets rebuilt and retested, and less compilation is always faster compilation. The other big idea is that a multi-path build graph is trivial to parallelize, since the order between pathways doesn't matter.
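
To make the two shapes concrete (crate names hypothetical):

```text
slow: one chain, the first crate blocks everything behind it
  core -> framework -> services -> bin_a, bin_b, bin_c

fast: disjoint paths, the binaries build and test in parallel
  util_a -> bin_a
  util_b -> bin_b
  util_c -> bin_c
```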

Also, I read a while ago that the Cargo / compiler team is actually working on parallel compilation, but so far only on Linux, and you have to enable it with a flag. Do some searching online, because that may help you.
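
If you want to experiment with it, the flag as shown in the nightly parallel front-end announcement looks like this (nightly-only; check the current docs, since details may have changed):

```sh
RUSTFLAGS="-Z threads=8" cargo +nightly build --release
```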

I understand that a lot of people don't like Bazel for any number of reasons, and that's okay, but the observation remains true: a lot of small crates build much faster in parallel than a small number of big crates that usually can't be parallelized. With some proper redesign of your dependency graph, you may get a meaningful speedup with Cargo without doing much more than some targeted code refactoring.
