5 hours to compile macro, what can I do?

I have a very large macro instantiation, which takes 4 or 5 hours to compile with Rust 1.40, so I'm looking to improve this unreasonably long build time.

The macro wasn't always this slow to build, it seems to have gotten slower with newer versions of Rust, although I haven't pinpointed the exact point it regressed, it was always still slower than I would have liked (I recall it was around 30 minutes in earlier versions, still uncomfortably long for new contributors). I have avoided modifying this file as long as I can to avoid the long recompilation on my system, but I plan to expand this macro in the near future, so it is time to finally tackle this compilation problem.

What can I do to reduce the multi-hour compile times for this macro?

Here is the file with the macro in question, for reference:

The big macro is define_blocks.

Here's what I have tried so far:

1. Reduce the size of the macro instantiation
If I remove all but two of the "blocks" (definitions which the macro matches against), then a release build completes in less than a second. So the compile time is definitely correlated with the size of what I pass to the macro. I haven't bothered to gather more data points (number of blocks x time to build) and graph the relationship (exponential?), but it is clearly related.

This is the first thing I tried, confirming the macro is the cause of the slowdown. But I actually need to pass all those definitions to the macro, because that's what it is for.

I suppose I could rewrite this code by hand to not use macros. This could be one potential fix for the problem but I have a lot of functionality invested into this macro and would like to keep for now it if possible. Which brings me to:

2. Enable macro expansion tracing
Using the nightly trace_macros feature, with trace_macros!(true) before define_blocks!().

Observing the trace output, the macro expansion completes in about 9 seconds. Not hours, so the problem must be elsewhere.

3. Profile the compiler
Building with the -Z time-passes rustc compiler flag described in Guide to Rustc Development. Using the same optimization level as in the default release profile:

blocks $ cargo clean ; time cargo rustc -- -C 'opt-level=3' -Z time-passes

Completed in 5 hours 9 minutes. The end of the output shows what is taking so long:

   time: 0.002	codegen passes [4gqa868n29dtlfbb]
    time: 0.652	LTO passes
    time: 0.075	codegen passes [473etbgnzq2x8af8]
    time: 3.476	LTO passes
    time: 4.218	LTO passes
    time: 4.255	codegen passes [22fyvf1uiha0mrri]
    time: 4.234	codegen passes [6ffwyc85r8r5jgh]
    time: 63.064	LTO passes
    time: 15574.417	codegen passes [2w3r8ppd44eh9kz5]
  time: 15707.329	LLVM passes
  time: 0.003	serialize work products
  time: 0.746	linking
time: 15771.815		total
    Finished dev [unoptimized + debuginfo] target(s) in 263m 19s

real	309m45.295s
user	198m44.124s
sys	14m54.290s

15574.417 seconds spent in "codegen passes" from LLVM. This sounds optimization-related.

4. Compare optimization levels
No optimization, finishes in less than 2 minutes:

cargo clean ; time cargo rustc -- -C 'opt-level=0' -Z time-passes
    Finished dev [unoptimized + debuginfo] target(s) in 1m 41s

Basic optimization, only 2 minutes:

cargo clean ; time cargo rustc -- -C 'opt-level=1' -Z time-passes
    Finished dev [unoptimized + debuginfo] target(s) in 2m 05s

"Some" optimizations (opt-level=2) ran for 3.7+ hours and consumed 20 GB of memory before I killed the process, so I estimate it takes on the same order of magnitude as full optimizations (opt-level=3) as in the release build (5.15 hours). Whatever is causing the slowdown occurs with the optimization passes enabled at optimization level 2+.

5. Count lines of LLVM IR generated
Found this handy cargo llvm-lines tool here: Improve compile time and executable size by counting lines of LLVM IR

When I run it on my project (in an debug build so it doesn't take forever), the top functions are:

   Compiling steven_blocks v0.0.1 (steven/blocks)
    Finished dev [unoptimized + debuginfo] target(s) in 1m 19s
 151828    2  core::ops::function::FnOnce::call_once
  43954  488  core::option::Option<T>::map
  24252  896  core::ptr::real_drop_in_place
  19669    1  steven_blocks::Block::get_model
  15834   39  alloc::raw_vec::RawVec<T,A>::reserve_internal
  14674    1  steven_blocks::Block::get_flat_offset
  12548    1  steven_blocks::Block::get_hierarchical_data
  12362    1  steven_blocks::Block::get_collision_boxes
  11195    1  steven_blocks::Block::get_model_variant
   9678    1  <steven_blocks::Block as core::fmt::Debug>::fmt

The getters in Block are all macro-generated code that matches on *self to insert arbitrary expressions specific to the "block", for example this is get_model:

            #[allow(unused_variables)]
            pub fn get_model(&self) -> (String, String) {
                match *self {
                    $(
                        Block::$name {
                            $($fname,)?
                        } => {
                            let parts = $model;
                            (String::from(parts.0), String::from(parts.1))
                        }
                    )+
                }
            }

19669 is a lot of lines of IR, but I don't think this is incorrect. There really are that many expressions (currently 300, one for each block definition), each of varying complexity.

A possible angle of attack could be to try to rewrite the code with the goal of reducing the number of IR lines generated. Open to suggestions of how to do this. One idea is to replace the huge match generated code with a data structure, such as a HashMap. Mapping the keys should be straightforward (the block identifiers), but the values are arbitrary expressions dependent on other variables defined in the block definition. Which is why this was all generated by a macro in the first place. Would like to preserve this macro as-is if I can.

6. Reduce optimization level in sub-crate manifest
As the crate compiles 153x faster with basic optimizations, I thought I would settle for reducing the optimization level from the default full optimizations for the release profile, by adding this to blocks/Cargo.toml:

[profile.release]
opt-level = 1

This worked... when I tested the crate by itself. However, the crate is nested in another project, and it turns out the manifest profile configuration is actually ignored except at the top-level, as documented in The Cargo Book:

Cargo supports custom configuration of how rustc is invoked through profiles at the top level. Any manifest may declare a profile, but only the top level package’s profiles are actually read. All dependencies’ profiles will be overridden. This is done so the top-level package has control over how its dependencies are compiled.

If I reduce the optimization level at the top-level package, then building is fast as expected, but I want to optimize everything except this blocks-macro package as much as possible (in fact my program won't run properly without full optimizations, it is too slow). Only the crate with the gigantic define_blocks! macro builds slowly, so that's all I need to reduce the optimization level for, but it fails due to this rule.

7. Reduce optimization level in sub-crate config profile
Can dependencies have their own separate opt-level? Not yet, but config profiles should allow this in nightly, and as I understand it, profile-overrides will be in Rust 1.41.

First attempt, I placed this file in blocks/.cargo/config:

[profile.release]
opt-level = 1

and executed with cargo +nightly build --release -Z config-profile. But using the profile-overrides feature should work come 1.41 and it does work well in nightly, by adding this to the top-level Cargo.toml:

cargo-features = ["profile-overrides"]
...
[profile.release.package.steven_blocks]
opt-level = 1

Finally, I have a decent workaround. I'll have to switch from stable to nightly, for the time being, and sacrifice the optimizations for this crate, but at least I can develop and iterate quickly.

Yet I am still left wondering, what exactly in LLVM optimization could be this slow? Is there anyway I can dig deeper into the 4 hour "codegen passes" phase? Is it possible to identify and disable or speedup the problematic optimization?

7 Likes

LLVM is not good at big functions, I have seen this before as well. One of the functions you are generating is almost 100,000 lines of Rust code including tons of internal control flow: starting here, repeated 300Ă— in the definition of VANILLA_ID_MAP. You should reorganize define_blocks to break this one function up into 300 individual functions and it will compile in minutes.

15 Likes

Perfect, thank you! Was hoping there was a straightforward refactoring possible to address this, and indeed breaking up the gigantic function covering all blocks, into multiple functions (one per block), addresses the issue.

In case anyone else runs across a similar problem, to summarize, the pathologically slow macro implementation was something like this:

        lazy_static! {
            static ref VANILLA_ID_MAP: VanillaIDMap = {
                $({
                    /* large amount of code for $name */
                })+
            };

This creates one huge function containing repeated code for each block, as seen with trace_macros ({...} {...}... ad infinitum). So I moved each block registration into their own function, in a new submodule, then call them all in lazy_static:

        mod block_registration_functions {
            use super::*;
            $(
                pub fn $name(...) {
                    /* large amount of code for $name */
                }
            )+
        }

        lazy_static! {
            static ref VANILLA_ID_MAP: VanillaIDMap = {
                $(
                    block_registration_functions::$name(...)
                )+
            };

(The complete code change: https://github.com/iceiix/stevenarella/pull/267/files#diff-ded23d9a49a226cef06ee7fe5e00ebb2)

This simple change allows the crate to build not in 5 hours, but 40 seconds! (450x as fast)

10 Likes