How to debug rustc memory usage spikes

I have some generated Rust code that makes heavy use of enums, traits, bitflags, match, etc., and it is still in development, so many more variants will be added to all the enums and match statements. Recently my additions resulted in rustc running out of memory on my development laptop (Linux, with only 16 GiB), and I've resorted to using my desktop, which has plenty of capacity for now. What I can observe there is that rustc spikes to 15 GiB once or twice in short succession while compiling just my package alone (after all deps are completed). I'm looking to figure out why that is; ideally I can adapt my code to prevent it from getting quadratically worse as the enums grow larger, or generate a smaller test case for reporting an issue. (Here's the current code.)

I've tried some of the debugging advice I've found around the web, such as https://users.rust-lang.org/t/rustc-memory-usage/55513: RUSTC_LOG=info, /usr/bin/time -v, and -Z time-passes on nightly.

The RUSTC_LOG=info output around that time looks like this:

┐rustc_mir_transform::ctfe_limit::run_pass
┘
┐rustc_trait_selection::traits::project::normalize_with_depth_to depth=0, value=context::flags::_::InternalBitFlags
┘
┐rustc_trait_selection::traits::project::normalize_with_depth_to depth=0, value=<context::flags::ContextBits2 as bitflags::__private::PublicFlags>::Internal
├─┐rustc_trait_selection::traits::project::project obligation=Obligation(predicate=AliasTy { substs: [context::flags::ContextBits2], def_id: DefId(20:133 ~ bitflags[3ccf]::traits::PublicFlags::Internal) }, depth=0)
│ ├─┐rustc_trait_selection::traits::project::normalize_with_depth_to depth=1, value=<context::flags::ContextBits2 as bitflags::__private::PublicFlags>
│ ├─┘
├─┘
┘
┐rustc_mir_transform::ctfe_limit::run_pass
┘
┐rustc_mir_transform::ctfe_limit::run_pass
┘
┐rustc_mir_transform::ctfe_limit::run_pass
┘
┐rustc_mir_transform::ctfe_limit::run_pass
┘
┐rustc_mir_transform::ctfe_limit::run_pass
┘
┐rustc_mir_transform::ctfe_limit::run_pass
┘

and then continues with more MIR passes afterwards.

With -Z time-passes, the output pauses for a while during the memory spike, and then the next lines are:

time:   0.317; rss: 1867MB -> 1893MB (  +26MB)  monomorphization_collector_graph_walk
time:   0.018; rss: 1893MB -> 1893MB (   +0MB)  partition_and_assert_distinct_symbols
time:  25.103; rss: 2498MB -> 1893MB ( -605MB)  generate_crate_metadata
time:   3.186; rss: 1899MB -> 2769MB ( +869MB)  codegen_to_LLVM_IR
time:   3.199; rss: 1893MB -> 2769MB ( +876MB)  codegen_crate

And all I get from /usr/bin/time -v is that peak memory usage reached 12-15 GiB.

It's hard to tell for sure what's going on from these. Are there better ways to profile the compiler that can pin down the cause, or am I overlooking something in one of these tools?

I've just found the setting that accounts for the difference between dev and release builds; adding this alleviates the issue:

[profile.dev]
overflow-checks = false
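
For anyone following along: all this toggles is whether debug-build arithmetic panics on overflow or silently wraps. A tiny illustration (not from my code):

fn bump(x: u8) -> u8 {
    // With overflow-checks = true (the dev default) this panics when x == 255;
    // with overflow-checks = false (the release default) it wraps around to 0.
    x + 1
}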

Try reducing the number of times you call clone, and think about how many simultaneous copies of each data point you need. Maybe Arc could be a good fit here if you need lots of cheap clones.

https://github.com/search?q=repo%3AZannick%2Flogic-graph%20clone&type=code

If overflows are causing issues, you could identify where you're doing arithmetic operations and switch them to saturating, checked, or wrapping ops instead of panicking ones. Given your discovery about the overflow checks, you might benefit a lot from looking into where those operations happen and replacing them with something more specific and fault-tolerant.

Try reducing the number of times you call clone, and think about how many simultaneous copies of each data point you need. Maybe Arc could be a good fit here if you need lots of cheap clones.

Hey, thanks for the advice. I'm not sure I understand how this applies to the overflow checks or whether it's general advice. Either way, most of these clones are forks; most of the time I'm either checkpointing from a mutable object that will continue being mutated, or forking an object into multiple different objects that will each be mutated in different ways (I've tried Cow for that, but it has performance costs I don't want).

If overflows are causing issues, you could identify where you're doing arithmetic operations and switch them to saturating, checked, or wrapping ops instead of panicking ones. Given your discovery about the overflow checks, you might benefit a lot from looking into where those operations happen and replacing them with something more specific and fault-tolerant.

My motivation around compiling and performance in this project is that I want release to run fast and dev to compile fast. Arithmetic overflows themselves are not causing issues so much as some compiler behavior that overflow-checks = false disables. While I have made mistakes with underflows from subtracting numbers in the wrong order, there should logically be no overflows or underflows, so it's okay that these are not saturating/checked/panicking in release. In dev it's nice to catch them, so I wouldn't mind switching to saturating ops instead of the panicking ones if they're compiled to unchecked in release. But I don't know which ones are causing problems.

Separately, my motivation in this post is to narrow down the cause of the compiler memory usage so I can write a better bug report. I'm concerned that it's trait-related and somehow exacerbated by bitflags, rather than simply arithmetic overflows. I found a similar issue opened recently: Cargo check runs forever and memory usage grows unboundedly · Issue #116914 · rust-lang/rust · GitHub

What is monomorphization? Maybe if we understood it better, we could figure out how to fix the problem.

One possible quick-and-dirty fix for your issue would be to try using fewer bits per ContextFlags object, because maybe the combinatorics of 32 options are too much. Does the problem still happen if you use u8 so that each ContextFlags macro call is smaller? That might break the compiler's work into a larger number of smaller combination spaces and let it cope better.

Presumably the issue is here in these macro calls. You don't really need to shift by 0 for the least significant bit, though that shouldn't make a difference. But maybe if you hand-write the shifted values as binary literals instead of using the shift operator, the compiler wouldn't need to double-check the shifted bits to make sure they don't overflow u32?

For example:

bitflags! {
    // Attributes can be applied to flags types
    #[repr(transparent)]
    #[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
    pub struct Flags: u32 {
        const A = 0b00000001; // no shift operation to check for overflow
        const B = 0b00000010;
        const C = 0b00000100;
    }
}

Thanks for the suggestions. I'm pretty sure these aren't the problem, and having now implemented them I can tell you there's no effect. (Well, turning the bitflags structs into u8s reduced the peak memory usage from 33432 MiB to 33302 MiB, but that's practically a margin of error.) I expect the compiler is constant-folding the 1 << 0 et al. anyway, so I'm not surprised that changing them into literal constants didn't affect this. As for the flags, I'm guessing the way the macro works is that it creates constants of the struct type that can be |ed together. So if the overflow checks are on the Context struct where the bitflags are used, then what matters is not the number of constants per struct but the number of them overall.

Is there a way to tell which overflow checks are using so much time and memory? My next experiment will be to add a bunch more SpotIds only, then more bitflags only, and see which causes the jump.

Dang, I’m sorry that the last ideas didn’t work!

If the compiler is doing too much work because of the sheer number of bit flags, what happens if you enable only half of them? Or could you add flags back progressively until you hit the number where the compiler problem appears?

If that seems likely, changing some of these into runtime variables could fix the issue by simplifying the constants. Maybe it's a combinatorial explosion because the compiler treats each combination of flags as a different type or something. I don't know how many flags people typically use, so I don't know whether this is too many.

Focusing on overflows instead of flags: I found some arithmetic, so what happens if you use checked_sub etc. for these operations later in the file?

Floats don't have checked ops, sadly, but you could write your own helper and check is_finite. Maybe switching to checked/saturating ops will fix it.
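
Something like this, just as a sketch (the function names here are made up, not taken from your code):

// Integer arithmetic: saturating_sub clamps at zero instead of panicking
// (debug) or wrapping (release) on underflow.
fn spend_energy(balance: u32, cost: u32) -> u32 {
    balance.saturating_sub(cost)
}

// checked_sub returns None on underflow so the caller decides what to do.
fn try_spend_energy(balance: u32, cost: u32) -> Option<u32> {
    balance.checked_sub(cost)
}

// Floats have no checked ops, but a small helper can validate the result.
fn checked_sub_f32(a: f32, b: f32) -> Option<f32> {
    let r = a - b;
    if r.is_finite() { Some(r) } else { None }
}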

I'm not really sure what you're referring to about turning the bitflags on or off. They aren't being monomorphized exponentially as far as I know; basically I'm defining some constants that are going to be used as data. Instead of defining 32 bools in my struct, I can define a bitflags object with 32 flags, and it's basically just a u32. It's not easy to remove their definitions at this point because they're used throughout. But what I can do is add more stuff or change some stuff: a) add more bitflags, b) move some flags out of the bitflags (tracking the information in an i8 instead of in a bit, so it actually takes more space in the Context), c) add more enum variants. (Edit: Having run these tests quickly just now, adding more enum variants to SpotId has a nontrivial impact, so I should probably look into the EnumMap usage and match statements.)

The checked/saturating stuff in spend and collect is an interesting idea. I will have to try that (though that should only affect things I store as integers).

Hmm, maybe you're right; alright, let's keep digging. My thought was just that moving some of this from compile-time code to runtime data could simplify things.

Could it be here? Expectation is a massive enum and this function processes vectors of them. Perhaps if you made this function process only one expectation at a time, the compiler could handle just that, and a separate function could map the handle-one-expectation function over a vector. That might remove the possibility of nonlinear interactions between expectations.
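
Roughly this shape (a sketch only; check_one, check_all, and the placeholder variants are invented, not from your generated code):

// Placeholder standing in for the real, much larger Expectation enum.
enum Expectation {
    HasItem(u32),
    EnergyAtLeast(i32),
    // ...many more variants in the real code
}

// Handle exactly one Expectation per call, so each match is compiled on its own...
fn check_one(exp: &Expectation) -> bool {
    match exp {
        Expectation::HasItem(_) => true,
        Expectation::EnergyAtLeast(_) => true,
        // ...one arm per variant
    }
}

// ...and keep the vector-processing function trivial.
fn check_all(exps: &[Expectation]) -> bool {
    exps.iter().all(check_one)
}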

Another idea could be to split the Expectation enum into two chunks, since you have a dividing comment in the middle; maybe it's just too big.

Since you mentioned SpotId, I did a search, and found this:

I think you found the issue: this enum might just be way too huge, and the compiler is screaming for mercy because every time you check what location you're in, you have to pattern-match against every possible location in the game.

What if you did a heroic refactor to convert this from classification (all the SpotIds in one big enum) into categorization (some levels of nesting so that you can focus on the local spots within each region without worrying about distant SpotIds)?

You could have AmagiSpots, AntarcticaSpots, EbihSpots, etc., and each decision would then be much less of a combinatorial explosion because you would never need to think about the spots from a different region. I'm not sure how that leads to "overflow", though; could it mean memory overflow instead of numerical overflow?
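
Roughly this kind of nesting (a sketch; the variant names are invented):

// One small enum per region...
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
pub enum AmagiSpot {
    MainArea,
    WestLake,
    // ...
}

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
pub enum EbihSpot {
    BaseCamp,
    Waterfall,
    // ...
}

// ...and one top-level enum with a variant per region, so any given match
// only has to consider the spots within one region.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
pub enum SpotId {
    Amagi(AmagiSpot),
    Ebih(EbihSpot),
    // ...
}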

It's possible the SpotIds here are interacting exponentially and the compiler internally has some number type that's overflowing just from the sheer number of SpotId pairs. Are there any places where you pattern-match on multiple SpotIds at the same time?

Anyway, maybe what you need to do is figure out how to break SpotId into a decision tree, so you can choose a region, choose a location, choose a spot, and reduce the number of variants you're matching on at any one time. Something like this trait-based sketch, maybe:

use std::marker::PhantomData;
use std::sync::Arc;

pub struct Identifier<T>(Arc<str>, PhantomData<T>);

pub trait Id: Sized {
    fn id(&self) -> Identifier<Self>;
}

pub trait Region: Id {
    type Zones;
    fn zones(&self) -> Self::Zones;
}

pub trait Zone: Id {
    type Region: Region;
    type Spots;
    fn region(&self) -> Self::Region;
    fn spots(&self) -> Self::Spots;
}

pub trait Spot: Id {
    // Maybe unnecessary to bound Region here; the Zone bound below already
    // ties the zone's Region back to this one.
    type Region: Region;
    type Zone: Zone<Region = Self::Region>;
    fn region(&self) -> Self::Region;
    fn zone(&self) -> Self::Zone;
}

// Renamed from `Spot` so it doesn't clash with the trait of the same name.
pub struct SpotPath<R, Z, S>
where
    R: Region,
    Z: Zone<Region = R>,
    S: Spot<Region = R, Zone = Z>,
{
    region: R,
    zone: Z,
    spot: S,
}

You probably don't need to switch to traits like that, but it might be more scalable in the future: if you want to have more and more regions, zones, and spots, then not needing to know all of them at compile time might give you more freedom to update things without long compile times.

Yes, in movements.rs there are a handful of functions that match on pairs; the idea is to let the compiler build a table however it sees fit rather than generating a compact table myself. Of course the match is going to be sparse. I did once experiment with replacing entire functions with a constant and didn't see any effect from it.
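
For reference, the shape is roughly this (a sketch with a made-up three-variant enum and function name; the real SpotId has 836 generated variants):

#[derive(Clone, Copy, PartialEq, Eq)]
enum SpotId {
    AmagiMainArea,
    AmagiWestLake,
    EbihBaseCamp,
}

// A movement function matches on pairs of spots and lets the compiler decide
// how to lower the (very sparse) dispatch.
fn travel_time(from: SpotId, to: SpotId) -> u32 {
    match (from, to) {
        (SpotId::AmagiMainArea, SpotId::AmagiWestLake) => 5,
        (SpotId::AmagiWestLake, SpotId::AmagiMainArea) => 5,
        // ...one arm per connected pair; everything else falls through
        _ => u32::MAX,
    }
}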

As for restructuring the enum, that's an interesting idea I'll have to think about. I'd rather not have too much indirection for performance reasons, and part of the benefit of using C-like enums is compactness with rmp and ease of use of EnumMap to make tables.
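
(The EnumMap tables I mean look roughly like this; a tiny sketch with a stand-in enum rather than the real generated SpotId:)

use enum_map::{enum_map, Enum, EnumMap};

// The derive(Enum) is what makes the C-like enum usable as a dense array key.
#[derive(Enum, Clone, Copy)]
enum SmallSpotId {
    A,
    B,
    C,
}

// A table with one slot per variant: no hashing, just array indexing.
fn base_costs() -> EnumMap<SmallSpotId, u32> {
    enum_map! {
        SmallSpotId::A => 1,
        SmallSpotId::B => 2,
        SmallSpotId::C => 3
    }
}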

The next things I tried:

  • checked ops and saturating ops: No effect. I also looked at these in godbolt to get an idea of performance and it seems like checked ops would win on instruction count. I might actually use that just to make sure there aren't underflows, but realistically it's as I expected here too.
  • Removing the entirety of Expectation: I replaced the associated type where needed with PhantomData. No effect.

Looking at data sizes now: Context is 200 bytes, ContextWrapper is 232, and World is 96584. There are 836 entries in SpotId. There are some functions in the analyzer that will create an EnumMap<SpotId, ContextWrapper>, which should be about 193,952 bytes. The largest match is a movement function that matches on ([bool; 1], SpotId, SpotId) and returns a u32, so if that generates a table it should be about 44,729,344 bytes (~43 MiB), and I assume that table could be generated even in the optimized build. None of those seems like such an outsize amount that it would produce a 35 GiB peak during compilation (though I know I should fix them somehow). (Yes, that's more than in my original post, because of the stuff I've added since.)

The reason I was looking into the issue I linked was my use of interconnected associated types in traits; in that issue the compiler goes into an infinite loop, but my circular associations do resolve. Someone mentioned that trait impls do check for whether an overflow would occur ("[in case 1] it correctly determines that the Filter impl overflows trying to check if T already implements Filter"). I can't tell for sure whether it's related to that, because I have types in that web that grow based on the number of SpotIds I have, and I still don't understand what it's allocating.

Is there a way I can debug the compiler? Interrupt it and get a dump? Get better log output?

I created a branch that uses nested enums instead of one giant enum, and compiler memory usage dropped back to 15 GiB. It's harder to return slices this way, so there's a bit of a performance loss in a few places, which I think has more to do with constructing Vecs (impl Trait in trait method return position not being stable yet).

I don't think I have a good way to turn the enums into structs that deal with the data at runtime instead of at compile-time. But since I do have a workaround that doesn't hurt, and which is enabled by default for release mode, I'm going to move on for now with the original version until I get more insight into the problem or find better alternative implementations.

How goes it? Try “valgrind --tool=massif rustc -- your_compile_options” and “heaptrack rustc -- your_compile_options” and “rustc -Z time-passes” and “rustc -Z self-profile”, and maybe even “cargo bloat” if you have many dependencies!

https://wiki.alopex.li/WhereRustcSpendsItsTime

I had to replace my SSD over the weekend so I haven't had time to try out your suggestions. But at this point, apparently I've added enough stuff that I can no longer compile with 48 GiB of memory provided to WSL.

Try “valgrind --tool=massif rustc -- your_compile_options” and “heaptrack rustc -- your_compile_options” and “rustc -Z time-passes” and “rustc -Z self-profile”, and maybe even “cargo bloat” if you have many dependencies!

I've previously tried most of these: massif didn't give me good feedback; time-passes was mentioned in the OP; and self-profile was interesting to look through, but nothing stood out. heaptrack seems interesting, but heaptrack cargo rustc [OPTIONS] measures only the tiny amount of time cargo itself runs, and running rustc separately is trickier since I haven't figured out the right invocation to build my crate... I'll get it from cargo rustc -v, I guess.

OK, heaptrack seems to need me to build rustc in debug mode, because it got nothing:

total runtime: 148.509000s.
calls to allocation functions: 0 (0/s)
temporary memory allocations: 0 (0/s)
peak heap memory consumption: 0B
peak RSS (including heaptrack overhead): 49.58G
total memory leaked: 0B

I'll get back to you on how that goes.

Huzzah! Building the compiler and running it through heaptrack seems to be the way to go. Here's what I did:

  1. Built rustc with debug options enabled per instructions.
  2. Installed heaptrack and heaptrack-gui.
  3. cargo +stage1 rustc -p axiom_verge2 --lib, letting it build all dependencies first.
  4. cargo +stage1 rustc -p axiom_verge2 --lib -v to get the full rustc command line.
  5. cargo +stage1 clean -p axiom_verge2 to ensure starting from scratch.
  6. heaptrack [RUSTC_COMMAND]

Here's a great image that heaptrack produced:

I've also got a stacktrace of the largest allocations so my debugging objective is complete with this. Thanks so much for your help!

Very nice, thanks for documenting your process!

Did you draw any conclusions about what is causing the memory usage spike or how to reduce it?

Yeah, the stack trace helped me file a bug against Rust, and I included the information heaptrack gave me, which pointed at MIR dataflow. Someone familiar with that area looked into it and found some fairly extreme expanded code that appeared to be coming from enum-map, and when I went looking for the source, I found a relevant bug: #104 - enum-map-derive produces needlessly large amount of code in unit field case - enum-map - Codeberg.org

In short, #[derive(enum_map::Enum)] generates code like 0usize+1+...+1. Debug builds do not constant-fold by default, and they also enable runtime overflow checks on each +. Together, these defaults result in very large amounts of memory usage ("MIR validation has memory usage which is O(blocks * locals)").
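
To give a sense of the shape (an illustration of the pattern described above only, not the literal derive output):

// Illustrative sketch: each variant's index is built as a chain of additions,
// and with overflow checks enabled every `+` gets its own checked add and
// panic branch in the MIR.
enum Example {
    A,
    B,
    C,
}

fn index_of(e: Example) -> usize {
    match e {
        Example::A => 0usize,
        Example::B => 0usize + 1,
        Example::C => 0usize + 1 + 1,
        // ...one more `+ 1` per additional variant
    }
}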

I don't have an answer on how to reduce it on my end... the issue is pervasive enough that my refactoring of the large enum wasn't super helpful. In the end I'll keep my workaround (overflow-checks = false) until enum-map-derive is fixed.
