"header/impl" separation trait trick for compilation time?

  1. Rust has a basic rule: impl Some_Trait for C must be done either in the crate where Some_Trait is defined or in the crate where C is defined.

  2. Suppose now we define a trait Some_Trait. In crate crate_some_trait_t we must then do:

pub trait Some_Trait { ... }
impl Some_Trait for ... Rust builtin types ... {}
impl Some_Trait for ... structs/enums from external deps ... {}
  3. So now, the compile time of crate_some_trait_t looks like:
parse Some_Trait : milliseconds
parse impl Some_Trait for Rust builtin types: seconds
parse impl Some_Trait for structs/enums from external deps: seconds
code gen: seconds
  4. Now, suppose we have "crate_foo", which depends on having a Rc<&dyn Some_Trait>. In theory, we should only have to wait the few milliseconds it takes to parse the definition of Some_Trait; in practice, we have to wait for all the parsing/codegen time on the Rust builtin types and the structs/enums from external deps.

Is there some cool trick around this problem?

crate_some_trait: defines pub trait Some_Trait
crate_foo: uses Rc<&dyn Some_Trait>

Can we get crate_foo to start compiling without waiting for crate_some_trait to do all the parsing of the impls for Rust builtin types and for the structs/enums of external deps?

  5. In C++ terms, what I want is a "header" that defines pub trait Some_Trait, separate from the impls of the trait.

Not really relevant to the question, but the orphan rules are more nuanced than that.

You could make the implementations a feature I suppose... assuming you don't need to coerce from any of the implemented types.
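For illustration, here is a minimal sketch of what feature-gating the impls could look like. The feature name `builtin_impls`, the method `describe`, and the helper names are all made up for this example; the feature would also need to be declared under [features] in the trait crate's Cargo.toml.

```rust
// Hypothetical contents of the trait crate.
pub trait SomeTrait {
    fn describe(&self) -> String;
}

// Only parsed past the cfg check and type-checked when the (assumed)
// `builtin_impls` feature is enabled for this crate.
#[cfg(feature = "builtin_impls")]
impl SomeTrait for u32 {
    fn describe(&self) -> String {
        format!("u32({})", self)
    }
}

// Downstream-style code that only needs the trait definition,
// not any of the gated impls.
pub fn show(value: &dyn SomeTrait) -> String {
    value.describe()
}

// A local implementor so the sketch is usable with the feature off.
pub struct Local;

impl SomeTrait for Local {
    fn describe(&self) -> String {
        "Local".to_string()
    }
}
```

Note that cargo unifies features across a build, so if any crate in the workspace enables the feature, every dependent waits for the impls anyway.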

Is the memory layout guaranteed to be consistent?

If we consider something like this struct:

pub struct Foo {
  a: u32,
  b: i32, // feature gated
  c: f32,
}

It seems like turning a feature on/off would change the memory layout; is there something that guarantees everything stays compatible in the context of traits?
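To make the concern concrete, here is a small sketch. The two struct names are stand-ins for the two configurations of Foo (with a real feature gate there would be one struct with a cfg attribute on field b), and #[repr(C)] is added only so the sizes are predictable for the example.

```rust
use std::mem::size_of;

// Foo as it would look with the feature turned off.
#[repr(C)]
pub struct FooWithoutB {
    pub a: u32,
    pub c: f32,
}

// Foo as it would look with the feature turned on.
#[repr(C)]
pub struct FooWithB {
    pub a: u32,
    pub b: i32,
    pub c: f32,
}

// Returns the sizes in bytes of the two configurations.
pub fn sizes() -> (usize, usize) {
    (size_of::<FooWithoutB>(), size_of::<FooWithB>())
}
```

With three 4-byte fields instead of two, the size goes from 8 to 12 bytes, so code compiled against one configuration could not safely be linked against the other.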

The layout of &dyn Trait and friends? No. The ptr metadata RFC is how you're intended to destruct and construct such things once stable. (Sorry for no link, on mobile.)

I'm struggling to see the relevance though - compiling foo without the feature but then linking it to the trait crate with the feature (in some larger context)? That's not how cargo works at any rate. I don't know how viable it is with direct invocations.

crate_foo_t:
  1. pub trait Foo_T {} // a few milliseconds to parse
  2. impl Foo_T for structs / enums from external deps: seconds to parse

crate_a:
  uses Rc<&dyn Foo_T>

crate_b:
  uses Rc<&dyn Foo_T>

crate_final:
  includes foo_t, a, b

Right now, crate_a and crate_b cannot start compiling until step (2) of crate_foo_t is finished.

I would like crate_a and crate_b to start compiling as soon as step (1) of crate_foo_t is finished.

On multi-core systems, this would let us parallelize more of the build.

The answer is no, not really.

But also that this doesn't matter as much as you think.

The linear part isn't free, but it's not all that expensive either. Cargo builds are already pipelined such that downstream crates can start compiling before the upstream crate has done codegen -- essentially after the upstream crates have finished cargo check.

That part of the compilation process creates an .rmeta file, which is what essentially acts as the "header file" for Rust.

Additionally, this only matters for an initial clean compile, which while still important, is not the common case. For every compilation after the first (even without incremental, which is a crate-local configuration), the upstream crate's compilation is reused, so you only have to recompile the crate that changed (plus anything downstream from it[1]).


If this is a serious concern, you can switch to a dependency injection scheme instead of direct trait implementation. (Have a strategy parameter with associated types and functions to do the work, and glue crates can define and implement the strategies.) However, this is likely to hurt more than it helps in all but the most extreme cases. The better bet is just utilizing cargo/rustc's built in incremental; it can do better than you can in a lot of cases anyway and is a lot easier to work with than trying to split your code manually.


  [1] There's some room for improvement here still; if a crate's .rmeta is unchanged, it's theoretically possible to skip recompiling downstream crates as well (assuming no/minimal fat inlining) and only relink.


I'm reading the output of cargo build --timings. Cyan for parsing, purple for codegen. This is currently a bottleneck and killing parallelism for me.

I agree, this is why (2) measures parsing time, not codegen time.

===

At this point, it is a matter of obsession. I have spent the past few days doing nothing but optimizing compile time. For this 45 kloc workspace, I have the cold compile time, after building external deps, in release mode, down to 5-6s.

RELEASE mode. COLD compile. 45 kloc. 5-6 seconds. (external deps precompiled)

Incremental rebuilds are now generally < 1s; sometimes 1-2 s.

One particular technique I have been able to abuse/exploit -- is that this is a solo project, so I have the freedom to shatter/re-arrange all my own crates, to maximize parallelism.

There are a few "blips" in the CPU usage graph from cargo build --timings where, if I can parallelize things slightly more, I'm confident I can hit a sub-4s COLD compile RELEASE mode build time.

====

If you have good examples of this, I'm interested in seeing it in action. Even if I don't end up using the technique, I'm just curious how far it can push parallelism.

That sounds like success to me. You're under the "stop paying attention and go check twitter" threshold for the cold compile, and it's "conversational" response duration for the incremental.

If you want faster feedback than that, you want IDE integration for as-you-type incremental updates, not command-line batch.

(Personally, I don't think anyone needs a successful compile to take less than 1 second. That's more than fast enough.)


It's not that I need it for practical purposes; it's that I'm curious how far cargo/rustc builds can be pushed.

Here, for example, is the current CPU usage graph. If I could shave off 500ms from parsing here and there (parsing only the trait declaration and not the impls), I'm fairly confident this parallelizes to the point where it goes sub-4s for a RELEASE mode COLD build.

This reminds me of the parable of the call centre that was slow picking up calls, so started tracking the number of rings before someone picked up -- and then continued to track and try to improve that average even when it got down to 1.2 rings, which was fast enough that nobody was hanging up any more.

It's obviously bad if it takes 100 rings. But once you're picking up all the calls in under two rings, you don't need to be any faster. It's fine.


At this point, I think your time is better spent improving rustc and looking at what it's doing during the ramp-up (timestamp 0s through 2.1s) on your project.

Given the topic of the thread, I expect that OP has a single crate (or otherwise a small number of crates) which nearly all other workspace crates (potentially indirectly) depend on; they all use the trait definition, but different crates use different "feature sets" (trait impls on upstream types) of the shared crate.

This will bottleneck parallelism on the rmeta stage of compilation of the root crate. And because it's trait impls, it's not simple to separate the root into separate crates to break this bottleneck.

It's probably worth checking cargo depgraph --dedup-transitive-deps to see if you have such a pinch point.

Other possibilities are that cargo is ordering the crate compilation tree suboptimally, or that a number of small initial dependencies are limiting the effectiveness of rustc multiplexing. Either one may potentially be impacted by tweaking the number of jobs cargo will keep in flight at any one time (-jN IIRC).

strategies for crate splitting

Case 0: trait impls

trait DoWork {
    fn do_work(&self);
}

impl DoWork for Upstream { … }

Case 1: newtypes

trait DoWork {
    fn do_work(&self);
}

struct UpstreamWrapper(Upstream);
impl DoWork for UpstreamWrapper { … }

shrinkwraprs or similar can make using newtypes easier.
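A minimal sketch of how Case 1 could be laid out, with modules standing in for separate crates. The module names, the `run` helper, and the u32 return value (added so there is something to observe) are assumptions for this example, not part of the snippet above.

```rust
// Stand-in for an external dependency.
pub mod upstream {
    pub struct Upstream(pub u32);
}

// Stand-in for the small "header" crate: trait definition only,
// so it finishes compiling almost immediately.
pub mod work_trait {
    pub trait DoWork {
        fn do_work(&self) -> u32;
    }
}

// Stand-in for a glue crate depending on both of the above.
// The wrapper is local here, so the orphan rule allows the impl.
pub mod glue {
    use super::upstream::Upstream;
    use super::work_trait::DoWork;

    pub struct UpstreamWrapper(pub Upstream);

    impl DoWork for UpstreamWrapper {
        fn do_work(&self) -> u32 {
            // Placeholder "work": double the wrapped value.
            self.0.0 * 2
        }
    }
}

// Consumer code that needs only the trait definition.
pub fn run(worker: &dyn work_trait::DoWork) -> u32 {
    worker.do_work()
}
```

Crates that only take a &dyn DoWork would then depend on the trait crate alone, and only the crate that builds the wrapper needs to wait for the glue crate.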

Case 2: strategies

trait WorkStrategy<T> {
    // &self is optional, but
    // - easier to work with than PhantomData
    // - typically zero cost for monomorphized ZSTs
    // - minimal possible cost: a single pointer argument
    // - more flexible
    //   - strategy can store state (use &mut if possible)
    //   - strategy can be a trait object (if object safe trait)
    fn do_work(&self, data: &T);
}

struct UpstreamWorkStrategy;
impl WorkStrategy<Upstream> for UpstreamWorkStrategy { … }

Using associated types is also extremely powerful, as it can reduce generics sprawl — but it can also be quite noisy when involved, and precludes the use of trait objects without extra trickery.
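A minimal sketch of Case 2 along the same lines, again with modules standing in for crates. The module names, the `run_with` helper, and the u32 return value are assumptions made for the example.

```rust
// Stand-in for an external dependency.
pub mod upstream {
    pub struct Upstream(pub u32);
}

// Stand-in for the small crate holding only the strategy trait.
pub mod strategy_trait {
    pub trait WorkStrategy<T> {
        fn do_work(&self, data: &T) -> u32;
    }
}

// Stand-in for a glue crate: the strategy type is local to it,
// so it may implement the trait for the upstream data type.
pub mod glue {
    use super::strategy_trait::WorkStrategy;
    use super::upstream::Upstream;

    pub struct UpstreamWorkStrategy;

    impl WorkStrategy<Upstream> for UpstreamWorkStrategy {
        fn do_work(&self, data: &Upstream) -> u32 {
            // Placeholder "work": add one to the wrapped value.
            data.0 + 1
        }
    }
}

// Generic consumer: knows only the trait, and is handed the
// strategy and the data by whoever wires things together.
pub fn run_with<T, S: strategy_trait::WorkStrategy<T>>(strategy: &S, data: &T) -> u32 {
    strategy.do_work(data)
}
```

Compared with the newtype version, the data is passed around unwrapped and the strategy value is what gets injected at the call site.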


0s to 1s is basically:

  • traits / structs for a mini-serde impl I have
  • a procedural macro that does #[derive(Mini_Serde)]
  • a number of small crates (raw bindings to web_sys) that can operate without mini_serde / procedural macro

1s to 2s range:

  • 2d/3d/4d points / matrices
  • other low level primitives we need serialization support on

The gist of it is:
0s - 1s: traits / structs / procedural macro for "mini serde"
1s - 2s: a bunch of "primitives" that also need "mini serde"

Then after 2s, we get massive parallelization.

Separate to this, there are some crates (easier to use web_sys bindings) that do not need mini-serde and can be built in parallel.

====

Yes. This actually happens more than once, too. I'm not able to apply this trick to my "mini-serde" impl, but applying it elsewhere has increased parallelism in later stages.