Why are most crates > 1000 LOC?

If:

  1. Rust slow compile time is a big problem.
  2. Parsing / type checking crates can not be parallelized.

Then:

Why are most projects not split up in a way such that most crates are < 1000 LOC ? That seems to be the sweet spot for paralleling builds.

2 Likes
  • Rust's compile time problems are wildly overblown, just like some people like to describe the borrow checker as some obscure dark art, when in the real world that's something you barely think about 99% of the time. Rust is no slower than other compiled languages of similar complexity (Typescript, C++, Scala) and can easily be faster, given some luck and effort.
  • Parsing is a negligible portion of compile times. Typechecking is more noticeable, but also quite small. The biggest offenders are proc macros, codegen, linking, and some rare cases of super complex trait inference (like in typenum). Typechecking Rust is Turing-complete, so its timing is theoretically unbounded, but in practice it matters only when you write types and traits which make your head spin.
  • Linking is worst case quadratic, so using many small crates (and thus many object files) can significantly increase compile times. Using a fast linker really helps.
  • Codegen and optimizations take a lot of time, and having a large public API surface of a crate, with many generic functions, significantly slows down codegen (the same functions must be recompiled over and over with different type parameters). Small crates also inhibit optimizations (cross-crate optimizations are much less likely to happen), unless you enable LTO (link-time optimizations). LTO is a major compile time drain of its own.
  • The 1000 LoC suggestion is pretty much out of the blue. There is no reason why this would be better or worse than any other size. Split your crates on sensible API boundaries.
  • For that point, maintaining many small crates is a bigger pain, and depending on many tiny crates is also much more work, if you actually try to ensure quality of your dependencies. Every separate moving part is an extra potential point of failure. You don't want to have more moving parts than really necessary.
  • For plenty of crates, there is just no sensible way to split them into tiny crates. Take a look at syn, for example. It gives you a Rust parser. It's huge. How do you split it into parts? A separate crate for function parameters, and a separate crate for where-clauses? That doesn't make any sense. For that matter, tens of thousands of LoC is a much better estimate for the size of a reasonably complex crate.
35 Likes

1000 is not a made up number. I sent nearly a week staring at the output of cargo build --timings; and from looking at the cyan (non parallelizable) vs purple (parallelizable) portions, 1000 LOC seems to really speed things up.

Something else that jumps out to me is that Rust's compile times aren't slow for no reason. Rust offers something relatively unique; the potential to program using high-level constructs with a compiler that can see through those and generate highly optimized low-level machine code. But that analysis the compiler does has a cost.

Of the set of { high level source, high performance results, fast compile times } of course it's possible to choose a different two; you could program in assembly and your trade-off is the high level source, or you could program in a language like Python where your trade-off is the runtime performance. I'm enthusiastic about Rust because I respect this exact prioritization.

One of the reasons many smaller crates improves compile time is because it breaks the code into smaller chunks for the optimizer to consider at a time - parallelism is a big win here, but it comes at the cost of narrowing the scope of optimization. 1000 lines sounds to me like it would correspond to between 10 and 100 functions per-crate. That sounds (to my personal sensibilities) like an extremely narrow scope for the optimizer to work with.

It seems like you care a lot about compile time and that's a reasonable position to have, but on the other hand people with priorities like mine would I think be quite disappointed if a majority of ecosystem crates went in this direction and it resulted in measurably worse runtime performance due to missed optimization opportunities.

10 Likes

The crates you're compiling are not necessarily representative of all possible kinds of crates though. Moreover how did you even counted those 1000 LOC? Macro can generate a lot of heavy code in just 1000 LOC for example.

7 Likes

Using cloc, on the pre-macro-expanded code.

I have 2-3 handwritten declarative macros for slight improvement over panic!and one procedural macro that does mini-serde.

1 Like

As a concrete example, the either crate is implemented in a single file of 1471 lines, of which the last 160 are tests. The exact LOC metric for this will depend on your counting methods, but it will be in the general ballpark of 1000.

From a developer’s perspective, either feels like an extremely tiny crate: It provides the definition of one relatively trivial enum and a suite of trait implementations for it. If the crate didn’t exist, or wasn’t as popular, I’d think nothing of writing my own project-specific Either type. It would be much smaller, though, as I would only bother implementing the traits that I needed in my project.

This sort of smaller Either type wouldn’t fare very well on crates.io, though: It would be too unlikely to have the trait implementations necessary for your project. So it either hangs around as a utility crate serving one other crate, or it grows all of these other trait implementations to be generally useful. This is a common pattern: “productionizing” a simple, useful utility often requires a significant LOC increase in order to cover enough use cases for the community to take notice.

11 Likes

I agree, maybe 1000 LOC / crate is too small for production code, and it should be something like maybe 3000-5000.

As another concrete example, consider web_sys. If I was designing this, I would probably want it to be one crate per FEATURE; then have an "overall" crate that includes all the tiny micro crates. I suspect something like that would massively parallelize the build time for web_sys.

I will absolutely concede to you the point that maybe 1000 is not the right numeric constant, but I feel the general direction of shattering larger crates into smaller crates would drastically speed up build times. Especially since CPUs have been on the more cores rather than more-hertz trajectory lately.

Maybe I don't have enough cores (4), but for me most of the time each core is 100% busy when compiling. It looks parallelized enough.

One reason not to split crates too much: sometimes I want to check how something is implemented, so I click "source" in the doc. Then: "oh no, it has a inner". Hopefully this inner type is defined just above in the same file. But sometimes it's hidden behind a macro or a use *. So I have to check the transitive dependencies docs, or even go to their repositories, to find what is this inner type.

9 Likes

Current machine is 28 core / 56 threads. Most cores are idling when I cargo build. I think 100+ core 200+ thread machines can be had for < $5k right now.

IntelliJ / goto-definition generally works fine across multiple crates.

In my case, it is because making a crate has significant overhead in terms of more files.

If making a crate was as easy as making a "mod", I suspect lots of crates would be split into many tinier pieces. I have seen a recent blog post which discussed this, but I now can't find it.

3 Likes

That is more an issue of the dependency graph rather than crate size. A many-core machine chews through all the leaf crates really fast. After that you quickly get into the situation where things depend on build scripts, proc macros, dylibs or have long dependency chains.
Macro expansion happens fairly early, so other crates can't even complete the parsing stage before macros from their dependencies are available.

There are a few crates where the serial parts of the compiler are truly expensive, e.g. libcore which includes a gazillion lines via core_arch (simd intrinsics). But often it's the other stuff that's lengthening the critical path in the build graph.

9 Likes

Is Yoshua Wuyts' Inline Crates the blog post you were thinking of? The suggestion in that post is to have a crate keyword to define "inline crates", akin to the mod keyword for defining inline modules.

1 Like

This blog post says something disturbing to me: this is only about rustc specifically - the LLVM part of the compilation may parallelize individual crates.

I kinda was sure that it's C++ side, LLVM, which makes it impossible to make rustc non-singlethreaded.

The work which takes a lot of time in rustc without optimizations differ for different code, but I suspect that it's all in highly parallelizable code for most crates: typechecking, type inference, lifetimes…

Why can't rustc just use all the cores? It's written in a language which boldly proclaims that it provides fearless concurrency, yet it can not use more than one thread? WTH?

1 Like

Tokio previously consisted of quite a few crates, but this split was removed because users found it confusing.

4 Likes

Amdahls Law is one reason.
In context it means roughly that there is a hard limit to what can be parallelized. Type checking is a good example of the dependent nature of some computations: if you want to check that an argument to a fn is well-typed, you need the definition of the type of that argument first. That is, it involves serial steps that simply cannot be parallelized at all.

Another is the fact that rustc is probably the Rust project with the most legacy code. I mean that both in the sense that such code should be replaced, and in the sense that it's old and thus probably still uses idioms from pre-1.0 here and there.
This slows down adoption of parallellization because the code simply wasn't written with massive parallellization in mind. That tracks, because most of that code is already highly complex in its singlethreaded form.

Note that fearless concurrency doesn't mean easy or brainless concurrency. It still requires effort to actually include it in your code, if only by making sure your code can actually make use of it and prepare it for that, e.g. by using iterators so that the code can switch to parallel iterators.

12 Likes

There's no "WTH" there. It's a simple fact of life that not every computation is parallelizable.

Others have already explained this issue in more detail above; the essence of the problem is that compilation of crates that depend on each other can't be parallelized, pretty much by the definition of dependence. If you have 8 crates, each of which uses another other one in a chain, then you won't be able to compile them in parallel even if you have 8 cores.

And before you ask: yes, parsing could probably be done in parallel, because the syntax tree is independent of crates (but even that is only true if they don't use macros from other crates). However, with further steps, basically all bets are off. You have to wait for at least semantic analysis of the dependencies in order to perform the semantic analysis of the dependent crates, and since codegen also depends on semantic analysis, that has to wait too, albeit it could be pipelined at least (after semanal of dependency, codegen dependency in parallel with semanal of dependent). However, this doesn't help, because: 1. parsing completes in a fraction of the time compared to semantic analysis, codegen, and linking, and 2. linking doesn't parallelize for the most part, either.

So what can reasonably be done is to compute a dependency DAG at the crate level, and then compile only the crates in parallel that don't depend on each other. And this in fact is already implemented, so the Rust build infrastructure doesn't miss any obvious parallelization opportunities by reasonable standards.

10 Likes

Linking can actually be parallelized pretty well. Parallel linkers like LLD and more recently Mold have managed to see some pretty impressive performance gains by parallelizing what they can. Mold even manages to get close to full saturation most of the time. It does usually require a bit of configuration to get rustc to use one of them as opposed to the system's linker, though.

6 Likes

I do agree adding more crates has been a pain, not so much the "cargo new --lib", but then adding the path to the workspace and adding the path of dep crates.

I do agree that "tiny crates" is not enough, and what we need is "tiny crates + shallow tree".

In my problem, "long dependency chains" is a solvable problem because:

  1. I do not modify external dependencies, so their contribution to incremental compile time is 0.

  2. For code in my own workspace, I can generally re arrange things to minimize dependency chains.