Soft question: scaling codebase 50k loc -> 500k loc

This is not a "right answer" question. I am primarily looking for recommendations for books / resources / blog posts. There will likely be followup questions.

I'm primarily interested in Rust resources, but given the nature of the topic, I'm open to generic / non-Rust advice too.

Here's the issue: back when I was working in dynamically typed languages (no static types), there was a productivity barrier I hit at around 10k lines of code. Refactoring becomes a drudge: I rename something, I don't find all the refs, something else becomes nul/nil/undefined, and I get a runtime error. Because of this, I am reluctant to refactor. Bad design accumulates, and dev time slows down.

With Rust/IntelliJ, rename-refactoring is no longer an issue. IntelliJ resolves all refs (outside of macros), and whatever IntelliJ misses, the static type checker catches. However, at around 50k-100k LOC, I still run into a "refactoring becomes a drudge; shitty design accumulates" problem, largely because everything knows too much about everything else.

I'm looking for concrete advice from those who have scaled codebases from 50k-100k LOC to 500k+ LOC. In particular, I am curious about things like:

  1. how do you architect large codebases with "components" that are easy to replace / refactor ? One approach I have been thinking about w/ regards to this is Programming against traits in Rust - #19 by zeroexcuses

  2. how do you architect large codebases that are easy/productive to develop against ? (one measure is: minimizing the # of concepts the coder has to keep in mind; another is components with "predictable" behaviour, etc ...)

  3. One thing I'm really drawn towards is:

"Show me your flowchart and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won't usually need your flowchart; it'll be obvious." -- Fred Brooks, The Mythical Man Month (1975)

This includes Entity-Component-System design. Even outside the benefits of cache locality, I like how there is a central store for state (the entities and components), while the update rules are coded up in systems -- which feels like loose coupling and allows independent refactoring / updates.

1 Like

You might be interested in the hexagonal or clean architecture since you asked about programming against traits.

The key point is that all of your modules have to be loosely coupled: dependencies should be passed in rather than instantiated here and there.
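A minimal sketch of what that can look like in Rust (all names here are invented for illustration): the service receives anything implementing a trait instead of constructing a concrete dependency itself, so swapping the implementation never touches the service.

```rust
// Hypothetical example: `Signup` never builds its own mailer.
trait Mailer {
    fn send(&self, to: &str, body: &str);
}

struct Signup<M: Mailer> {
    mailer: M, // dependency is passed in, not created here
}

impl<M: Mailer> Signup<M> {
    fn new(mailer: M) -> Self {
        Self { mailer }
    }

    fn register(&self, email: &str) {
        // ... domain logic ...
        self.mailer.send(email, "welcome!");
    }
}

// A trivial adapter; replacing it with an SMTP-backed one
// requires no change to `Signup` at all.
struct StdoutMailer;

impl Mailer for StdoutMailer {
    fn send(&self, to: &str, body: &str) {
        println!("to {to}: {body}");
    }
}

fn main() {
    let signup = Signup::new(StdoutMailer);
    signup.register("user@example.com");
}
```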

1 Like

how do you architect large codebases with "components" that are easy to replace / refactor ?

  1. Develop the habit of, when you create some interface/API/abstraction boundary (whether it's a trait or a module with some functions, a wire protocol, a file format), thinking about what you're choosing — what it makes easy, what it makes hard, how it is extensible and inextensible. Then you will have a better chance of noticing when you make a choice that will restrict future changes.

  2. Remember that not all such restrictions are to be avoided — sometimes it's better to have something simple that can be thrown out, rather than something complex and highly abstracted.

  3. Approach large changes incrementally; when you discover a need, refactor until the replacement is easy, then do it, and finally refactor to clean up the leftovers of the old design.

how do you architect large codebases that are easy/productive to develop against ?

Make it easy to learn by doing, by all the little things that avoid speedbumps in the development experience.

  • Fast compile times for small changes. Besides general compilation perf advice, think about your dependency graph and minimize how much development of crate A then requires building and testing dependents A→B→C→D.
  • Tests that are fast, don't have weird requirements to run, and are easy to modify and expand.
  • Well-documented (and well-named) modules (and types and crates) that explain how they should be used, so that working on a particular module doesn't demand existing familiarity with the modules it uses, but merely consulting the docs.
6 Likes

At the risk of being accused of "changing the question", one thing I am becoming convinced of is that I want to push my code towards the RDBMS side of Object–relational impedance mismatch - Wikipedia

So in particular, my current Rust code is a bunch of structs / enums / traits holding Rc<...> pointers to each other, forming one giant "object graph" (here "object" is used loosely to refer to the structs / enums / trait objects).

Instead, I am looking for ways to structure the code (without embedding sqlite) in a more "RDBMS-like" manner: there is some centralized data store with "tables", and the rest of the code transactionally queries / updates that centralized store.
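To make that concrete, here is a rough sketch of the shape I have in mind (all names invented): rows live in central "tables" and refer to each other by typed IDs instead of Rc pointers, and "queries" are just functions over the store.

```rust
// Typed row IDs instead of Rc<...> pointers between objects.
#[derive(Copy, Clone, PartialEq, Eq)]
struct UserId(u32);

#[derive(Copy, Clone, PartialEq, Eq)]
struct OrderId(u32);

struct User { name: String }
struct Order { owner: UserId, total_cents: u64 }

// The centralized data store: one Vec per "table", index == id.
#[derive(Default)]
struct Store {
    users: Vec<User>,
    orders: Vec<Order>,
}

impl Store {
    fn insert_user(&mut self, name: &str) -> UserId {
        self.users.push(User { name: name.to_string() });
        UserId(self.users.len() as u32 - 1)
    }

    fn insert_order(&mut self, owner: UserId, total_cents: u64) -> OrderId {
        self.orders.push(Order { owner, total_cents });
        OrderId(self.orders.len() as u32 - 1)
    }

    // A "query": join orders against a user without any back-pointers.
    fn total_for(&self, user: UserId) -> u64 {
        self.orders
            .iter()
            .filter(|o| o.owner == user)
            .map(|o| o.total_cents)
            .sum()
    }
}

fn main() {
    let mut store = Store::default();
    let alice = store.insert_user("alice");
    store.insert_order(alice, 1200);
    store.insert_order(alice, 800);
    assert_eq!(store.total_for(alice), 2000);
}
```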

If there are Rust projects using this type of architecture, I am interested in reading more.

1 Like

Some notes on that, from experience of approaching (but never exceeding) 0.5mloc:

Maybe some ultra high-level pithy things to add:

we talk about programming like it is about writing code, but the code ends up being less important than the architecture, and the architecture ends up being less important than social issues.

https://neugierig.org/software/blog/2020/05/ninja.html

At 0.5mloc, social issues dominate everything. Your org chart is even more important than your database schema, engineering culture trumps written code-style, etc.

Of the 0.5m lines of code, the most important 10k lines are those that were written first. In my experience, there’s comparatively little difference between a project when it is 10kloc old and a project that’s 300kloc old — the second one is bigger, but is not actually all that different. The design you were lazily turning in your head while dreaming about the project will be the design you’ll end up with. You can and should iterate on the tactical level. Strategically, it’s all about up-front waterfall.

Tests, build and CI make or break a project.

11 Likes

The modern version of this is an entity component system, such as bevy_ecs. They came out of the game dev world, but they're generally very useful, not just for games.
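For a taste of what that looks like, here is a minimal bevy_ecs sketch (the exact API differs between bevy_ecs versions; this is roughly the 0.12-era shape, so treat it as approximate): components are plain structs living in a central World, and a system is an ordinary function that queries only the data it needs.

```rust
use bevy_ecs::prelude::*;

#[derive(Component)]
struct Position { x: f32, y: f32 }

#[derive(Component)]
struct Velocity { x: f32, y: f32 }

// A "system": an update rule written against exactly the data it touches.
fn movement(mut query: Query<(&mut Position, &Velocity)>) {
    for (mut pos, vel) in query.iter_mut() {
        pos.x += vel.x;
        pos.y += vel.y;
    }
}

fn main() {
    let mut world = World::new();
    world.spawn((Position { x: 0.0, y: 0.0 }, Velocity { x: 1.0, y: 0.5 }));

    let mut schedule = Schedule::default();
    schedule.add_systems(movement);
    schedule.run(&mut world);
}
```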

3 Likes

I see that rust-analyzer is at 300k+ LOC. At this point, are significant refactorings of the core (non-plugin) parts still possible? Or, applied to rust-analyzer, were the core structures set in stone after you wrote the first 10k?

One clarification: maybe not the literal first 10k lines, but the first 10k lines of the rust-analyzer core.

When I first read this, I thought it was stupid. However, upon further reading:

  • "port" matches up quite well with "declaring a trait"

  • "adapter" matches up quite well with "implementing a trait"

and "abusing" this to "invert" the N-layered architecture into the port/adapter is quite similar to what I am going for

That's a tricky question to answer. I would say that large refactors of internals are possible, but painful, while the overall architecture is pretty much set in stone. But if you ask me where the boundary lies between a large refactor and re-architecting, I would say "well, exactly where you can't make changes anymore".

So let me give you some principles which could no longer be changed in the current code, but which could have been coded differently from the beginning:

  1. single-version principle --- rust-analyzer always has a single snapshot of the code at a time; time is modeled by changing this snapshot wholesale. That's not the only way to do it: RLS treated code as mutable, and Roslyn allows holding several different snapshots at the same time, since everything is immutable.

  2. lazy-analysis principle --- rust-analyzer is secretly rust-avoid-analysis-at-any-cost: it intentionally knows only a subset of things about the codebase. So, e.g., when you do "find usages" in rust-analyzer, what happens is not that rust-analyzer looks into the use-def chains it got while "compiling" the code; rather, it runs a heuristic text-based search (Find Usages) and then uses lazy analysis to prune out false positives (that's why searching for new is way slower than searching for frobnicate).

    An alternative here is for a language server to maintain a fully complete view of the code base (something which might be desirable if you want to push all the way towards incremental binary patching and live reload).

  3. as-if-analysis-is-complete principle --- the laziness is abstracted away. All IDE features build on top of a model which looks as if a completely compiled version of a snapshot of the source code were available. An alternative would be more explicit phasing in the IDE parts, where you don't just get the info, but schedule specific computations to run.

In contrast, here are some tactical things which are feasible to change:

  • migrate the typechecker to a library shared with rustc
  • upgrade salsa from the "sea of Arcs" to the "array with indexes" version
  • maybe change cancellation from unwinding to explicit results or async, but not removing support for cancellation altogether

While we are at it, a related story about an org chart:

This was a huge architectural bug in rust-analyzer. It was fixed recently through the heroic work of @Veykril, but, as you can see, it took us years to do something about a thing that was very clearly wrong, and wrong in a viral way (everything building on top of this wrong abstraction is also wrong).

But what's most curious here is the social aspect. The first-order technical story is that @matklad just didn't get how macros in Rust actually work back when the infra for macro expansion was coded for rust-analyzer. I implemented what I imagined to be the way macros work, but that was incorrect, and it took me some years to recognize that. Which is OK --- compilers are hard, I am of limited smartness, mistakes are made all the time, 64k should be enough for everybody.

What is really curious is that I identified "I don't know how macro expansion works" as a core risk from the very beginning. You can read about that in the very first paragraph that announced the thing that was to become rust-analyzer: RFC: libsyntax2.0 by matklad · Pull Request #2256 · rust-lang/rfcs · GitHub. I also recall specifically trying to get at this question of macro expansion at the second Rust all-hands in Berlin (really, Rust was able to fit in a single (big) room in those days!). But it is literally impossible to transfer the knowledge between the two code bases (rustc and rust-analyzer) unless there's someone who works to a large capacity in both. Both sides might be very much willing to talk shop and share all the knowledge they have, but the knowledge doesn't actually register until you go and start solving the problems yourself.

EDIT: to clarify, yes, all those aspects (and many other similar ones) were decided within the first 10k lines. I would even say before the first real line was written --- rust-analyzer is pretty much an execution of a design I arrived at somewhere in 2016, I guess? The macros again make an interesting study --- the actual code for macro expansion was written relatively late, I think past the 10k-lines mark, certainly after basic type inference. But those 10k lines were determined, in significant part, by the macro expansion code that was yet to be written!

8 Likes

It's absolutely always possible in any project. The question is always about time and effort needed to achieve that.

The best example is the Linux kernel. Remember?

It is NOT protable (uses 386 task switching etc), and it probably never will support anything other than AT-harddisks, as that's all I have :frowning:

Many design decisions were fixed early on, when Linux was small enough for that to happen easily, but not everything.

In particular, Linux was envisioned as a single-threaded OS, unlike Windows NT or OS/2, which were designed as SMP-capable from the start.

Yet today Linux does multi-threading better than Windows and, of course, better than today's version of OS/2 (yes, you can still buy OS/2 today and it even supports UEFI these days).

But note the timeline:

1991 — Linus posts the aforementioned message announcing the arrival of Linux
circa 1995 — a horrible hack is added to Linux to make a poor man's SMP work
2011 — the horrible hack is finally removed, this time for real

A similar surgery happened with real-time Linux: a hacky version was made in 2004, then it was refined to make it suitable for mainline Linux without making the changes too disruptive, and that refactoring is close to being finished now.

So in about 10-20-30 years even a huge codebase may be refactored pretty radically.

If there are enough interest and funding, of course.

But, well… in about 10-20-30 years even a huge codebase may be refactored pretty radically.

Is this glass half-full or half-empty?

You decide.

8 Likes

This is generally the root of problems in big systems I've dealt with, independent of language. There's a computer engineering concept of "high cohesion, low coupling" that becomes increasingly critical the larger the project gets.

I don't think you can reasonably look at 500k LOC as one product. It's multiple products working together, and it's essential that those parts have APIs that don't leak implementation details. Minimize the dependencies of each component (possibly building meta-components to wrap commonly used groups). Never reach "through" a component: a component's API shouldn't expose what it depends on in a way that lets you manipulate the dependency directly. That will greatly limit the scope of a lot of refactors.
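A tiny hypothetical sketch of what "never reach through" means in practice: the component owns its dependency and answers questions itself, rather than handing the dependency out for callers to poke at.

```rust
// Hypothetical: `Db` is an implementation detail of `Cache`.
struct Db;

impl Db {
    fn fetch(&self, key: &str) -> String {
        format!("row for {key}")
    }
}

pub struct Cache {
    db: Db, // private; callers never see it
}

impl Cache {
    pub fn new() -> Self {
        Self { db: Db }
    }

    // Callers ask the Cache for what they need...
    pub fn get(&self, key: &str) -> String {
        self.db.fetch(key)
    }

    // ...rather than something like `pub fn db(&self) -> &Db`,
    // which would couple every caller to Db and widen the blast
    // radius of any change to it.
}

fn main() {
    let cache = Cache::new();
    println!("{}", cache.get("user:42"));
}
```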

And the larger the codebase, the more essential test coverage (particularly coverage through the "front door" of those public APIs) becomes. Your language server may help you track all uses of something that changes, but it doesn't guarantee you made the transformation correctly.

(https://www.youtube.com/watch?v=wEhu57pih5w&list=PLD0011D00849E1B79 is a pretty good playlist)

4 Likes

I have been doing this for a long time to earn a living in a simple and easy way. My philosophy (or whatever you want to call it) is that there should be a small, thin non-DBMS layer that (almost) never changes (perhaps a few hundred lines of code) and is (almost) completely ignorant of the application. It has worked very well for me and my clients. The main difficulty then is getting the design of the database tables "right", and also having a well-automated way to generate user interfaces, making maintenance straightforward. It is a little boring, as there is very little "code" to be written; instead it is a matter of configuration.

There are definitely separate parts in large products. Heck, even in a 50k codebase there are separate parts.

But whether they leak implementation details or not is not really a choice: they do. Always. No exceptions.

The big question is: how do you fix mistakes in these APIs? You may have a rule that there are no stable APIs, and then your whole codebase may [slowly] evolve. Google even does that with a codebase measured in billions of lines of code (e.g. they recently switched all of it from C++17 to C++20, which was a significant engineering challenge, but it was done).

Or you may introduce a stable API, and then your parts become separate products for real (and they lose the ability to evolve, because of Hyrum's law).

In the latter case you may still evolve your product Apple-style (where you just break backward compatibility from time to time and ignore the cries and curses of the people left behind…). Apple can do that because it is the most lucrative platform today and developers are forced to endure it. But even Apple has to moderate the amount of pain it inflicts on developers: past a certain threshold they would leave the Apple platform anyway.

Yes. But the price is high: most refactorings would become easier, but some would become flat-out impossible in exchange.

Whether you think that's a good thing or a bad thing depends on your needs, really.

And xkcd: Workflow :wink:

That's fair: you can reduce the problem, but that doesn't mean you'll never hit it. A key point of Hyrum's law, though, is the number of consumers of an API. Inside one code base that number is not huge, so in my experience it's less of a problem.

But you will still hit the problem, and that's where tests become critical (along with making minimal use of mocks and/or having separate integration tests). Since we're discussing a single codebase, the interface is not locked in stone, and you can bug the team whose code you broke, or maybe it's your own (and this apparently works even at the scale of all of Google's internal code).

In my experience, tests are not very useful for avoiding bugs in new code; they are useful for confident refactoring. (Not perfect refactoring.)

And to that point, the more changes you make, the better the product you generally end up with (in my experience). It forces you to keep the abstractions that work well and ditch the ones you thought would be good but aren't sufficiently flexible in practice. Code bases with less churn tend to be tougher to fix.

Public APIs used outside your team/org/company are a different, trickier issue, where you are more prone to Hyrum's law by virtue of having more consumers. And yes, you either break anyone relying on undocumented behavior, or introduce a new API (typically, for me, with the old interface internally calling the new API, plus some hack to make it work for old callers so that no synchronized release of mobile apps/microservices/etc. is needed).
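For example (hypothetical names), the shim can be as small as the old entry point delegating to the new API with the old default behaviour baked in:

```rust
pub struct Connection;

pub struct Config {
    pub retries: u32,
}

// The new API: callers pass configuration explicitly.
pub fn connect_with(config: Config) -> Connection {
    let _ = config.retries; // pretend to use it
    Connection
}

// The old API survives as a thin wrapper, so existing callers keep
// compiling (with a nudge) while they migrate at their own pace.
#[deprecated(note = "use connect_with(Config { .. }) instead")]
pub fn connect() -> Connection {
    connect_with(Config { retries: 3 }) // old default preserved
}

fn main() {
    // New callers go straight to the explicit API.
    let _conn = connect_with(Config { retries: 5 });
}
```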

(I think Rust's solution there for multiple versions of a crate in one product is an interesting one)

1 Like

Not just the sheer number, but, more importantly, whether you care about those who couldn't upgrade.

G++ compiler developers initially proclaimed that they do not care (and GNAT developers still don't). That's why G++ today uses libstdc++.so.6 (and GNAT uses libgnat-13.so). But libstdc++.so.6 goes back to GCC 3.4, released April 18, 2004. That means API development stopped on that day: you may never change it, you may never fix warts in it, you may never make std::tuple<i32> be returned in a register instead of via a pointer.

It's all about tradeoffs. You can always refactor everything, unless the API is frozen because you have users who couldn't recompile their code.

Yes, and that is why I hate unit tests (as they are done with Java mock libraries) with a passion.

Most of the time these ubiquitous unit tests that people create with jMock just mirror the code of the actual module and ensure that it can't be [easily] changed. And since, as you have noted, tests don't help avoid bugs in new code, and they are more or less useless for refactorings (being tied to the implementation of your code, not to its interface), then what's the point? Why would I want/need to carry them?

Most of the time, tests should cover and use the official API of your module (which doesn't change when you do various refactorings), not mirror its internal structure! Then they become useful for refactorings.
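A small toy illustration (library-style snippet, not from any real codebase): the test below only exercises the public API, so it keeps passing however the internals get rearranged.

```rust
pub mod stack {
    #[derive(Default)]
    pub struct Stack<T>(Vec<T>);

    impl<T> Stack<T> {
        pub fn push(&mut self, item: T) {
            self.0.push(item);
        }
        pub fn pop(&mut self) -> Option<T> {
            self.0.pop()
        }
    }
}

#[cfg(test)]
mod tests {
    use super::stack::Stack;

    // Swapping the Vec for a linked list or a SmallVec leaves this
    // test untouched, because it only uses the public interface.
    #[test]
    fn pop_returns_last_pushed() {
        let mut s = Stack::default();
        s.push(1);
        s.push(2);
        assert_eq!(s.pop(), Some(2));
        assert_eq!(s.pop(), Some(1));
        assert_eq!(s.pop(), None);
    }
}
```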

After enough ossification they may even be declared impossible to fix. In reality that's never 100% true; with enough time and dedication you can fix everything, but if you need literally years just to write a series of tests that cover your program's API with enough confidence to start refactoring it… a full rewrite may be easier and cheaper at that point.

Yeah. No one has tried it on billions of lines of code yet, but so far it works adequately well.

1 Like

This depends on the style of test. For traditional unit tests: very much so. But there are other options: property-based tests, regression tests, integration tests, fuzz testing, etc.

They are all differently useful for different types of code at different stages of the life of the code base.

When writing a compiler I found differential fuzz testing very useful: I would use libFuzzer to generate programs in a subset of the language (no unbounded loops, no IO except stdout), compile them with and without the optimiser, and compare the output of running those programs. It helped find lots of bugs in my optimiser!
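Boiled down to a self-contained toy (this is not that compiler setup, just the shape of the check): run the same input through a reference path and an "optimised" path and assert they agree. With cargo-fuzz, the same body would sit inside libfuzzer-sys's fuzz_target! and receive fuzzer-generated input instead of hand-written cases.

```rust
// Reference implementation: obviously correct, not fast.
fn sum_reference(xs: &[u32]) -> u64 {
    xs.iter().map(|&x| x as u64).sum()
}

// Stand-in for the clever, optimised path that we distrust.
fn sum_optimized(xs: &[u32]) -> u64 {
    xs.chunks(4)
        .map(|chunk| chunk.iter().map(|&x| x as u64).sum::<u64>())
        .sum()
}

// The differential check: any divergence is a bug in one of the two paths.
fn differential_check(data: &[u32]) {
    assert_eq!(sum_reference(data), sum_optimized(data));
}

fn main() {
    // A real fuzzer would generate far nastier inputs than these.
    differential_check(&[]);
    differential_check(&[1, 2, 3, 4, 5]);
    differential_check(&[u32::MAX; 9]);
}
```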

2 Likes

Okay, my experience in scaling up code bases tenfold and beyond in a relatively short time is highly opinionated and I understand that a lot of people either disagree or have different views. That’s okay. I respect everyone regardless of their views.

Here are my three main takeaways:

  1. One crate per component, or one crate per service if it's about microservices. Aggregate components in other components to manage complexity.

Why?

  • keeps build times in check
  • makes all dependencies explicit
  • crates make refactoring a lot easier

Keep the folder structure flat, as that keeps build times in check.

The above approach keeps incremental build times at the second or even sub-second level for the top-level crates, and those are where you spend most of your time unless you're fixing design errors or similar.

  1. When cargo can’t do the job anymore, build with Bazel. You really have to reach hundreds of crates and possibly 500k LoC to get there, but if you do Bazel is your life safer.

See the following post for details.

At that level, remote builds may make sense.

Also, when the final deliverables are Docker containers, you may suffer again from slow container builds in release mode. That happened to me more than once.
In this case, an Earthly build might be your best option. You can use it in tandem with either Cargo or Bazel.

I ended up wrapping Bazel and the other tools in Bash scripts and a Makefile. Basically, the dev workflow is then something like:

make build
make test
make release

To this day, all my projects have a Makefile to abstract over the exact build and test tools so that I can replace them without breaking the workflow. Also, this makes switching projects really fast and easy regardless of what's running under the hood.

If, for any reason, you don't like Bazel, that's cool. No objection. It's just my experience from scaling Golang and, more recently, Rust projects that Bazel saved the day when everything else hit the wall. The good news is that Cargo holds strong for a long time.

  3. Consistent linting, code formatting and code standards become so much more important as things grow. For example, I use template scripts to generate scaffolded components and microservices that are all identically structured, mainly to reduce the cognitive overhead when going through tons of code. Again, define a good process, make it a script, and run it either with make or with Git hooks or GitHub Actions.

In one fast-growing Golang project, I put the component template in a git repo for versioning purposes and used a script that pulled the repo, customized the component based on parameters passed to the script, and then generated the missing Bazel files to ensure the entire mono-repo could build immediately after generating a new standard component from the template. I don't do this in Rust these days because I'm using a much simpler component model.

That said, standards definitely help to reduce complexity and parametric templates enforce standards while giving enough wiggle room for customization.

On that topic, I tend to stay clear of custom macros as much as possible, mainly because of the painful experience that, if you aren't paying meticulous attention, your build times tank rapidly. Optimizing macros for compile time is possible, indeed, but you have to ask whether it's really time well spent or whether a dumb but really fast code template can do the job as well.

However, there is no right or wrong way to scale up your code base. These three practices have stood my test of time, but I'm not overly attached to any particular tool. Can Bazel drive you nuts when they decide to replace the WORKSPACE file and ask you to migrate that monstrosity? Sure enough. Can Bash scripts be a pain to debug at times? Absolutely. You always add pipefail. Don't you?

But consider the alternatives carefully: when your code base grows another 10x, all of a sudden the equation changes dramatically. Then a carefully crafted and streamlined development flow pays big dividends as you move forward.

Lastly, embrace AI coding assistance whenever it makes sense. For me, Cody implements most of the standard traits and overall does a good job.

But hey, it’s my opinion, you do you, and use what works best for you, your team, and your organization.

4 Likes

I read that post. I like a flat crate structure. I like fast build times. However, I don't see the connection between a flat crate layout and fast build times. How are they connected? [I understand the importance of a flat crate dependency tree, but not a flat crate layout on the file system.]

I am interested in hearing what a 'Rust component' means here, as it seems key to your workflow.