Soft question: scaling codebase 50k loc -> 500k loc

matklad · December 21, 2023, 11:47am

That's a tricky question to answer: I would say that large refactors of internals are possible, but painful, while the overall architecture is pretty much set in stone. But if you ask me where the boundary is between a large refactor and re-architecting, I would say "well, exactly there where you can't do changes anymore".

So let me give you some principles which could no longer be changed in the current code, but which could have been coded differently from the beginning:

single-version principle --- rust-analyzer always has a single snapshot of code at a time; time is modeled by changing this snapshot wholesale. That's not the only way to do it. RLS treated code as mutable, and Roslyn allows holding several different snapshots at the same time, as everything is immutable.
lazy-analysis principle --- rust-analyzer is secretly rust-avoid-analysis-at-any-cost; it intentionally knows only a subset of things about the codebase. So, e.g., when you do "find usages" in rust-analyzer, what happens is not that rust-analyzer looks into use-def chains it got while "compiling" the code, but rather that it runs a heuristic text-based search (Find Usages), and then uses lazy analysis to prune out false positives (that's why searching for `new` is way slower than searching for `frobnicate`).

An alternative here is for a language server to maintain a fully complete view of the code base (something which might be desired to push all the way towards incremental binary patching and live reload).
as-if-analysis-is-complete principle --- the laziness is abstracted away. All IDE features are built on top of a model which looks as if a completely compiled version of a snapshot of the source code is available. An alternative would be more explicit phasing in the IDE parts, where you don't just get the info, but schedule specific computations to run.
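To make the "text search first, lazy analysis second" shape of find usages concrete, here is a toy sketch. It is not rust-analyzer's code: `Snapshot`, `Occurrence`, `text_search`, `refers_to_target`, and `demo_snapshot` are hypothetical names, and the "expensive" check is a stand-in (whole-identifier match outside a `//` comment) for real name resolution.

```rust
use std::collections::HashMap;

/// One immutable snapshot of the code: file name -> file text.
/// Hypothetical stand-in for a real analysis database.
struct Snapshot {
    files: HashMap<&'static str, &'static str>,
}

/// A candidate occurrence found by plain text search.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Occurrence {
    file: &'static str,
    offset: usize,
}

/// Phase 1: cheap textual search, no semantic analysis at all.
fn text_search(snap: &Snapshot, name: &str) -> Vec<Occurrence> {
    let mut hits = Vec::new();
    for (&file, &text) in &snap.files {
        for (offset, _) in text.match_indices(name) {
            hits.push(Occurrence { file, offset });
        }
    }
    hits.sort_by_key(|o| (o.file, o.offset));
    hits
}

/// Phase 2: the "expensive" check, run only on candidates.
/// Toy stand-in for name resolution: accept a hit only if it is a whole
/// identifier and does not sit inside a `//` line comment.
fn refers_to_target(snap: &Snapshot, occ: Occurrence, name: &str) -> bool {
    let text = snap.files[occ.file];
    let is_ident = |c: char| c.is_alphanumeric() || c == '_';
    let before = text[..occ.offset].chars().next_back();
    let after = text[occ.offset + name.len()..].chars().next();
    let whole_word = !before.map_or(false, is_ident) && !after.map_or(false, is_ident);
    let line_start = text[..occ.offset].rfind('\n').map_or(0, |i| i + 1);
    let in_comment = text[line_start..occ.offset].contains("//");
    whole_word && !in_comment
}

/// Returns the confirmed usages and the number of candidates that had to
/// go through the expensive check.
fn find_usages(snap: &Snapshot, name: &str) -> (Vec<Occurrence>, usize) {
    let candidates = text_search(snap, name);
    let checked = candidates.len();
    let usages: Vec<Occurrence> = candidates
        .into_iter()
        .filter(|&occ| refers_to_target(snap, occ, name))
        .collect();
    (usages, checked)
}

/// A tiny one-file snapshot used for the demonstration.
fn demo_snapshot() -> Snapshot {
    let text = "fn frobnicate() {}\n// frobnicate is new\nfn main() { frobnicate(); renew(); }";
    Snapshot { files: HashMap::from([("a.rs", text)]) }
}

fn main() {
    let snap = demo_snapshot();
    for name in ["frobnicate", "new"] {
        let (usages, checked) = find_usages(&snap, name);
        println!("{name}: {checked} candidates checked, {} usages kept", usages.len());
    }
}
```

The cost of a query in this model scales with the number of textual candidates, which is the intuition behind a common name being slower to search for than a rare one.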

In contrast, here are some tactical things which are feasible to change:

migrate the type checker to a library shared with rustc
upgrade salsa from "sea of Arcs" to "array with indexes" version
maybe change cancellation from unwinding to explicit results or async, but without removing support for cancellation altogether
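For the cancellation item, the two styles can be contrasted in a few lines of standard-library Rust. This is only a sketch under assumptions: `Cancelled`, `CancelFlag`, `analyze_unwinding`, `analyze_explicit`, and `catch_cancelled` are made-up names, and the "analysis" is just a sum over a slice.

```rust
use std::panic::{self, AssertUnwindSafe};
use std::sync::atomic::{AtomicBool, Ordering};

/// Marker value signalling cancellation (hypothetical name).
#[derive(Debug, PartialEq)]
struct Cancelled;

/// Shared flag that a newer request would set to cancel in-flight work.
struct CancelFlag(AtomicBool);

impl CancelFlag {
    /// Unwinding style: deep code calls this at checkpoints and never has
    /// to thread an error type through its signatures.
    fn unwind_if_cancelled(&self) {
        if self.0.load(Ordering::SeqCst) {
            // `resume_unwind` starts unwinding without invoking the panic hook.
            panic::resume_unwind(Box::new(Cancelled));
        }
    }

    /// Explicit style: every checkpoint returns a `Result` to propagate with `?`.
    fn check(&self) -> Result<(), Cancelled> {
        if self.0.load(Ordering::SeqCst) {
            Err(Cancelled)
        } else {
            Ok(())
        }
    }
}

/// Cancellation is invisible in the signature.
fn analyze_unwinding(flag: &CancelFlag, items: &[u32]) -> u32 {
    let mut sum = 0;
    for &x in items {
        flag.unwind_if_cancelled();
        sum += x;
    }
    sum
}

/// Cancellation is part of the signature.
fn analyze_explicit(flag: &CancelFlag, items: &[u32]) -> Result<u32, Cancelled> {
    let mut sum = 0;
    for &x in items {
        flag.check()?;
        sum += x;
    }
    Ok(sum)
}

/// Boundary for the unwinding style: turn a `Cancelled` unwind back into a
/// value, and let any other panic keep propagating.
fn catch_cancelled<T>(f: impl FnOnce() -> T) -> Result<T, Cancelled> {
    match panic::catch_unwind(AssertUnwindSafe(f)) {
        Ok(value) => Ok(value),
        Err(payload) => match payload.downcast::<Cancelled>() {
            Ok(_) => Err(Cancelled),
            Err(other) => panic::resume_unwind(other),
        },
    }
}

fn main() {
    let flag = CancelFlag(AtomicBool::new(false));
    println!("{:?}", catch_cancelled(|| analyze_unwinding(&flag, &[1, 2, 3])));
    flag.0.store(true, Ordering::SeqCst);
    println!("{:?}", catch_cancelled(|| analyze_unwinding(&flag, &[1, 2, 3])));
    println!("{:?}", analyze_explicit(&flag, &[1, 2, 3]));
}
```

Both variants keep cancellation itself; what changes is whether it shows up in every function signature along the way or only at the outer boundary.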

While we are at it, a related story about an org chart:

github.com/rust-lang/rust-analyzer: Replace `TokenMap` with an abstraction that matches reality (by matklad; opened 07:48AM - 25 Jun 21 UTC; closed 08:37PM - 04 Dec 23 UTC; labels: E-hard fun C-Architecture)

AKA, @matklad has been misunderstanding how macro expansion works this whole time.

Background: originally, I thought about the macro expansion process as transforming a stream of tokens into a different stream of tokens:

```rust
macro_rules! id {
    ($($id:tt)*) => {$($id)*}
}

fn main() {
    let foo = 92;
    id!(foo)
}
```

Here, I thought that token `foo` gets translated from macro call site to macro expansion site. This motivated the `TokenMap` and related abstractions. The idea is that we assign ids to tokens (=tokens have identity), and track those ids through macro expansion.

Yesterday, having looked at https://doc.rust-lang.org/stable/proc_macro/struct.Span.html, I concluded that this is not, in fact, how the world works. Consider these two procedural macros:

```rust
use proc_macro::{Group, Ident, Punct, TokenStream, TokenTree};

// Passes the input tokens through untouched.
#[proc_macro]
pub fn id(args: TokenStream) -> TokenStream {
    args
}

// Rebuilds every token from scratch, copying only its content and span.
#[proc_macro]
pub fn id2(args: TokenStream) -> TokenStream {
    clone_stream(args)
}

fn clone_stream(ts: TokenStream) -> TokenStream {
    ts.into_iter().map(clone_tree).collect()
}

fn clone_tree(t: TokenTree) -> TokenTree {
    match t {
        TokenTree::Group(orig) => {
            let mut new = Group::new(orig.delimiter(), clone_stream(orig.stream()));
            new.set_span(orig.span());
            TokenTree::Group(new)
        }
        TokenTree::Ident(orig) => TokenTree::Ident(Ident::new(&orig.to_string(), orig.span())),
        TokenTree::Punct(orig) => {
            let mut new = Punct::new(orig.as_char(), orig.spacing());
            new.set_span(orig.span());
            TokenTree::Punct(new)
        }
        TokenTree::Literal(orig) => { ... },
    }
}
```

I believe their semantics is the same -- from rustc's point of view, they produce equivalent outputs. The implementation of `id2` completely erases identity though.

So, bad news, we need to rewrite TokenMap-based stuff to use something else (and I don't know what that something else would be). Good news -- I think this should make more weird cases like `include` work in a more out-of-the-box way perhaps?

cc @jonas-schievink , @edwin0cheng
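The difference between `id` and `id2` in the quoted text can be modelled without the `proc_macro` API. The sketch below is a toy model under stated assumptions: `Span`, `Token`, `IdGen`, and `observable` are invented here for illustration and are neither the real `proc_macro` types nor rust-analyzer's data structures.

```rust
/// A source range; in this model it is what survives a rebuild of a token.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Span {
    start: u32,
    end: u32,
}

/// Toy token: `id` models the "tokens have identity" idea, `span` models
/// the location information a token carries.
#[derive(Debug, Clone, PartialEq)]
struct Token {
    id: u32,
    text: String,
    span: Span,
}

/// Hands out fresh ids, as a tokenizer that tracks identity might.
struct IdGen(u32);

impl IdGen {
    fn token(&mut self, text: &str, span: Span) -> Token {
        self.0 += 1;
        Token { id: self.0, text: text.to_string(), span }
    }
}

/// Like `id` in the text: passes the very same tokens through.
fn id(args: Vec<Token>) -> Vec<Token> {
    args
}

/// Like `id2` in the text: rebuilds every token from its text and span,
/// so each output token gets a fresh identity.
fn id2(ids: &mut IdGen, args: Vec<Token>) -> Vec<Token> {
    args.iter().map(|t| ids.token(&t.text, t.span)).collect()
}

/// What an observer sees if identity is ignored: text plus span only.
fn observable(tokens: &[Token]) -> Vec<(String, Span)> {
    tokens.iter().map(|t| (t.text.clone(), t.span)).collect()
}

fn main() {
    let mut ids = IdGen(0);
    let input = vec![ids.token("foo", Span { start: 40, end: 43 })];
    let out1 = id(input.clone());
    let out2 = id2(&mut ids, input.clone());
    println!("id keeps the token id:  {}", out1[0].id == input[0].id);
    println!("id2 keeps the token id: {}", out2[0].id == input[0].id);
    println!("same text and spans:    {}", observable(&out1) == observable(&out2));
}
```

In this model a mapping keyed on token ids breaks for `id2`, while one keyed on spans treats both macros the same way.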

This was a huge architectural bug in rust-analyzer. It was fixed recently through the heroic work of @Veykril, but, as you can see, it took us years to do something about the thing which is very clearly wrong, and wrong in a viral way (everything building on top of this wrong abstraction is also wrong).

But what's most curious here is the social aspect. The first-order technical story here is that @matklad just didn't get how macros in Rust actually work back when the infra for macro expansion was coded for rust-analyzer. I implemented what I imagined to be the way macros work, but that was incorrect, and it took me some years to recognize that. Which is OK --- compilers are hard, I am of limited smartness, mistakes are being made all the time, 64k should be enough for everybody.

What is really curious is that I identified "I don't know how macro expansion works" as a core risk from the very beginning. You can read about that in the very first paragraph that announced the thing that was to become rust-analyzer: RFC: libsyntax2.0 by matklad (Pull Request #2256, rust-lang/rfcs). I also recall specifically trying to get at this question of macro expansion at the second Rust all-hands in Berlin (really, Rust was able to fit in a single (big) room in those days!). But, like, it is literally impossible to transfer the knowledge between the two code-bases (rustc and rust-analyzer), unless there's someone who works to a large capacity in both. Both sides might be very much willing to talk shop and share all the knowledge they have, but the knowledge doesn't actually register until you go and start solving the problems yourself.

EDIT: to clarify, yes, all those aspects (and many other similar) were decided within the first 10k lines. I would even say before the first real line was written --- rust-analyzer is pretty much an execution of a design I arrived at somewhere in 2016 I guess? The macros again make an interesting study --- the actual code for macro expansion was written relatively late, I think past the 10k lines mark, certainly after basic type inference. But those 10k lines were determined, in a significant part, by the macro expansion code that was yet to be written!
