Reducing compile time for generated data

Hello,

I have a large-ish (230K lines) auto-generated file containing mostly data. (Anonymized source.)

Attached is the -Z time-passes output.

Except for Derefer, which I've figured out I can almost eliminate by doing s/&LINKS/LINKS, is there any other low-hanging fruit?
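(For context, roughly what that substitution looks like; LINKS here is a made-up stand-in for one of the generated tables:)

```rust
static LINKS: [&str; 3] = ["a", "b", "c"];

// Before: the generator emitted borrows of the statics (e.g. as map
// values), leaving an extra deref for the Derefer MIR pass on every use:
static ENTRY: &[&str; 3] = &LINKS;

// After s/&LINKS/LINKS: the static is named directly, with no indirection:
fn first() -> &'static str {
    LINKS[0]
}
```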

(I realize that macro expansion is probably not something I can reduce, since PHF performs some heavy computation.)

Is there a way to generate this code so as to reduce, for example, MIR_borrow_checking (~3s)?
Is there a way to tell rustc to skip linting completely for this file? (module_lints + lint_checking = ~4s)

misc_checking_3 sounds vague, but from what I can tell from the code, it's just lint_checking in a trenchcoat.

❯ cargo +nightly rustc -- -Z time-passes
   Compiling proc-macro2 v1.0.60
   Compiling siphasher v0.3.10
   Compiling quote v1.0.28
   Compiling unicode-ident v1.0.9
   Compiling syn v1.0.109
   Compiling rand_core v0.6.4
   Compiling rand v0.8.5
   Compiling phf_shared v0.11.1
   Compiling phf_generator v0.11.1
   Compiling phf_macros v0.11.1
   Compiling phf v0.11.1
   Compiling long_compile v0.1.0 (/home/dn/repos/long_compile_reproduce)
time:   0.001; rss:   38MB ->   41MB (   +3MB)	parse_crate
time:   0.000; rss:   41MB ->   41MB (   +0MB)	blocked_on_dep_graph_loading
time:   0.000; rss:   41MB ->   43MB (   +2MB)	setup_global_ctxt
time:   2.006; rss:   45MB ->  206MB ( +161MB)	expand_crate
time:   2.007; rss:   45MB ->  206MB ( +161MB)	macro_expand_crate
time:   0.007; rss:  206MB ->  206MB (   +0MB)	AST_validation
time:   0.001; rss:  206MB ->  208MB (   +1MB)	finalize_macro_resolutions
time:   0.108; rss:  208MB ->  232MB (  +24MB)	late_resolve_crate
time:   0.006; rss:  232MB ->  232MB (   +0MB)	resolve_check_unused
time:   0.012; rss:  232MB ->  232MB (   +0MB)	resolve_postprocess
time:   0.128; rss:  206MB ->  232MB (  +25MB)	resolve_crate
time:   0.006; rss:  232MB ->  232MB (   +0MB)	prepare_outputs
time:   0.006; rss:  232MB ->  232MB (   +0MB)	complete_gated_feature_checking
time:   0.021; rss:  296MB ->  279MB (  -17MB)	drop_ast
time:   0.312; rss:  232MB ->  262MB (  +30MB)	looking_for_derive_registrar
time:   0.377; rss:  232MB ->  263MB (  +31MB)	misc_checking_1
time:   0.035; rss:  263MB ->  263MB (   +0MB)	type_collecting
time:   0.018; rss:  263MB ->  290MB (  +27MB)	coherence_checking
time:   0.086; rss:  290MB ->  292MB (   +2MB)	wf_checking
time:   1.245; rss:  292MB ->  340MB (  +48MB)	item_types_checking
time:   1.398; rss:  263MB ->  348MB (  +86MB)	type_check_crate
time:   0.019; rss:  354MB ->  358MB (   +3MB)	PromoteTemps
time:   0.000; rss:  409MB ->  409MB (   +0MB)	PromoteTemps
time:   0.000; rss:  409MB ->  410MB (   +0MB)	PromoteTemps
time:   0.000; rss:  410MB ->  410MB (   +0MB)	PromoteTemps
time:   0.000; rss:  410MB ->  411MB (   +0MB)	PromoteTemps
time:   0.000; rss:  413MB ->  413MB (   +0MB)	PromoteTemps
time:   0.000; rss:  415MB ->  415MB (   +0MB)	PromoteTemps
time:   0.000; rss:  415MB ->  416MB (   +0MB)	PromoteTemps
time:   0.000; rss:  416MB ->  416MB (   +0MB)	PromoteTemps
time:   0.000; rss:  419MB ->  420MB (   +0MB)	PromoteTemps
time:   0.000; rss:  420MB ->  420MB (   +0MB)	PromoteTemps
time:   0.000; rss:  420MB ->  420MB (   +0MB)	PromoteTemps
time:   0.000; rss:  423MB ->  423MB (   +0MB)	PromoteTemps
time:   0.000; rss:  425MB ->  425MB (   +0MB)	PromoteTemps
time:   0.000; rss:  427MB ->  427MB (   +0MB)	PromoteTemps
time:   0.000; rss:  428MB ->  428MB (   +0MB)	PromoteTemps
time:   0.000; rss:  428MB ->  428MB (   +0MB)	PromoteTemps
time:   0.000; rss:  432MB ->  432MB (   +0MB)	PromoteTemps
time:   0.018; rss:  504MB ->  504MB (   +0MB)	PromoteTemps
time:   3.363; rss:  348MB ->  588MB ( +240MB)	MIR_borrow_checking
time:   0.003; rss:  588MB ->  589MB (   +1MB)	ElaborateDrops
time:   5.297; rss:  582MB ->  562MB (  -20MB)	Derefer
time:   0.131; rss:  562MB ->  564MB (   +2MB)	ElaborateDrops
time:   6.258; rss:  588MB ->  564MB (  -24MB)	MIR_effect_checking
time:   0.010; rss:  565MB ->  565MB (   +0MB)	crate_lints
time:   0.004; rss:  567MB ->  566MB (   -1MB)	Derefer
time:   0.015; rss:  566MB ->  573MB (   +7MB)	ElaborateDrops
time:   0.000; rss:  565MB ->  565MB (   +0MB)	SimplifyCfg-early-opt
time:   0.000; rss:  565MB ->  566MB (   +0MB)	ElaborateDrops
time:   0.000; rss:  566MB ->  566MB (   +0MB)	ElaborateDrops
time:   0.000; rss:  581MB ->  581MB (   +0MB)	SimplifyCfg-early-opt
time:   1.968; rss:  565MB ->  653MB (  +88MB)	module_lints
time:   1.978; rss:  565MB ->  653MB (  +88MB)	lint_checking
time:   0.092; rss:  653MB ->  653MB (   +0MB)	privacy_checking_modules
time:   2.099; rss:  564MB ->  653MB (  +89MB)	misc_checking_3
time:   0.020; rss:  650MB ->  651MB (   +1MB)	monomorphization_collector_graph_walk
time:   0.186; rss:  653MB ->  653MB (   +0MB)	generate_crate_metadata
time:   0.175; rss:  653MB ->  697MB (  +44MB)	codegen_to_LLVM_IR
time:   0.179; rss:  653MB ->  698MB (  +45MB)	codegen_crate
time:   0.000; rss:  698MB ->  699MB (   +0MB)	encode_query_results_for(hir_module_items)
time:   0.000; rss:  699MB ->  700MB (   +1MB)	encode_query_results_for(generics_of)
time:   0.086; rss:  700MB ->  736MB (  +36MB)	encode_query_results_for(mir_for_ctfe)
time:   0.000; rss:  736MB ->  736MB (   +0MB)	encode_query_results_for(optimized_mir)
time:   0.047; rss:  736MB ->  760MB (  +24MB)	encode_query_results_for(promoted_mir)
time:   0.000; rss:  760MB ->  760MB (   +0MB)	encode_query_results_for(explicit_predicates_of)
time:   0.000; rss:  760MB ->  760MB (   +0MB)	encode_query_results_for(unsafety_check_result)
time:   0.018; rss:  760MB ->  770MB (  +10MB)	encode_query_results_for(typeck)
time:   0.002; rss:  770MB ->  771MB (   +1MB)	encode_query_results_for(eval_to_allocation_raw)
time:   0.001; rss:  771MB ->  771MB (   +0MB)	encode_query_results_for(symbol_name)
time:   0.000; rss:  771MB ->  771MB (   +0MB)	encode_query_results_for(def_span)
time:   0.001; rss:  771MB ->  773MB (   +1MB)	encode_query_results_for(codegen_fn_attrs)
time:   0.002; rss:  773MB ->  774MB (   +1MB)	encode_query_results_for(specialization_graph_of)
time:   0.000; rss:  774MB ->  775MB (   +1MB)	encode_query_results_for(has_ffi_unwind_calls)
time:   0.000; rss:  775MB ->  775MB (   +0MB)	encode_query_results_for(unused_generic_params)
time:   0.164; rss:  698MB ->  775MB (  +77MB)	encode_query_results
time:   0.228; rss:  698MB ->  811MB ( +113MB)	incr_comp_serialize_result_cache
time:   0.228; rss:  698MB ->  811MB ( +113MB)	incr_comp_persist_result_cache
time:   0.228; rss:  698MB ->  811MB ( +113MB)	serialize_dep_graph
time:   0.101; rss:  811MB ->  564MB ( -247MB)	free_global_ctxt
time:   0.987; rss:  689MB ->  524MB ( -165MB)	LLVM_passes
time:   0.000; rss:  516MB ->  516MB (   +0MB)	join_worker_thread
time:   0.000; rss:  514MB ->  508MB (   -6MB)	copy_all_cgu_workproducts_to_incr_comp_cache_dir
time:   0.645; rss:  564MB ->  508MB (  -56MB)	finish_ongoing_codegen
time:   0.000; rss:  469MB ->  469MB (   +0MB)	incr_comp_finalize_session_directory
time:   0.104; rss:  465MB ->  403MB (  -62MB)	link_rlib
time:   0.522; rss:  403MB ->  403MB (   +0MB)	run_linker
time:   0.631; rss:  466MB ->  403MB (  -63MB)	link_binary
time:   0.645; rss:  466MB ->  366MB ( -100MB)	link_crate
time:   1.292; rss:  564MB ->  366MB ( -198MB)	link
time:  17.665; rss:   29MB ->   90MB (  +60MB)	total
    Finished dev [unoptimized + debuginfo] target(s) in 22.12s

Thanks

230'000 lines tells me that this data most likely shouldn't be stored as raw Rust source.

Are you sure you didn't want at least an efficient binary serialization format (e.g. bincode, CBOR, BSON), or better yet, an actual database?
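For example, a minimal sketch with bincode 1.x (Record is a made-up stand-in for your generated data):

```rust
use serde::{Deserialize, Serialize};

// Made-up stand-in for one of the generated records.
#[derive(Serialize, Deserialize, PartialEq, Debug)]
struct Record {
    id: u32,
    name: String,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let table = vec![Record { id: 1, name: "a".into() }];

    // Serialize once, at generation time, instead of emitting Rust source...
    std::fs::write("table.bin", bincode::serialize(&table)?)?;

    // ...and deserialize at startup instead of compiling the data in.
    let loaded: Vec<Record> = bincode::deserialize(&std::fs::read("table.bin")?)?;
    assert_eq!(loaded, table);
    Ok(())
}
```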

4 Likes

I think there's value in representing such metadata as code, because it is very easy to consume, and I can ensure it is not duplicated in memory by compiling it into a shared object. (Or at least I could in C; I'm still not sure whether I can ensure this in Rust.)

Representing this in BSON means that I have to pay O(N) memory (where N is the number of processes), which is something I am hesitant to do, because that duplication is one of the problems I'm trying to solve.

I do gather from your comment that we might have hit a wall WRT what we can do in Rust code, so it might be easier for me to generate C/C++ and wrap the data in Rust.

Thanks! :slight_smile:

If you have N records of static data in your Rust code, that's also going to occupy O(N) space. So I don't understand what you are getting at. The data has to be stored somewhere, after all – if you scatter it between the TEXT and RODATA sections of a binary (for example), then it's still going to consume the same amount of memory, so you don't reduce memory footprint, you are simply putting it elsewhere.

Also, using a real database or a serialization format would allow you to compress the data, and only incrementally decompress it at runtime. That's not possible with literal code.
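(E.g., a sketch with flate2: decompression happens incrementally as bytes are pulled through the reader, so the full decompressed table is never materialized at once.)

```rust
use std::fs::File;
use std::io::{BufReader, Read};

use flate2::read::GzDecoder;

fn main() -> std::io::Result<()> {
    // Stream-decompress the (hypothetical) compressed table.
    let mut reader = BufReader::new(GzDecoder::new(File::open("table.bin.gz")?));

    // Pull one hypothetical fixed-size record; only the decompression
    // window is resident, not the whole data set.
    let mut record = [0u8; 8];
    reader.read_exact(&mut record)?;
    Ok(())
}
```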

No, not at all, and C and C++ compilation is slow, too. I answered what I answered because I think it would be a much more fruitful idea to use the right tool for the job instead of trying to micro-optimize something that wasn't meant to be used for this purpose.

2 Likes

FWIW, I recently tried to do something similar, in that I generate a table of data and wanted to compile it to Rust code so I could do lookups efficiently.

A generated file of about 60k SLOC took forever to compile, so I quickly gave up on that idea. Now I just serialize the data using serde and some renaming tricks to compress type and field names, and it's working well enough for my purposes.
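The renaming trick, roughly (field names are illustrative):

```rust
use serde::{Deserialize, Serialize};

// Self-describing formats store key names per record, so shortening
// them via serde renames shrinks the serialized data considerably.
#[derive(Serialize, Deserialize)]
struct Entry {
    #[serde(rename = "n")]
    name: String,
    #[serde(rename = "v")]
    value: u64,
}
```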

So perhaps serialization is something to consider :slight_smile:

4 Likes

Sorry, I wasn't clear.
My plan is to put it in a dylib, counting on it being placed on pages that are shared between multiple processes.

If I serialize this to BSON (as an example), I could also assume that the serialized BSON is shared, but I will have to pay per-process for deserializing it.

Both deserialization and DB access will introduce latency and extra allocations I would like to avoid.

Compression is a neat idea, but the data is not singularly big (the generated SO is only single-digit MiB), so compressing it after putting it in a DB might not be that beneficial for me.

My main concerns are making sure that latency is close to nothing, and that the memory cost I'll have to pay is not affected by the number of processes using this metadata, which is expected to be large.

It's possible that I'm exaggerating, but to me, turning what is today an array-indexing operation into a context switch sounds like a cost I might not want to pay.

I mentioned C/C++ simply because we already have similarly sized metadata SOs that are faster to compile.
I.e. it's already a "proven" concept and something that's very easy for us to consume.
(Not insinuating that C++ is faster to compile in general, btw, just referring to this specific use case)

I do appreciate you trying to get me out of this box :).
It's not impossible that, once we have more usage data and benchmarks, we'll implement one of your suggestions, but I'm going to insist that, at least for now, it's not an XY problem.

When I did mostly C programming, I'd occasionally pull some linker tricks for this sort of thing: I'd write my table in #[repr(C)] format to a file, and then use objcopy to turn it into a .o file with one exported symbol. My code could then access it like a normal array via an extern static.

I don't see any reason why you couldn't pull a similar trick with Rust: You can use include_bytes!() to import the data file and bytemuck to cast it to your real data structure, but you'll need to be careful about memory alignment (and maybe endianness).
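A rough sketch of the Rust side (untested; Record stands in for your real record layout, and bytemuck needs its derive feature enabled):

```rust
use bytemuck::{Pod, Zeroable};

// Placeholder for the real record layout; it must be #[repr(C)] and
// free of padding for the Pod derive to accept it.
#[derive(Clone, Copy, Pod, Zeroable)]
#[repr(C)]
struct Record {
    id: u32,
    flags: u32,
}

// Wrap the included bytes so they land on an 8-byte boundary;
// include_bytes!() alone only guarantees alignment 1.
#[repr(C, align(8))]
struct Aligned<T: ?Sized>(T);

static RAW: &Aligned<[u8]> = &Aligned(*include_bytes!("records.bin"));

fn records() -> &'static [Record] {
    // try_cast_slice verifies size and alignment at runtime.
    bytemuck::try_cast_slice(&RAW.0).expect("records.bin has the wrong layout")
}
```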

7 Likes

There is already non-negligible latency in loading and dynamically linking a shared library (which is also O(size of library)). You can't achieve near zero-latency dynamic loading.

You can't guarantee that with dynamic libraries, either. They may or may not actually be shared by the OS; and data is especially problematic, because code is always assumed to be immutable, but data isn't.

Also, I sense some incoherence in your argument. First you assert that the data is not big, but then you argue that it's too big so you don't want to copy it. You'll have to decide which one it is.

2 Likes

Dynamic loading happens once, at process startup. I don't mind paying for it.
Accessing the data happens all the time during the lifetime of the process.
It's not that I don't want to pay anything; I am aware that bytes do not appear out of thin air.
I simply want to pay for them in a specific place and time.

The data set itself is not huge; my problem is its duplication.
I want to deduplicate using an SO (which I gather is not guaranteed).

You suggest deduplicating using a DB.
Either way, the entire data set appears only once in the system, so in this specific scenario, compressing it is unnecessary.

I don't see it as incoherence; maybe just a miscommunication on my part.

Anyway, I think I understood the bottom line, which is that there's not much I can do about the compile time, and that I should redesign this.

Thank you all, and have a nice day

Databake takes exactly the same approach to storing data with low overhead. Its authors may have already encountered similar issues.

3 Likes

But in your previous post, you said that it worries you because there will be many processes.

You can do the same thing with deserialization: just put the deserialization code in main(), or in a static Lazy<YourDataStructure>.
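For example (a sketch assuming the once_cell crate and a bincode-encoded file; Record is a placeholder type):

```rust
use once_cell::sync::Lazy;
use serde::Deserialize;

// Placeholder for the real data structure.
#[derive(Deserialize)]
struct Record {
    id: u32,
    name: String,
}

// Deserialized once per process, on first access.
static TABLE: Lazy<Vec<Record>> = Lazy::new(|| {
    let bytes = std::fs::read("table.bin").expect("missing data file");
    bincode::deserialize(&bytes).expect("corrupt data file")
});
```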

1 Like

That's a pretty cool idea, I'll admit.
And if the target arch were a given, I'd seriously consider it.

But given the need for flexibility in deployment, the downside you mention would be a real headache to deal with.
In addition, the current solution is good enough: instantiating the data set is a matter of milliseconds.

2 Likes

But in your previous post, you said that it worries you because there will be many processes.

I apologize, I just don't see the contradiction.
I have many long-running processes.
Assuming page-sharing works (again, I now understand I cannot count on it), I do not mind startup taking a bit more time.

You can do the same thing with deserialization: just put the deserialization code in main(), or in a static Lazy<YourDataStructure>.

Which is definitely an option that is easy to implement, but doesn't satisfy any of the concerns I am trying to address.


If anything, I'm interested in the ways in which "you can't guarantee that with dynamic libraries, either".
Might my data not end up in RODATA, or do I simply have no guarantee that RODATA is shared?

I think that we have all of the tools available to encapsulate the ugliness in a utility crate:

  • #[cfg(target_endian = ...)] can detect whether or not we need to fix up the byte order (see the sketch below)
  • bytemuck provides tools to check the safety of casting between &[u8] and &[T]
  • There are ways to make the compiler place the included bytes at a specific alignment.

Maybe I'll get around to pulling this all together at some point...
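For the endianness bullet, the detection can even be a hard error at compile time (a sketch, assuming the generator emits little-endian data):

```rust
// Refuse to build on targets whose byte order doesn't match the
// generated file, rather than fixing it up at runtime.
#[cfg(target_endian = "big")]
compile_error!("data file is generated little-endian; add a byte-order fixup for this target");
```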

1 Like
