Unusually slow compilation on Windows

This single large file (a test case I created to rule out file system bottlenecks) takes roughly 40s to compile (cargo build --release) on Win11, and 20s on the same machine in WSL2 running Ubuntu. A friend confirmed the 2x speed difference on his system and also ruled out Windows Defender as culprit by deactivating it. What gives?

Could you try it on nightly with cargo rustc --release -- -Ztime-passes on Windows and in WSL2? That should give a breakdown of where most time is spent.

sure! Nightly seems a fair bit faster already, but still WSL wins by a substantial margin.

Windows:

cargo 1.78.0-nightly (cdf84b69d 2024-02-02)

   Compiling whyslow v0.1.0 (C:\tmp\whyslow)
time:   0.036; rss:   15MB ->   39MB (  +24MB)  parse_crate
time:   0.000; rss:   39MB ->   39MB (   +1MB)  setup_global_ctxt
time:   0.000; rss:   40MB ->   40MB (   +0MB)  crate_injection
time:   0.447; rss:   40MB ->  128MB (  +89MB)  expand_crate
time:   0.447; rss:   40MB ->  128MB (  +89MB)  macro_expand_crate
time:   0.000; rss:  128MB ->  128MB (   +0MB)  maybe_building_test_harness
time:   0.018; rss:  128MB ->  128MB (   +0MB)  AST_validation
time:   0.000; rss:  128MB ->  128MB (   +0MB)  maybe_create_a_macro_crate
time:   0.001; rss:  128MB ->  130MB (   +1MB)  finalize_imports
time:   0.001; rss:  130MB ->  130MB (   +0MB)  compute_effective_visibilities
time:   0.000; rss:  130MB ->  130MB (   +0MB)  check_hidden_glob_reexports
time:   0.013; rss:  130MB ->  127MB (   -3MB)  finalize_macro_resolutions
time:   0.143; rss:  127MB ->  164MB (  +37MB)  late_resolve_crate
time:   0.014; rss:  164MB ->  164MB (   +0MB)  resolve_check_unused
time:   0.000; rss:  164MB ->  164MB (   +0MB)  resolve_report_errors
time:   0.027; rss:  164MB ->  164MB (   +0MB)  resolve_postprocess
time:   0.199; rss:  128MB ->  164MB (  +36MB)  resolve_crate
time:   0.014; rss:  146MB ->  146MB (   +0MB)  write_dep_info
time:   0.014; rss:  146MB ->  146MB (   +0MB)  complete_gated_feature_checking
time:   0.047; rss:  216MB ->  162MB (  -54MB)  drop_ast
time:   0.342; rss:  146MB ->  140MB (   -6MB)  looking_for_entry_point
time:   0.001; rss:  140MB ->  140MB (   +0MB)  looking_for_derive_registrar
time:   0.000; rss:  152MB ->  152MB (   +0MB)  unused_lib_feature_checking
time:   0.463; rss:  146MB ->  152MB (   +5MB)  misc_checking_1
time:   0.121; rss:  152MB ->  186MB (  +35MB)  type_collecting
time:   0.037; rss:  186MB ->  195MB (   +9MB)  coherence_checking
time:   0.421; rss:  195MB ->  246MB (  +51MB)  wf_checking
time:   2.004; rss:  152MB ->  417MB ( +266MB)  type_check_crate
time:   2.550; rss:  417MB ->  710MB ( +292MB)  MIR_borrow_checking
time:   0.498; rss:  710MB ->  749MB (  +39MB)  MIR_effect_checking
time:   0.000; rss:  749MB ->  749MB (   +0MB)  abi_testing
time:   0.000; rss:  753MB ->  753MB (   +0MB)  crate_lints
time:   0.070; rss:  753MB ->  753MB (   +0MB)  module_lints
time:   0.070; rss:  753MB ->  753MB (   +0MB)  lint_checking
time:   0.113; rss:  753MB ->  754MB (   +1MB)  privacy_checking_modules
time:   0.000; rss:  754MB ->  754MB (   +0MB)  check_lint_expectations
time:   0.257; rss:  749MB ->  754MB (   +5MB)  misc_checking_3
time:   0.009; rss:  754MB ->  755MB (   +1MB)  monomorphization_collector_root_collections
time:   1.945; rss:  755MB ->  783MB (  +28MB)  monomorphization_collector_graph_walk
time:   0.045; rss:  783MB ->  791MB (   +8MB)  partition_and_assert_distinct_symbols
time:   0.000; rss:  791MB ->  791MB (   +0MB)  write_allocator_module
time:   0.000; rss:  791MB ->  791MB (   +0MB)  find_cgu_reuse
time:   0.262; rss:  791MB ->  836MB (  +44MB)  codegen_to_LLVM_IR
time:   2.263; rss:  754MB ->  836MB (  +82MB)  codegen_crate
time:   0.124; rss:  836MB ->  127MB ( -708MB)  free_global_ctxt
time:  23.974; rss:  836MB ->   68MB ( -767MB)  LLVM_passes
time:   0.000; rss:   68MB ->   68MB (   +0MB)  join_worker_thread
time:  23.850; rss:  127MB ->   68MB (  -59MB)  finish_ongoing_codegen
time:   0.000; rss:   67MB ->   67MB (   +0MB)  serialize_work_products
time:   0.760; rss:   69MB ->   69MB (   +0MB)  run_linker
time:   0.784; rss:   67MB ->   69MB (   +2MB)  link_binary
time:   0.784; rss:   67MB ->   69MB (   +2MB)  link_crate
time:  24.635; rss:  127MB ->   69MB (  -58MB)  link
time:  33.521; rss:   12MB ->   48MB (  +36MB)  total
    Finished release [optimized] target(s) in 33.66s

WSL2:

cargo 1.78.0-nightly (cdf84b69d 2024-02-02)

   Compiling whyslow v0.1.0 (/tmp/whyslow)
time:   0.025; rss:   39MB ->   63MB (  +24MB)  parse_crate
time:   0.006; rss:   63MB ->   67MB (   +4MB)  setup_global_ctxt
time:   0.266; rss:   67MB ->  178MB ( +111MB)  expand_crate
time:   0.266; rss:   67MB ->  178MB ( +111MB)  macro_expand_crate
time:   0.011; rss:  178MB ->  178MB (   +0MB)  finalize_macro_resolutions
time:   0.098; rss:  178MB ->  213MB (  +36MB)  late_resolve_crate
time:   0.012; rss:  213MB ->  213MB (   +0MB)  resolve_postprocess
time:   0.126; rss:  178MB ->  213MB (  +36MB)  resolve_crate
time:   0.009; rss:  269MB ->  269MB (   +0MB)  drop_ast
time:   0.199; rss:  213MB ->  251MB (  +38MB)  looking_for_entry_point
time:   0.317; rss:  213MB ->  254MB (  +40MB)  misc_checking_1
time:   0.112; rss:  254MB ->  274MB (  +20MB)  type_collecting
time:   0.100; rss:  274MB ->  288MB (  +14MB)  coherence_checking
time:   0.334; rss:  288MB ->  345MB (  +57MB)  wf_checking
time:   1.648; rss:  254MB ->  571MB ( +318MB)  type_check_crate
time:   1.804; rss:  571MB ->  960MB ( +389MB)  MIR_borrow_checking
time:   0.348; rss:  960MB ->  975MB (  +15MB)  MIR_effect_checking
time:   0.063; rss:  975MB ->  975MB (   +0MB)  module_lints
time:   0.064; rss:  975MB ->  975MB (   +0MB)  lint_checking
time:   0.095; rss:  975MB ->  975MB (   +0MB)  privacy_checking_modules
time:   0.210; rss:  975MB ->  975MB (   +0MB)  misc_checking_3
time:   0.008; rss:  975MB ->  975MB (   +0MB)  monomorphization_collector_root_collections
time:   1.934; rss:  975MB -> 1004MB (  +30MB)  monomorphization_collector_graph_walk
time:   0.037; rss: 1004MB -> 1006MB (   +1MB)  partition_and_assert_distinct_symbols
time:   0.035; rss: 1012MB -> 1016MB (   +4MB)  write_allocator_module
time:   0.216; rss: 1016MB -> 1030MB (  +14MB)  codegen_to_LLVM_IR
time:   2.236; rss:  975MB -> 1030MB (  +56MB)  codegen_crate
time:   0.030; rss: 1030MB ->  780MB ( -251MB)  free_global_ctxt
time:  16.320; rss: 1030MB ->  858MB ( -172MB)  LLVM_passes
time:   0.000; rss:  847MB ->  815MB (  -32MB)  join_worker_thread
time:  16.294; rss:  780MB ->  700MB (  -80MB)  finish_ongoing_codegen
time:   0.457; rss:  700MB ->  127MB ( -573MB)  run_linker
time:   0.471; rss:  700MB ->  127MB ( -573MB)  link_binary
time:   0.471; rss:  700MB ->  127MB ( -573MB)  link_crate
time:  16.765; rss:  780MB ->  127MB ( -653MB)  link
time:  23.804; rss:   30MB ->  105MB (  +75MB)  total
    Finished release [optimized] target(s) in 24.03s
1 Like

Looks like codegen is the biggest contributor. The rest goes from ~9s to ~7s when going from Windows to WSL2, which is not a lot of difference. The difference in codegen time is really big though, which is surprising to me. I would have expected the difference to be caused by the linker, reading source files or macro expansion. Those passes are doing a fair amount of IO, which Windows tends to perform worse at. Codegen however is mostly doing a bunch of multithreaded work and then at the end write an object file in one go with the writing taking up just a fraction of the total time. I honestly don't know what caused it. If you want to dig deeper, the self profile option of rustc could be useful. Otherwise I did recommend using WSL2 or nightly rustc until whatever improved it on nightly has propagated to stable.

1 Like

thank you for writing up all those details! I mainly built this test out of curiosity; do you think I should file a GH issue with rustc? (since even with nightly there's a big performance gap)

A github issue can't hurt.