Improve compile time and executable size by counting lines of LLVM IR


#1

In my experience the bulk of Rust compile time is spent within LLVM, and that time is roughly proportional to number of lines of IR handed to LLVM. Disclaimer: not always the case but hopefully this is helpful anyway.

I made a Cargo subcommand cargo llvm-lines to count lines of LLVM IR broken down by source function. It was helpful to me today in tracking down two large improvements in serde_json compile time (1) (2) by revealing which functions contributed most of the IR or were instantiated more than expected.

The output is one line per function with three columns:

  1. Total number of lines of LLVM IR generated across all instantiations of the function.
  2. Number of instantiations of the function. For a generic function, roughly the number of distinct combinations of generic type parameters it is called with.
  3. Name of the function.

$ cargo llvm-lines | head -20

   2447  130  core::ptr::drop_in_place
   1720   19  <core::option::Option<T>>::map
    862    2  core::str::pattern::TwoWaySearcher::next
    726    4  <alloc::raw_vec::RawVec<T, A>>::double
    698    4  <alloc::raw_vec::RawVec<T, A>>::reserve
    677    6  <core::result::Result<T, E>>::map
    602    1  cargo_llvm_lines::read_llvm_ir
    598    5  <alloc::vec::Vec<T>>::extend_desugared
    477    1  cargo_llvm_lines::count_lines
    476    9  <alloc::raw_vec::RawVec<T, A>>::dealloc_buffer
    464   10  alloc::heap::box_free
    452    5  <alloc::vec::Vec<T> as alloc::vec::SpecExtend<T, I>>::spec_extend
    448    1  alloc::slice::merge_sort
    436    1  <std::process::Command as cargo_llvm_lines::PipeTo>::pipe_to
    419    4  <core::slice::Iter<'a, T> as core::iter::iterator::Iterator>::next
    400    1  <core::hash::sip::Sip13Rounds as core::hash::sip::Sip>::d_rounds
    378    9  <alloc::raw_vec::RawVec<T, A>>::current_layout
    362    3  <alloc::raw_vec::RawVec<T, A>>::allocate_in
    354    4  <alloc::vec::Vec<T>>::push
    341    4  <[T] as core::slice::SliceExt>::iter

#2

Super helpful! Just run this on some of my code, and looks like the main contributor is serde_json, so I am looking forward to those improvements :smile:

λ cargo llvm-lines | head -3
   Compiling lang_fall v0.1.0 (file:///home/matklad/projects/fall/lang/fall)
    Finished dev [unoptimized + debuginfo] target(s) in 14.42 secs
  41272   31  <serde_json::de::Deserializer<R>>::parse_value
   6931   16  <serde_json::de::SeqAccess<'a, R> as serde::de::SeqAccess<'de>>::next_element_seed
   6008  403  core::ptr::drop_in_place

#3

How can you tell a crate is really production ready? – The maintainers already implemented all the features, optimized the performance and are now optimizing the compilation speed :open_mouth:

(Also, I assume this will reduce binary size? In that case, I’m even more happy to see this!)


#4

This is super awesome. Compile times are for me the absolute worst (other parts of Rust are much better, let’s say). I’m totally up for tools to improve this! :smiley:

I’m just giving it a whirl and it seems like it still takes the full compile time to get the pre-LLVM line counts out; is that fundamental somehow? I’m looking at a binary that takes 20mins+ to build in release, and it seems to do a full debug build (352s). Some pretty mental results, though… Awesome data!

Echidnatron% time cargo llvm-lines
   Compiling tpchlike v0.1.0 (file:///Users/mcsherry/Projects/differential-dataflow/tpchlike)
    Finished dev [unoptimized + debuginfo] target(s) in 352.27 secs
1332608 2031  <core::slice::Iter<'a, T>>::search_while
 440031 4621  <core::option::Option<T>>::map
 419929 21396  core::ptr::drop_in_place
 347867  745  alloc::slice::merge_sort
 306676 1394  <alloc::raw_vec::RawVec<T, A>>::double
 219402  745  alloc::slice::insert_head
 208600  745  alloc::slice::merge

#5

Also, I’m a bit ignorant here, but: can you explain the difference between

cargo llvm-lines
cargo llvm-lines --release

I had imagined that optimization went on in LLVM rather than beforehand, but I def get different results here (lots more of my code, rather than Rust’s). Are they doing some inlining pre-LLVM, perhaps?

Echidnatron% time cargo llvm-lines --release
   Compiling tpchlike v0.1.0 (file:///Users/mcsherry/Projects/differential-dataflow/tpchlike)
    Finished release [optimized] target(s) in 1405.18 secs
 634792 12661  core::ptr::drop_in_place
 387275  330  <timely::dataflow::operators::generic::builder_raw::OperatorCore<T, PEP, PIP> as timely::progress::operate::Operate<T>>::pull_internal_progress
 280709   53  <differential_dataflow::collection::Collection<G, (K, V), R> as differential_dataflow::operators::arrange::Arrange<G, K, V, R, T>>::arrange
 161055   96  <differential_dataflow::operators::join::Deferred<K, V1, V2, T, R1, R2, R3, C1, C2, M, D>>::work
 157822   62  <differential_dataflow::trace::wrappers::rc::TraceBox<K, V, T, R, Tr>>::adjust_advance_frontier
 154876   62  <differential_dataflow::trace::wrappers::rc::TraceBox<K, V, T, R, Tr>>::adjust_through_frontier
 151942  165  <timely::dataflow::channels::pushers::buffer::Buffer<T, D, P>>::flush
 110488   62  differential_dataflow::trace::rc_blanket_impls::<impl differential_dataflow::trace::Batch<K, V, T, R> for alloc::rc::Rc<B>>::advance_mut
 110080   62  differential_dataflow::trace::rc_blanket_impls::<impl differential_dataflow::trace::Batch<K, V, T, R> for alloc::rc::Rc<B>>::merge
 109144  335  <timely::dataflow::operators::generic::builder_rc::OperatorBuilder<G>>::build::{{closure}}
 102932 1393  <alloc::raw_vec::RawVec<T, A>>::double

#6

Wow, this is all super insightful. Lots of stuff I had no idea about. For example (for readers who aren’t convinced)

  77971  425  timely_communication::Push::send
  74325  425  timely_communication::Push::done

This is apparently 70k lines each for two convenience methods in a trait that are simply:

    /// Pushes `element` and drops any resulting resources.
    #[inline(always)]
    fn send(&mut self, element: T) { self.push(&mut Some(element)); }
    /// Pushes `None`, conventionally signalling a flush.
    #[inline(always)]
    fn done(&mut self) { self.push(&mut None); }

No one actually needs to use them (and I’m a bit surprised that anyone does), but this really shines some light on costs I didn’t understand. Tyvm!

Edit: Even less clear to me is why they show up at all, given the #[inline(always)] attribute (and their low complexity). Any thoughts, anyone?


#7

@matklad I tried your crate and compile time in release mode drops from 40 seconds to 28 seconds with the master branch of serde_json. Stay tuned for a release!

@frankmcsherry with --release you may be seeing a combination of MIR optimizations and semantic differences between debug and release code, things like debug_assert! and arithmetic overflow behavior.


#8

It’s using --emit=llvm-ir, which IIRC is the result after LLVM optimizations. I don’t know if there’s a way to emit the initial input to LLVM.


#9

From what I can tell (unscientifically running cargo rustc -- --emit=llvm-ir -Z time-passes and watch -n.1 ls target/debug/deps) the IR file is written after “llvm module passes” but before “codegen passes”. So @cuviper is right that it is after LLVM optimizations.

  time: 1.310; rss: 281MB	translation
    time: 0.126; rss: 242MB	llvm function passes
    time: 0.069; rss: 242MB	llvm module passes
    time: 3.228; rss: 245MB	codegen passes

We should still be able to make cargo llvm-lines faster by killing the compiler as soon as the IR file is written without waiting for codegen. I filed dtolnay/cargo-llvm-lines#3 to follow up.


#10

Seems like it’s having some trouble when subcrates are involved:

$ git clone https://github.com/mit-pdos/distributary
$ cd distributary
$ cargo llvm-lines --manifest-path benchmarks/Cargo.toml --bin vote --release
   Compiling benchmarks v0.0.1 (file:///home/jon/dev/projects/distributary/benchmarks)
warning: ignoring emit path because multiple .ll files were produced
    Finished release [optimized + debuginfo] target(s) in 9.94 secs
      1    1  core::num::dec2flt::algorithm::fpu_precision::set_precision
$ cargo llvm-lines --lib --release
   Compiling distributary v0.1.0 (file:///home/jon/dev/projects/distributary)
warning: ignoring emit path because multiple .ll files were produced
    Finished release [optimized + debuginfo] target(s) in 90.79 secs
     34    1  <serde_json::read::Reference<'b, 'c, T> as core::ops::deref::Deref>::deref

#11

From the discussion in dtolnay/cargo-llvm-lines#4 the error "ignoring emit path because multiple .ll files were produced" may be a result of having CARGO_INCREMENTAL=1 environment variable set. Hopefully everything works if you unset it. I will work on a fix later this week – but I would love a PR if someone gets to it before I do!


#12

Yup, works perfectly fine if I turn off CARGO_INCREMENTAL!


#13

Huh, that’s interesting… I consistently see many more instantiations of HashMap::resize than HashMap::insert. Seems odd for there to be more of the former than the latter, no?


#14
  • HashMap::reserve has two calls to resize
    • edit: does this matter if they’re still the same monomorphization?
  • reserve is called by insert, entry, extend, from_iter

#15

I assumed that multiple calls with the same monomorphization would only count as a single instance in the output of llvm-lines. Maybe @dtolnay can shed some light?


#16

What are your actual numbers? I just tried on something using HashMap, and got exactly 1 resize, at 260 lines. So it’s somewhat big, but not repeated in my case.


#17

Running with --lib --release on distributary:

  31391   69  <std::collections::hash::map::HashMap<K, V, S>>::resize
  13901   22  <std::collections::hash::map::HashMap<K, V, S>>::insert
   4070    9  <std::collections::hash::map::HashMap<K, V, S>>::entry
   2660    7  <std::collections::hash::map::HashMap<K, V, S>>::remove
   2321    8  <std::collections::hash::map::HashMap<K, V, S>>::search
   1251    2  <std::collections::hash::map::HashMap<K, V, S>>::make_hash
   1088    4  <std::collections::hash::map::HashMap<K, V, S>>::search_mut
    845    5  <std::collections::hash::map::HashMap<K, V, S> as core::ops::index::Index<&'a Q>>::index
    563    3  <std::collections::hash::map::HashMap<K, V, S>>::get
    537    3  <std::collections::hash::map::HashMap<K, V, S>>::get_mut
    509    2  <std::collections::hash::map::HashMap<K, V, S> as core::clone::Clone>::clone
    366    2  <std::collections::hash::map::HashMap<K, V, S>>::contains_key
    297    3  <std::collections::hash::map::HashMap<K, V, S>>::reserve

#18

@jonhoo you can track down all the instantiations in the *.ll file. Here is what I found from rg '^define.*HashMap.*resize' target/release/deps/*.ll and cleaning up the output a little:

69 instantiations of HashMap::resize
HashMap<&alloc::vec::Vec<petgraph::graph_impl::NodeIndex<u32>>, ()>
HashMap<&nom_sql::column::Column, ()>
HashMap<&str, ()>
HashMap<&str, alloc::rc::Rc<core::cell::RefCell<mir::node::MirNode>>>
HashMap<(alloc::string::String, alloc::string::String), controller::sql::query_graph::QueryGraphEdge>
HashMap<(alloc::string::String, usize), alloc::rc::Rc<core::cell::RefCell<mir::node::MirNode>>>
HashMap<(dataflow::domain::Index, usize), (channel::tcp::TcpSender<dataflow::payload::Packet>, bool), core::hash::BuildHasherDefault<fnv::FnvHasher>>
HashMap<(dataflow::domain::Index, usize), (dataflow::statistics::DomainStats, std::collections::hash::map::HashMap<petgraph::graph_impl::NodeIndex<u32>, dataflow::statistics::NodeStats>)>
HashMap<(dataflow::domain::Index, usize), (std::net::addr::SocketAddr, bool)>
HashMap<(dataflow::domain::Index, usize), usize, core::hash::BuildHasherDefault<fnv::FnvHasher>>
HashMap<(petgraph::graph_impl::NodeIndex<u32>, petgraph::graph_impl::NodeIndex<u32>), ()>
HashMap<(petgraph::graph_impl::NodeIndex<u32>, petgraph::graph_impl::NodeIndex<u32>), petgraph::graph_impl::NodeIndex<u32>>
HashMap<(petgraph::graph_impl::NodeIndex<u32>, usize), (dataflow::backlog::SingleReadHandle, core::option::Option<dataflow::checktable::TokenGenerator>)>
HashMap<alloc::rc::Rc<alloc::string::String>, alloc::vec::Vec<hyper::client::pool::Entry<core::cell::RefCell<futures::sync::mpsc::Sender<(hyper::proto::MessageHead<hyper::proto::RequestLine>, core::option::Option<hyper::proto::body::Body>, futures::sync::oneshot::Sender<core::result::Result<hyper::proto::response::Response<hyper::proto::body::Body>, hyper::error::Error>>)>>>>>
HashMap<alloc::rc::Rc<alloc::string::String>, alloc::vec::Vec<hyper::client::pool::Entry<tokio_proto::util::client_proxy::ClientProxy<tokio_proto::streaming::message::Message<hyper::proto::MessageHead<hyper::proto::RequestLine>, hyper::proto::body::Body>, tokio_proto::streaming::message::Message<hyper::proto::MessageHead<hyper::proto::RawStatus>, tokio_proto::streaming::body::Body<hyper::proto::chunk::Chunk, hyper::error::Error>>, hyper::error::Error>>>>
HashMap<alloc::rc::Rc<alloc::string::String>, alloc::vec_deque::VecDeque<relay::Sender<hyper::client::pool::Entry<core::cell::RefCell<futures::sync::mpsc::Sender<(hyper::proto::MessageHead<hyper::proto::RequestLine>, core::option::Option<hyper::proto::body::Body>, futures::sync::oneshot::Sender<core::result::Result<hyper::proto::response::Response<hyper::proto::body::Body>, hyper::error::Error>>)>>>>>>
HashMap<alloc::rc::Rc<alloc::string::String>, alloc::vec_deque::VecDeque<relay::Sender<hyper::client::pool::Entry<tokio_proto::util::client_proxy::ClientProxy<tokio_proto::streaming::message::Message<hyper::proto::MessageHead<hyper::proto::RequestLine>, hyper::proto::body::Body>, tokio_proto::streaming::message::Message<hyper::proto::MessageHead<hyper::proto::RawStatus>, tokio_proto::streaming::body::Body<hyper::proto::chunk::Chunk, hyper::error::Error>>, hyper::error::Error>>>>>
HashMap<alloc::string::String, ()>
HashMap<alloc::string::String, alloc::string::String>
HashMap<alloc::string::String, alloc::vec::Vec<(usize, alloc::vec::Vec<nom_sql::column::ColumnSpecification>)>>
HashMap<alloc::string::String, alloc::vec::Vec<alloc::string::String>>
HashMap<alloc::string::String, alloc::vec::Vec<nom_sql::condition::ConditionExpression>>
HashMap<alloc::string::String, controller::sql::query_graph::QueryGraphNode>
HashMap<alloc::string::String, petgraph::graph_impl::NodeIndex<u32>>
HashMap<alloc::string::String, u64>
HashMap<alloc::string::String, usize>
HashMap<alloc::vec::Vec<core::data::DataType>, ()>
HashMap<alloc::vec::Vec<core::data::DataType>, usize>
HashMap<alloc::vec::Vec<usize>, ()>
HashMap<alloc::vec::Vec<usize>, alloc::vec::Vec<(core::local::Tag, dataflow::domain::Index)>>
HashMap<core::addressing::IndexPair, alloc::vec::Vec<usize>>
HashMap<core::addressing::IndexPair, usize>
HashMap<core::addressing::LocalNodeIndex, ()>
HashMap<core::any::TypeId, alloc::boxed::Box<hyper::header::Header>>
HashMap<core::any::TypeId, alloc::boxed::Box<hyper::header::HeaderFormat>>
HashMap<core::data::DataType, alloc::vec::Vec<alloc::vec::Vec<core::data::DataType>>, core::hash::BuildHasherDefault<fnv::FnvHasher>>
HashMap<core::data::DataType, core::local::Map<core::data::Records>>
HashMap<core::local::Tag, alloc::vec::Vec<dataflow::domain::Index>>
HashMap<core::local::Tag, alloc::vec::Vec<petgraph::graph_impl::NodeIndex<u32>>>
HashMap<core::local::Tag, petgraph::graph_impl::NodeIndex<u32>>
HashMap<dataflow::domain::Index, ()>
HashMap<dataflow::domain::Index, (std::collections::hash::map::HashMap<petgraph::graph_impl::NodeIndex<u32>, usize>, std::collections::hash::map::HashMap<petgraph::graph_impl::NodeIndex<u32>, alloc::vec::Vec<core::addressing::LocalNodeIndex>>)>
HashMap<dataflow::domain::Index, alloc::vec::Vec<(petgraph::graph_impl::NodeIndex<u32>, bool)>>
HashMap<dataflow::domain::Index, alloc::vec::Vec<petgraph::graph_impl::NodeIndex<u32>>>
HashMap<dataflow::domain::Index, controller::domain_handle::DomainHandle>
HashMap<dataflow::domain::Index, i64>
HashMap<dataflow::domain::Index, std::collections::hash::map::HashMap<petgraph::graph_impl::NodeIndex<u32>, core::addressing::IndexPair>>
HashMap<hyper::method::Method, rustful::router::variables::Variables<alloc::boxed::Box<rustful::handler::Handler>>>
HashMap<nom_sql::column::Column, ()>
HashMap<nom_sql::column::Column, alloc::vec::Vec<&nom_sql::condition::ConditionExpression>>
HashMap<petgraph::graph_impl::NodeIndex<u32>, ()>
HashMap<petgraph::graph_impl::NodeIndex<u32>, (alloc::vec::Vec<usize>, bool)>
HashMap<petgraph::graph_impl::NodeIndex<u32>, alloc::vec::Vec<core::addressing::LocalNodeIndex>>
HashMap<petgraph::graph_impl::NodeIndex<u32>, alloc::vec::Vec<usize>>
HashMap<petgraph::graph_impl::NodeIndex<u32>, core::addressing::IndexPair>
HashMap<petgraph::graph_impl::NodeIndex<u32>, dataflow::Sharding>
HashMap<petgraph::graph_impl::NodeIndex<u32>, dataflow::statistics::NodeStats>
HashMap<petgraph::graph_impl::NodeIndex<u32>, petgraph::graph_impl::NodeIndex<u32>>
HashMap<petgraph::graph_impl::NodeIndex<u32>, std::collections::hash::set::HashSet<alloc::vec::Vec<usize>>>
HashMap<petgraph::graph_impl::NodeIndex<u32>, usize>
HashMap<rustful::context::maybe_utf8::MaybeUtf8<alloc::string::String, alloc::vec::Vec<u8>>, rustful::router::tree_router::TreeRouter<rustful::router::method_router::MethodRouter<rustful::router::variables::Variables<alloc::boxed::Box<rustful::handler::Handler>>>>>
HashMap<std::net::addr::SocketAddr, controller::WorkerStatus>
HashMap<std::net::addr::SocketAddr, std::net::addr::SocketAddr>
HashMap<u64, (controller::sql::query_graph::QueryGraph, mir::query::MirQuery)>
HashMap<u64, (core::option::Option<alloc::string::String>, nom_sql::parser::SqlQuery, bool)>
HashMap<u64, futures::sync::oneshot::Sender<core::result::Result<tokio_proto::streaming::message::Message<core::result::Result<core::result::Result<dataflow::checktable::service::Response__, tarpc::errors::WireError<dataflow::checktable::service::Error__>>, alloc::boxed::Box<bincode::internal::ErrorKind>>, tokio_proto::streaming::body::Body<(), std::io::error::Error>>, std::io::error::Error>>>
HashMap<u64, tokio_proto::streaming::multiplex::advanced::Exchange<tokio_proto::streaming::multiplex::client::Dispatch<tokio_proto::simple::LiftProto<tarpc::protocol::Proto<dataflow::checktable::service::Request__, core::result::Result<dataflow::checktable::service::Response__, tarpc::errors::WireError<dataflow::checktable::service::Error__>>>>, tarpc::stream_type::StreamType, futures::stream::empty::Empty<(), std::io::error::Error>>>>
HashMap<usize, u64>
HashMap<usize, usize>
22 instantiations of HashMap::insert
HashMap<(alloc::string::String, usize), alloc::rc::Rc<core::cell::RefCell<mir::node::MirNode>>>
HashMap<(dataflow::domain::Index, usize), (dataflow::statistics::DomainStats, std::collections::hash::map::HashMap<petgraph::graph_impl::NodeIndex<u32>, dataflow::statistics::NodeStats>)>
HashMap<(dataflow::domain::Index, usize), usize, core::hash::BuildHasherDefault<fnv::FnvHasher>>
HashMap<(petgraph::graph_impl::NodeIndex<u32>, petgraph::graph_impl::NodeIndex<u32>), petgraph::graph_impl::NodeIndex<u32>>
HashMap<alloc::string::String, ()>
HashMap<alloc::string::String, controller::sql::query_graph::QueryGraphNode>
HashMap<alloc::string::String, petgraph::graph_impl::NodeIndex<u32>>
HashMap<alloc::string::String, usize>
HashMap<alloc::vec::Vec<usize>, ()>
HashMap<core::any::TypeId, alloc::boxed::Box<hyper::header::Header>>
HashMap<core::any::TypeId, alloc::boxed::Box<hyper::header::HeaderFormat>>
HashMap<dataflow::domain::Index, ()>
HashMap<dataflow::domain::Index, i64>
HashMap<nom_sql::column::Column, ()>
HashMap<petgraph::graph_impl::NodeIndex<u32>, ()>
HashMap<petgraph::graph_impl::NodeIndex<u32>, (alloc::vec::Vec<usize>, bool)>
HashMap<petgraph::graph_impl::NodeIndex<u32>, alloc::vec::Vec<core::addressing::LocalNodeIndex>>
HashMap<petgraph::graph_impl::NodeIndex<u32>, core::addressing::IndexPair>
HashMap<petgraph::graph_impl::NodeIndex<u32>, dataflow::Sharding>
HashMap<petgraph::graph_impl::NodeIndex<u32>, dataflow::statistics::NodeStats>
HashMap<petgraph::graph_impl::NodeIndex<u32>, usize>
HashMap<u64, (controller::sql::query_graph::QueryGraph, mir::query::MirQuery)>

#19

Ah, that’s useful. Still seems weird to have implementations of resize for more types than insert though?

EDIT: Also, what command do you run to generate the .ll files?


#20

Also, what command do you run to generate the .ll files?

cargo rustc --lib --release -- --emit=llvm-ir

Still seems weird to have implementations of resize for more types than insert though?

¯\_(ツ)_/¯

I just picked one from your list, HashMap::<&str, ()>::resize which is instantiated here even though insert is never instantiated. Edit: collect calls from_iter which calls extend which calls insert so I guess that has been inlined and does not count…?