Improve compile time and executable size by counting lines of LLVM IR

In my experience the bulk of Rust compile time is spent within LLVM, and that time is roughly proportional to number of lines of IR handed to LLVM. Disclaimer: not always the case but hopefully this is helpful anyway.

I made a Cargo subcommand cargo llvm-lines to count lines of LLVM IR broken down by source function. It was helpful to me today in tracking down two large improvements in serde_json compile time (1) (2) by revealing which functions contributed most of the IR or were instantiated more than expected.

The output is one line per function with three columns:

  1. Total number of lines of LLVM IR generated across all instantiations of the function.
  2. Number of instantiations of the function. For a generic function, roughly the number of distinct combinations of generic type parameters it is called with.
  3. Name of the function.

$ cargo llvm-lines | head -20

   2447  130  core::ptr::drop_in_place
   1720   19  <core::option::Option<T>>::map
    862    2  core::str::pattern::TwoWaySearcher::next
    726    4  <alloc::raw_vec::RawVec<T, A>>::double
    698    4  <alloc::raw_vec::RawVec<T, A>>::reserve
    677    6  <core::result::Result<T, E>>::map
    602    1  cargo_llvm_lines::read_llvm_ir
    598    5  <alloc::vec::Vec<T>>::extend_desugared
    477    1  cargo_llvm_lines::count_lines
    476    9  <alloc::raw_vec::RawVec<T, A>>::dealloc_buffer
    464   10  alloc::heap::box_free
    452    5  <alloc::vec::Vec<T> as alloc::vec::SpecExtend<T, I>>::spec_extend
    448    1  alloc::slice::merge_sort
    436    1  <std::process::Command as cargo_llvm_lines::PipeTo>::pipe_to
    419    4  <core::slice::Iter<'a, T> as core::iter::iterator::Iterator>::next
    400    1  <core::hash::sip::Sip13Rounds as core::hash::sip::Sip>::d_rounds
    378    9  <alloc::raw_vec::RawVec<T, A>>::current_layout
    362    3  <alloc::raw_vec::RawVec<T, A>>::allocate_in
    354    4  <alloc::vec::Vec<T>>::push
    341    4  <[T] as core::slice::SliceExt>::iter
20 Likes

Super helpful! Just run this on some of my code, and looks like the main contributor is serde_json, so I am looking forward to those improvements :smile:

λ cargo llvm-lines | head -3
   Compiling lang_fall v0.1.0 (file:///home/matklad/projects/fall/lang/fall)
    Finished dev [unoptimized + debuginfo] target(s) in 14.42 secs
  41272   31  <serde_json::de::Deserializer<R>>::parse_value
   6931   16  <serde_json::de::SeqAccess<'a, R> as serde::de::SeqAccess<'de>>::next_element_seed
   6008  403  core::ptr::drop_in_place
2 Likes

How can you tell a crate is really production ready? – The maintainers already implemented all the features, optimized the performance and are now optimizing the compilation speed :open_mouth:

(Also, I assume this will reduce binary size? In that case, I'm even more happy to see this!)

12 Likes

This is super awesome. Compile times are for me the absolute worst (other parts of Rust are much better, let's say). I'm totally up for tools to improve this! :smiley:

I'm just giving it a whirl and it seems like it still takes the full compile time to get the pre-LLVM line counts out; is that fundamental somehow? I'm looking at a binary that takes 20mins+ to build in release, and it seems to do a full debug build (352s). Some pretty mental results, though... Awesome data!

Echidnatron% time cargo llvm-lines
   Compiling tpchlike v0.1.0 (file:///Users/mcsherry/Projects/differential-dataflow/tpchlike)
    Finished dev [unoptimized + debuginfo] target(s) in 352.27 secs
1332608 2031  <core::slice::Iter<'a, T>>::search_while
 440031 4621  <core::option::Option<T>>::map
 419929 21396  core::ptr::drop_in_place
 347867  745  alloc::slice::merge_sort
 306676 1394  <alloc::raw_vec::RawVec<T, A>>::double
 219402  745  alloc::slice::insert_head
 208600  745  alloc::slice::merge
3 Likes

Also, I'm a bit ignorant here, but: can you explain the difference between

cargo llvm-lines
cargo llvm-lines --release

I had imagined that optimization went on in LLVM rather than beforehand, but I def get different results here (lots more of my code, rather than Rust's). Are they doing some inlining pre-LLVM, perhaps?

Echidnatron% time cargo llvm-lines --release
   Compiling tpchlike v0.1.0 (file:///Users/mcsherry/Projects/differential-dataflow/tpchlike)
    Finished release [optimized] target(s) in 1405.18 secs
 634792 12661  core::ptr::drop_in_place
 387275  330  <timely::dataflow::operators::generic::builder_raw::OperatorCore<T, PEP, PIP> as timely::progress::operate::Operate<T>>::pull_internal_progress
 280709   53  <differential_dataflow::collection::Collection<G, (K, V), R> as differential_dataflow::operators::arrange::Arrange<G, K, V, R, T>>::arrange
 161055   96  <differential_dataflow::operators::join::Deferred<K, V1, V2, T, R1, R2, R3, C1, C2, M, D>>::work
 157822   62  <differential_dataflow::trace::wrappers::rc::TraceBox<K, V, T, R, Tr>>::adjust_advance_frontier
 154876   62  <differential_dataflow::trace::wrappers::rc::TraceBox<K, V, T, R, Tr>>::adjust_through_frontier
 151942  165  <timely::dataflow::channels::pushers::buffer::Buffer<T, D, P>>::flush
 110488   62  differential_dataflow::trace::rc_blanket_impls::<impl differential_dataflow::trace::Batch<K, V, T, R> for alloc::rc::Rc<B>>::advance_mut
 110080   62  differential_dataflow::trace::rc_blanket_impls::<impl differential_dataflow::trace::Batch<K, V, T, R> for alloc::rc::Rc<B>>::merge
 109144  335  <timely::dataflow::operators::generic::builder_rc::OperatorBuilder<G>>::build::{{closure}}
 102932 1393  <alloc::raw_vec::RawVec<T, A>>::double

Wow, this is all super insightful. Lots of stuff I had no idea about. For example (for readers who aren't convinced)

  77971  425  timely_communication::Push::send
  74325  425  timely_communication::Push::done

This is apparently 70k lines each for two convenience methods in a trait that are simply:

    /// Pushes `element` and drops any resulting resources.
    #[inline(always)]
    fn send(&mut self, element: T) { self.push(&mut Some(element)); }
    /// Pushes `None`, conventionally signalling a flush.
    #[inline(always)]
    fn done(&mut self) { self.push(&mut None); }

No one actually needs to use them (and I'm a bit surprised that anyone does), but this really shines some light on costs I didn't understand. Tyvm!

Edit: Even less clear to me is why they show up at all, given the #[inline(always)] attribute (and their low complexity). Any thoughts, anyone?

@matklad I tried your crate and compile time in release mode drops from 40 seconds to 28 seconds with the master branch of serde_json. Stay tuned for a release!

@frankmcsherry with --release you may be seeing a combination of MIR optimizations and semantic differences between debug and release code, things like debug_assert! and arithmetic overflow behavior.

3 Likes

It's using --emit=llvm-ir, which IIRC is the result after LLVM optimizations. I don't know if there's a way to emit the initial input to LLVM.

2 Likes

From what I can tell (unscientifically running cargo rustc -- --emit=llvm-ir -Z time-passes and watch -n.1 ls target/debug/deps) the IR file is written after "llvm module passes" but before "codegen passes". So @cuviper is right that it is after LLVM optimizations.

  time: 1.310; rss: 281MB	translation
    time: 0.126; rss: 242MB	llvm function passes
    time: 0.069; rss: 242MB	llvm module passes
    time: 3.228; rss: 245MB	codegen passes

We should still be able to make cargo llvm-lines faster by killing the compiler as soon as the IR file is written without waiting for codegen. I filed dtolnay/cargo-llvm-lines#3 to follow up.

Seems like it's having some trouble when subcrates are involved:

$ git clone https://github.com/mit-pdos/distributary
$ cd distributary
$ cargo llvm-lines --manifest-path benchmarks/Cargo.toml --bin vote --release
   Compiling benchmarks v0.0.1 (file:///home/jon/dev/projects/distributary/benchmarks)
warning: ignoring emit path because multiple .ll files were produced
    Finished release [optimized + debuginfo] target(s) in 9.94 secs
      1    1  core::num::dec2flt::algorithm::fpu_precision::set_precision
$ cargo llvm-lines --lib --release
   Compiling distributary v0.1.0 (file:///home/jon/dev/projects/distributary)
warning: ignoring emit path because multiple .ll files were produced
    Finished release [optimized + debuginfo] target(s) in 90.79 secs
     34    1  <serde_json::read::Reference<'b, 'c, T> as core::ops::deref::Deref>::deref
1 Like

From the discussion in dtolnay/cargo-llvm-lines#4 the error "ignoring emit path because multiple .ll files were produced" may be a result of having CARGO_INCREMENTAL=1 environment variable set. Hopefully everything works if you unset it. I will work on a fix later this week -- but I would love a PR if someone gets to it before I do!

1 Like

Yup, works perfectly fine if I turn off CARGO_INCREMENTAL!

Huh, that's interesting.. I consistently see many more instantiations of HashMap::resize than HashMap::insert. Seems odd for there to be more of the former than the latter, no?

  • HashMap::reserve has two calls to resize
    • edit: does this matter if they're still the same monomorphization?
  • reserve is called by insert, entry, extend, from_iter

I assumed that multiple calls with the same monomorphization would only count as a single instance in the output of llvm-lines. Maybe @dtolnay can shed some light?

What are your actual numbers? I just tried on something using HashMap, and got exactly 1 resize, at 260 lines. So it's somewhat big, but not repeated in my case.

Running with --lib --release on distributary:

  31391   69  <std::collections::hash::map::HashMap<K, V, S>>::resize
  13901   22  <std::collections::hash::map::HashMap<K, V, S>>::insert
   4070    9  <std::collections::hash::map::HashMap<K, V, S>>::entry
   2660    7  <std::collections::hash::map::HashMap<K, V, S>>::remove
   2321    8  <std::collections::hash::map::HashMap<K, V, S>>::search
   1251    2  <std::collections::hash::map::HashMap<K, V, S>>::make_hash
   1088    4  <std::collections::hash::map::HashMap<K, V, S>>::search_mut
    845    5  <std::collections::hash::map::HashMap<K, V, S> as core::ops::index::Index<&'a Q>>::index
    563    3  <std::collections::hash::map::HashMap<K, V, S>>::get
    537    3  <std::collections::hash::map::HashMap<K, V, S>>::get_mut
    509    2  <std::collections::hash::map::HashMap<K, V, S> as core::clone::Clone>::clone
    366    2  <std::collections::hash::map::HashMap<K, V, S>>::contains_key
    297    3  <std::collections::hash::map::HashMap<K, V, S>>::reserve

@jonhoo you can track down all the instantiations in the *.ll file. Here is what I found from rg '^define.*HashMap.*resize' target/release/deps/*.ll and cleaning up the output a little:

69 instantiations of HashMap::resize
HashMap<&alloc::vec::Vec<petgraph::graph_impl::NodeIndex<u32>>, ()>
HashMap<&nom_sql::column::Column, ()>
HashMap<&str, ()>
HashMap<&str, alloc::rc::Rc<core::cell::RefCell<mir::node::MirNode>>>
HashMap<(alloc::string::String, alloc::string::String), controller::sql::query_graph::QueryGraphEdge>
HashMap<(alloc::string::String, usize), alloc::rc::Rc<core::cell::RefCell<mir::node::MirNode>>>
HashMap<(dataflow::domain::Index, usize), (channel::tcp::TcpSender<dataflow::payload::Packet>, bool), core::hash::BuildHasherDefault<fnv::FnvHasher>>
HashMap<(dataflow::domain::Index, usize), (dataflow::statistics::DomainStats, std::collections::hash::map::HashMap<petgraph::graph_impl::NodeIndex<u32>, dataflow::statistics::NodeStats>)>
HashMap<(dataflow::domain::Index, usize), (std::net::addr::SocketAddr, bool)>
HashMap<(dataflow::domain::Index, usize), usize, core::hash::BuildHasherDefault<fnv::FnvHasher>>
HashMap<(petgraph::graph_impl::NodeIndex<u32>, petgraph::graph_impl::NodeIndex<u32>), ()>
HashMap<(petgraph::graph_impl::NodeIndex<u32>, petgraph::graph_impl::NodeIndex<u32>), petgraph::graph_impl::NodeIndex<u32>>
HashMap<(petgraph::graph_impl::NodeIndex<u32>, usize), (dataflow::backlog::SingleReadHandle, core::option::Option<dataflow::checktable::TokenGenerator>)>
HashMap<alloc::rc::Rc<alloc::string::String>, alloc::vec::Vec<hyper::client::pool::Entry<core::cell::RefCell<futures::sync::mpsc::Sender<(hyper::proto::MessageHead<hyper::proto::RequestLine>, core::option::Option<hyper::proto::body::Body>, futures::sync::oneshot::Sender<core::result::Result<hyper::proto::response::Response<hyper::proto::body::Body>, hyper::error::Error>>)>>>>>
HashMap<alloc::rc::Rc<alloc::string::String>, alloc::vec::Vec<hyper::client::pool::Entry<tokio_proto::util::client_proxy::ClientProxy<tokio_proto::streaming::message::Message<hyper::proto::MessageHead<hyper::proto::RequestLine>, hyper::proto::body::Body>, tokio_proto::streaming::message::Message<hyper::proto::MessageHead<hyper::proto::RawStatus>, tokio_proto::streaming::body::Body<hyper::proto::chunk::Chunk, hyper::error::Error>>, hyper::error::Error>>>>
HashMap<alloc::rc::Rc<alloc::string::String>, alloc::vec_deque::VecDeque<relay::Sender<hyper::client::pool::Entry<core::cell::RefCell<futures::sync::mpsc::Sender<(hyper::proto::MessageHead<hyper::proto::RequestLine>, core::option::Option<hyper::proto::body::Body>, futures::sync::oneshot::Sender<core::result::Result<hyper::proto::response::Response<hyper::proto::body::Body>, hyper::error::Error>>)>>>>>>
HashMap<alloc::rc::Rc<alloc::string::String>, alloc::vec_deque::VecDeque<relay::Sender<hyper::client::pool::Entry<tokio_proto::util::client_proxy::ClientProxy<tokio_proto::streaming::message::Message<hyper::proto::MessageHead<hyper::proto::RequestLine>, hyper::proto::body::Body>, tokio_proto::streaming::message::Message<hyper::proto::MessageHead<hyper::proto::RawStatus>, tokio_proto::streaming::body::Body<hyper::proto::chunk::Chunk, hyper::error::Error>>, hyper::error::Error>>>>>
HashMap<alloc::string::String, ()>
HashMap<alloc::string::String, alloc::string::String>
HashMap<alloc::string::String, alloc::vec::Vec<(usize, alloc::vec::Vec<nom_sql::column::ColumnSpecification>)>>
HashMap<alloc::string::String, alloc::vec::Vec<alloc::string::String>>
HashMap<alloc::string::String, alloc::vec::Vec<nom_sql::condition::ConditionExpression>>
HashMap<alloc::string::String, controller::sql::query_graph::QueryGraphNode>
HashMap<alloc::string::String, petgraph::graph_impl::NodeIndex<u32>>
HashMap<alloc::string::String, u64>
HashMap<alloc::string::String, usize>
HashMap<alloc::vec::Vec<core::data::DataType>, ()>
HashMap<alloc::vec::Vec<core::data::DataType>, usize>
HashMap<alloc::vec::Vec<usize>, ()>
HashMap<alloc::vec::Vec<usize>, alloc::vec::Vec<(core::local::Tag, dataflow::domain::Index)>>
HashMap<core::addressing::IndexPair, alloc::vec::Vec<usize>>
HashMap<core::addressing::IndexPair, usize>
HashMap<core::addressing::LocalNodeIndex, ()>
HashMap<core::any::TypeId, alloc::boxed::Box<hyper::header::Header>>
HashMap<core::any::TypeId, alloc::boxed::Box<hyper::header::HeaderFormat>>
HashMap<core::data::DataType, alloc::vec::Vec<alloc::vec::Vec<core::data::DataType>>, core::hash::BuildHasherDefault<fnv::FnvHasher>>
HashMap<core::data::DataType, core::local::Map<core::data::Records>>
HashMap<core::local::Tag, alloc::vec::Vec<dataflow::domain::Index>>
HashMap<core::local::Tag, alloc::vec::Vec<petgraph::graph_impl::NodeIndex<u32>>>
HashMap<core::local::Tag, petgraph::graph_impl::NodeIndex<u32>>
HashMap<dataflow::domain::Index, ()>
HashMap<dataflow::domain::Index, (std::collections::hash::map::HashMap<petgraph::graph_impl::NodeIndex<u32>, usize>, std::collections::hash::map::HashMap<petgraph::graph_impl::NodeIndex<u32>, alloc::vec::Vec<core::addressing::LocalNodeIndex>>)>
HashMap<dataflow::domain::Index, alloc::vec::Vec<(petgraph::graph_impl::NodeIndex<u32>, bool)>>
HashMap<dataflow::domain::Index, alloc::vec::Vec<petgraph::graph_impl::NodeIndex<u32>>>
HashMap<dataflow::domain::Index, controller::domain_handle::DomainHandle>
HashMap<dataflow::domain::Index, i64>
HashMap<dataflow::domain::Index, std::collections::hash::map::HashMap<petgraph::graph_impl::NodeIndex<u32>, core::addressing::IndexPair>>
HashMap<hyper::method::Method, rustful::router::variables::Variables<alloc::boxed::Box<rustful::handler::Handler>>>
HashMap<nom_sql::column::Column, ()>
HashMap<nom_sql::column::Column, alloc::vec::Vec<&nom_sql::condition::ConditionExpression>>
HashMap<petgraph::graph_impl::NodeIndex<u32>, ()>
HashMap<petgraph::graph_impl::NodeIndex<u32>, (alloc::vec::Vec<usize>, bool)>
HashMap<petgraph::graph_impl::NodeIndex<u32>, alloc::vec::Vec<core::addressing::LocalNodeIndex>>
HashMap<petgraph::graph_impl::NodeIndex<u32>, alloc::vec::Vec<usize>>
HashMap<petgraph::graph_impl::NodeIndex<u32>, core::addressing::IndexPair>
HashMap<petgraph::graph_impl::NodeIndex<u32>, dataflow::Sharding>
HashMap<petgraph::graph_impl::NodeIndex<u32>, dataflow::statistics::NodeStats>
HashMap<petgraph::graph_impl::NodeIndex<u32>, petgraph::graph_impl::NodeIndex<u32>>
HashMap<petgraph::graph_impl::NodeIndex<u32>, std::collections::hash::set::HashSet<alloc::vec::Vec<usize>>>
HashMap<petgraph::graph_impl::NodeIndex<u32>, usize>
HashMap<rustful::context::maybe_utf8::MaybeUtf8<alloc::string::String, alloc::vec::Vec<u8>>, rustful::router::tree_router::TreeRouter<rustful::router::method_router::MethodRouter<rustful::router::variables::Variables<alloc::boxed::Box<rustful::handler::Handler>>>>>
HashMap<std::net::addr::SocketAddr, controller::WorkerStatus>
HashMap<std::net::addr::SocketAddr, std::net::addr::SocketAddr>
HashMap<u64, (controller::sql::query_graph::QueryGraph, mir::query::MirQuery)>
HashMap<u64, (core::option::Option<alloc::string::String>, nom_sql::parser::SqlQuery, bool)>
HashMap<u64, futures::sync::oneshot::Sender<core::result::Result<tokio_proto::streaming::message::Message<core::result::Result<core::result::Result<dataflow::checktable::service::Response__, tarpc::errors::WireError<dataflow::checktable::service::Error__>>, alloc::boxed::Box<bincode::internal::ErrorKind>>, tokio_proto::streaming::body::Body<(), std::io::error::Error>>, std::io::error::Error>>>
HashMap<u64, tokio_proto::streaming::multiplex::advanced::Exchange<tokio_proto::streaming::multiplex::client::Dispatch<tokio_proto::simple::LiftProto<tarpc::protocol::Proto<dataflow::checktable::service::Request__, core::result::Result<dataflow::checktable::service::Response__, tarpc::errors::WireError<dataflow::checktable::service::Error__>>>>, tarpc::stream_type::StreamType, futures::stream::empty::Empty<(), std::io::error::Error>>>>
HashMap<usize, u64>
HashMap<usize, usize>
22 instantiations of HashMap::insert
HashMap<(alloc::string::String, usize), alloc::rc::Rc<core::cell::RefCell<mir::node::MirNode>>>
HashMap<(dataflow::domain::Index, usize), (dataflow::statistics::DomainStats, std::collections::hash::map::HashMap<petgraph::graph_impl::NodeIndex<u32>, dataflow::statistics::NodeStats>)>
HashMap<(dataflow::domain::Index, usize), usize, core::hash::BuildHasherDefault<fnv::FnvHasher>>
HashMap<(petgraph::graph_impl::NodeIndex<u32>, petgraph::graph_impl::NodeIndex<u32>), petgraph::graph_impl::NodeIndex<u32>>
HashMap<alloc::string::String, ()>
HashMap<alloc::string::String, controller::sql::query_graph::QueryGraphNode>
HashMap<alloc::string::String, petgraph::graph_impl::NodeIndex<u32>>
HashMap<alloc::string::String, usize>
HashMap<alloc::vec::Vec<usize>, ()>
HashMap<core::any::TypeId, alloc::boxed::Box<hyper::header::Header>>
HashMap<core::any::TypeId, alloc::boxed::Box<hyper::header::HeaderFormat>>
HashMap<dataflow::domain::Index, ()>
HashMap<dataflow::domain::Index, i64>
HashMap<nom_sql::column::Column, ()>
HashMap<petgraph::graph_impl::NodeIndex<u32>, ()>
HashMap<petgraph::graph_impl::NodeIndex<u32>, (alloc::vec::Vec<usize>, bool)>
HashMap<petgraph::graph_impl::NodeIndex<u32>, alloc::vec::Vec<core::addressing::LocalNodeIndex>>
HashMap<petgraph::graph_impl::NodeIndex<u32>, core::addressing::IndexPair>
HashMap<petgraph::graph_impl::NodeIndex<u32>, dataflow::Sharding>
HashMap<petgraph::graph_impl::NodeIndex<u32>, dataflow::statistics::NodeStats>
HashMap<petgraph::graph_impl::NodeIndex<u32>, usize>
HashMap<u64, (controller::sql::query_graph::QueryGraph, mir::query::MirQuery)>
1 Like

Ah, that's useful. Still seems weird to have implementations of resize for more types than insert though?

EDIT: Also, what command do you run to generate the .ll files?

Also, what command do you run to generate the .ll files?

cargo rustc --lib --release -- --emit=llvm-ir

Still seems weird to have implementations of resize for more types than insert though?

¯\_(ツ)_/¯

I just picked one from your list, HashMap::<&str, ()>::resize which is instantiated here even though insert is never instantiated. Edit: collect calls from_iter which calls extend which calls insert so I guess that has been inlined and does not count..?

2 Likes