Reducing binary size through shared library

First post here, and apologies in advance for what may be a noob question. I'd like to better understand how shared libraries can be used to reduce overall binary size.

For context, I'm developing for a constrained environment with a couple of MBs of available storage, so keeping binary size in check is a priority. I'd like to run a few concurrent processes that share a common set of interfaces, with each performing some logic. In this particular case, all processes connect to an MQTT broker, and do work on different topics.

My original vision for this was to have the MQTT interface and connection/publish/subscribe mechanics abstracted away in a shared library. And then have separate binaries that use that library for the MQTT interface, and themselves only contain their specific (and quite minimal) logic.
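For readers following along, here is a minimal sketch of that split, collapsed into one file so it runs on its own. The module `mqttexample` stands in for the library crate, and the `Client` type, the `run` function, and the topic names are all hypothetical placeholders rather than a real MQTT API; nothing here talks to a broker.

```rust
// Sketch only: in a real project `mqttexample` would be the library crate
// (src/lib.rs) and each `*_main` function a separate file under src/bin/.
mod mqttexample {
    /// Hypothetical stand-in for an MQTT client; it records messages
    /// instead of talking to a broker.
    #[derive(Default)]
    pub struct Client {
        pub sent: Vec<(String, String)>,
    }

    impl Client {
        pub fn publish(&mut self, topic: &str, payload: &str) {
            self.sent.push((topic.to_string(), payload.to_string()));
        }
    }

    /// The single entry point each binary calls: all shared
    /// connect/publish mechanics would live behind this function.
    pub fn run(topic: &str, payloads: &[&str]) -> Client {
        let mut client = Client::default();
        for payload in payloads {
            client.publish(topic, payload);
        }
        client
    }
}

// Would be src/bin/pub1.rs: nothing but a call into the library.
fn pub1_main() -> mqttexample::Client {
    mqttexample::run("sensors/temperature", &["21.5", "21.7"])
}

// Would be src/bin/pub2.rs (topic name is made up for illustration).
fn pub2_main() -> mqttexample::Client {
    mqttexample::run("sensors/humidity", &["40"])
}

fn main() {
    for client in [pub1_main(), pub2_main()] {
        for (topic, payload) in &client.sent {
            println!("{} <- {}", topic, payload);
        }
    }
}
```

Note that this layout alone says nothing about how the library is linked: with the default crate type, the library code is copied into every binary.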

I've had a go at implementing this, following a structure along the lines of the Stack Overflow question "Package with both a library and a binary?". I've taken things to "the extreme", putting virtually all code in the library; each of the binaries then only imports and calls a single function from the library. I've also applied the main points from johnthagen/min-sized-rust on GitHub ("How to minimize Rust binary size") to reduce binary size (up to and including "Abort on Panic").

The challenge I have is this: what I observe is that the binaries are still 400-500KB each - not much less than the size before I moved the logic into the library. The library is just ~70KB. This doesn't make a lot of sense to me, but maybe I'm missing something? Maybe there's a sort of lower bound on the binary size that I'm just not going to get below (at least for code based on std)? Happy to share my code if helpful.

If I can't get the binaries to be below ~100KB each via this "shared library" approach, then I'm probably going to need to tackle this a different way - e.g. put everything into a single multi-threaded executable. Any pointers?

You need to figure out where bytes are going. 400 KB is definitely not a lower bound for Rust binaries. I recommend trying cargo bloat.

Thanks for the tip @sanxiyn! I tried cargo bloat and it revealed an interesting result:

$ cargo bloat --release -n 10 --bin pub1
    Finished release [optimized] target(s) in 0.03s
    Analyzing target/release/pub1

 File  .text     Size     Crate Name
61.0% 101.0% 527.0KiB [Unknown] __mh_execute_header
 0.0%   0.0%       0B           And 0 smaller methods. Use -n N to show more.
60.3% 100.0% 521.7KiB           .text section size, the file size is 864.4KiB

Before I go into the results, I wanted to share a bit more of my context/process. I've published the code for my minimal example on GitHub at svet-b/rust-mqtt-example. As I described, I have a lib.rs and two binaries (pub1.rs and pub2.rs), each of which contains nothing more than a call to a function in lib.rs. As a "control" I've also added a hello.rs binary that just prints "Hello World!". After running cargo +nightly build --release the binary sizes are:

    50752 Jan 31 15:23 libmqttexample.rlib
   885144 Jan 31 15:22 pub1
   885144 Jan 31 15:22 pub2
   279560 Jan 31 15:23 hello

This is including the optimizations mentioned above, which can be seen in my Cargo.toml.
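As a quick sanity check on those numbers, the arithmetic can be sketched as below. The helper names `overhead_bytes` and `kib` are made up for this illustration; the byte counts are copied from the listing above, and the conversion assumes 1 KiB = 1024 bytes.

```rust
/// Bytes a binary adds on top of a baseline binary built with the same settings.
fn overhead_bytes(binary: u64, baseline: u64) -> u64 {
    binary.saturating_sub(baseline)
}

/// Convert bytes to KiB (1 KiB = 1024 bytes).
fn kib(bytes: u64) -> f64 {
    bytes as f64 / 1024.0
}

fn main() {
    // Sizes in bytes, copied from the listing above.
    let rlib: u64 = 50_752;
    let pub1: u64 = 885_144;
    let hello: u64 = 279_560;

    let extra = overhead_bytes(pub1, hello);
    println!("hello baseline: {:.1} KiB", kib(hello));
    println!("pub1 total:     {:.1} KiB", kib(pub1));
    println!("pub1 - hello:   {} bytes ({:.1} KiB)", extra, kib(extra));
    println!("rlib on disk:   {:.1} KiB", kib(rlib));
}
```

The difference between pub1 and hello is 605,584 bytes, which is far more than the .rlib itself. That suggests most of the extra size comes from dependencies that end up in each executable rather than from the code in lib.rs.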

These file sizes are actually larger than the ones I mentioned originally. That's because I switched from the paho-mqtt library to the Rust-native rumqttc. AFAICT the former is just a Rust wrapper for the paho-mqtt C library, so I wanted to exclude this as a potential source of weirdness.

Finally, I also tried building with RUSTFLAGS='-C prefer-dynamic'. This required me to turn off LTO, and actually resulted in even larger binaries. So it didn't give me what I was hoping for, namely smaller binaries in exchange for a larger library.

Now, back to the cargo bloat output. It seems I just have a whopping big __mh_execute_header, followed by a whopping big .text. I'm not totally clear on what these are or what they do. I did inspect the binaries with xxd, and indeed they contain a variety of function names and other symbols. I've confirmed that the files have been stripped.

Any ideas what's going on? Specifically, why the binaries are so large when the library is not? By the way, for now I'm building these on an ARM64 (Mac M1) target - not sure if that matters.

libmqttexample.rlib is a static library, so all code in it will be linked directly into the executables.

cargo bloat shows only a single __mh_execute_header entry because you are stripping the symbols from the executables. This makes it impossible for cargo bloat to attribute code to specific functions. You can use strip = "debuginfo" to only strip debuginfo and not symbols.

Thanks for the pointer @bjorn3 - that's an interesting side effect that I hadn't considered. Unfortunately setting strip = "debuginfo" (or turning it off altogether) doesn't appear to result in significantly different behavior. I.e. it appears that something else is still causing static linking of the library.

I played around with a few different settings; as a specific example, in my current Cargo.toml I've set strip = "debuginfo" and turned off the other size optimizations just to make sure there isn't interference. Building with RUSTFLAGS='-C prefer-dynamic' cargo build --release still yields what appear to be statically linked binaries that are over 1MB in size. cargo bloat output is

$ RUSTFLAGS='-C prefer-dynamic' cargo bloat --release -n 10 --bin pub2
    Finished release [optimized] target(s) in 0.02s
    Analyzing target/release/pub2

 File  .text     Size     Crate Name
 2.2%   4.3%  33.0KiB [Unknown] _ge_double_scalarmult_vartime
 1.1%   2.1%  16.2KiB      ring _GFp_x25519_ge_frombytes_vartime
 0.9%   1.8%  13.7KiB [Unknown] _fe_loose_invert
 0.5%   0.9%   6.8KiB   rumqttc <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
 0.4%   0.8%   6.4KiB    rustls rustls::client::hs::emit_client_hello_for_retry
 0.4%   0.8%   5.8KiB      ring _GFp_x25519_ge_scalarmult_base
 0.4%   0.7%   5.6KiB      ring _GFp_x25519_scalar_mult_generic_masked
 0.4%   0.7%   5.5KiB    rustls <rustls::client::tls13::ExpectFinished as rustls::client::hs::State>::handle
 0.4%   0.7%   5.4KiB   rumqttc <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
 0.3%   0.7%   5.1KiB    rustls <rustls::msgs::enums::CipherSuite as core::fmt::Debug>::fmt
44.3%  87.0% 660.4KiB           And 2176 smaller methods. Use -n N to show more.
51.0% 100.0% 759.4KiB           .text section size, the file size is 1.5MiB

I now understand that the .rlib is a static library, and therefore not really relevant for this endeavor. Reading the Linkage chapter of The Rust Reference makes me think that I potentially need to split the library into a separate crate that has crate_type = "dylib", and link the binaries against that "manually". Is that what's required? Is there potentially a simpler way? I've skimmed a few threads on this forum (and elsewhere) about dynamic libraries, but haven't yet found a straightforward answer.

Adding

[lib]
crate-type = ["dylib"]

to your Cargo.toml should be enough to compile lib.rs as a dynamic library and then dynamically link against it. I think cargo will default to -Cprefer-dynamic if any crate is only available as dylib, which means you will have to ship libstd-*.so from your rust installation together with your programs. You can find it in the directory returned by rustc --print target-libdir.

Thanks @bjorn3 - that was basically it.

Adding

[lib]
crate-type = ["dylib"]

to my Cargo.toml, turning off LTO and panic = "abort" (neither of which played nicely with that configuration), and compiling with RUSTFLAGS='-C prefer-dynamic' finally created a 2.7MB libmqttexample.dylib, as well as two ~30KB binaries.

I'm sure there are further opportunities for optimization, but the basic idea is clear now. By the way, just setting crate-type = ["dylib"] and trying to build without explicitly setting RUSTFLAGS='-C prefer-dynamic' led to a bunch of errors along the lines of

error: cannot satisfy dependencies so `std` only shows up once

Following the advice in the Stack Overflow question "error: cannot satisfy dependencies so `std` only shows up once", I instead set crate-type = ["rlib", "dylib"]. While that makes the errors go away, it does not actually produce dynamically linked binaries.

Wanted to follow up quickly, as I had the chance to do a bit more digging with respect to actually minimizing overall size of the application. I've tried to optimize things as much as possible, but unfortunately the shared library approach is still somewhat problematic in terms of size.

Even with all (to my knowledge) available optimizations, I can't get my shared library to be smaller than about 2.6MB. [This is for paho-mqtt which is itself smaller than rumqttc that I mentioned in some previous posts.] Worth noting that the executables do not appear to have a dependency on libstd-*.so, so I assume that my libmqttexample.dylib incorporates the relevant parts of libstd.

On the other hand, building static executables with the same set of optimizations gets them to ~500KB, or 300-400KB when turning on more aggressive optimizations (specifically stripping) that are not available for the shared-library scenario.

What I'm picking up is that binary size optimization isn't really as much of a thing when it comes to building dynamically linked libraries. Which is OK - I understand it's kind of a different usage scenario. For my use case it seems that the most appropriate solution would be a single, statically-linked, multi-threaded executable.
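To put rough numbers on that trade-off, here is a small sketch that compares total on-disk size for the two layouts. It uses the rounded figures from this thread (a ~2.6MB dylib, ~30KB thin binaries, ~500KB static binaries), treats 1MB as 1000KB, and assumes the thin binaries stay around 30KB; the function names are made up for this illustration.

```rust
/// Total on-disk size (KB) of one shared library plus `n` thin binaries.
fn shared_total_kb(dylib_kb: u64, thin_bin_kb: u64, n: u64) -> u64 {
    dylib_kb + thin_bin_kb * n
}

/// Total on-disk size (KB) of `n` independent statically linked binaries.
fn static_total_kb(static_bin_kb: u64, n: u64) -> u64 {
    static_bin_kb * n
}

/// Smallest `n` for which the shared layout is strictly smaller, or `None`
/// if it never wins within `max_n` processes.
fn break_even(dylib_kb: u64, thin_bin_kb: u64, static_bin_kb: u64, max_n: u64) -> Option<u64> {
    (1..=max_n).find(|&n| {
        shared_total_kb(dylib_kb, thin_bin_kb, n) < static_total_kb(static_bin_kb, n)
    })
}

fn main() {
    // Rounded figures from this thread; 1 MB is treated as 1000 KB.
    let (dylib_kb, thin_kb, static_kb) = (2_600, 30, 500);
    for n in 1..=8 {
        println!(
            "{} processes: shared {} KB, static {} KB",
            n,
            shared_total_kb(dylib_kb, thin_kb, n),
            static_total_kb(static_kb, n)
        );
    }
    println!("break-even: {:?}", break_even(dylib_kb, thin_kb, static_kb, 100));
}
```

With these figures the shared layout only becomes smaller from six processes onward (2780KB versus 3000KB), and from ten if the static binaries are closer to 300KB. A single executable costs one static binary no matter how many tasks it runs.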

Finally, if anyone is curious to take a look, this is the current state of my repo and configuration, and I'm building it all with

cargo +nightly build -Z build-std=std,panic_abort --release --target=aarch64-apple-darwin

Indeed. In fact the dylib contains the entirety of libstd and all other dependencies of the library, including those parts you don't use, as someone else who needs those parts may link against libmqttexample.dylib and expect them to be provided.

Add this to your Cargo.toml:

[profile.release]
opt-level = "z"

This setting tells the compiler to optimize for binary size.

One tangential (?) comment — you keep saying “multi threaded binary”. That is certainly an option, but a possibly even simpler solution is a binary that, given several different flags or command names runs a different operation.

Having done that, just run that binary 3 times:

mybin thing1
mybin thing2
mybin thing3

If you like, you might also look into how multi-purpose binaries make use of hard links to change behavior. (This trick might be excessively “clever”, and seems less popular than it once was…)
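A minimal sketch of such a multi-call binary, assuming a Unix-like system: it first checks the name it was invoked as (which is what makes the hard-link trick work) and otherwise falls back to the first argument. The command names come from the example above; `select_command`, `is_known`, and `run` are hypothetical names, and the per-command bodies are placeholders for the real logic.

```rust
use std::env;
use std::path::Path;

/// True for the operations this binary knows how to run.
fn is_known(name: &str) -> bool {
    matches!(name, "thing1" | "thing2" | "thing3")
}

/// Pick the operation name: prefer the name the binary was invoked as
/// (so a hard link or symlink named `thing1` works), and fall back to the
/// first command-line argument (`mybin thing1`).
fn select_command(args: &[String]) -> Option<&str> {
    let invoked_as = args
        .first()
        .and_then(|arg0| Path::new(arg0).file_name())
        .and_then(|name| name.to_str());

    match invoked_as {
        Some(name) if is_known(name) => Some(name),
        _ => args.get(1).map(String::as_str).filter(|name| is_known(name)),
    }
}

/// Placeholder per-process logic; each arm would do its own MQTT work.
fn run(command: &str) -> String {
    match command {
        "thing1" => "running thing1".to_string(),
        "thing2" => "running thing2".to_string(),
        "thing3" => "running thing3".to_string(),
        other => format!("unknown command: {}", other),
    }
}

fn main() {
    let args: Vec<String> = env::args().collect();
    match select_command(&args) {
        Some(command) => println!("{}", run(command)),
        None => {
            eprintln!("usage: mybin <thing1|thing2|thing3>");
            std::process::exit(2);
        }
    }
}
```

With this in place, `mybin thing1` and a hard link created with `ln mybin thing1` and run as `./thing1` should both select the same operation, while only one copy of the code sits on disk.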

Hm yeah that makes a lot of sense. I assume there isn't really a way to "take out" the bits that aren't used by the executables I'm building against this dylib? I'm obviously not looking to distribute this more broadly.

Thanks for the tip @DanyalMh - that's actually been in there from the start (link to Cargo.toml). Interestingly, I just experimented with commenting out that line, and it reduced the size of the dylib from 2.6MB to 2.0MB! Meanwhile executable sizes remained as they were. Makes me think that it might be worth experimenting a bit with the other options - which seem to be primarily geared towards shrinking executables rather than libraries.

Thanks @uberjay that's actually exactly the kind of input/ideas I was looking for. Not sure why that didn't occur to me, but it could be a really nice solution. Indeed, having these be independent processes managed by the kernel appeals to me more than them being threads within a single executable.

I suppose one tradeoff to consider in this scenario is that each process would be independently loaded in RAM - so would need to make sure that doesn't create undue pressure. If I'm not mistaken, having multiple processes linked against a shared library would load a single shared copy of the library in RAM, so would be more efficient from that perspective. I'm not super-familiar with the ins and outs of this so may need to prod it further.

This is not necessarily true: The operating system may choose to load the binary a single time and map the single physical address region into each process's virtual address space.

I don't know how common this is in practice; you may be able to encourage this behavior by starting a single process and then using fork() to create the other processes.

Not natively. As a hack you could get the linker args used by rustc for the dylib and pass -Csave-temps to prevent rustc removing linker inputs. Then you can invoke the linker once and compile your binaries. Next you can get the list of symbols needed by your binaries and then link the dylib again with the list of exported symbols changed to exclude the unused symbols. The --gc-sections passed by rustc by default will then remove all unused functions. You could also omit the metadata object to save some extra space. This object file is only necessary for rustc to compile against the dylib, it isn't necessary at runtime. This is a big hack though and not necessarily easy to do.

I believe LLVM is a bit too eager to outline certain intrinsics, which can cost binary size due to needing extra code to set up the call frame.

Thanks for the pointers - these are really interesting directions to look into! Both with respect to memory management (my initial picture of how this works was clearly very naive!), as well as "library slimming" (though I'm averse to really going off the rails here - I do need something that's reasonably likely to work reliably).

Yep, the opt-level documentation itself also alludes to this:

It is recommended to experiment with different levels to find the right balance for your project. There may be surprising results, such as level 3 being slower than 2, or the "s" and "z" levels not being necessarily smaller.

After trying all the levels, 2 actually yielded the smallest library size (though only a bit smaller than 3).

I suspect that this depends greatly on whether you're on something with modification/execution protection on pages.

They probably had to be loaded in different places back when executables modifying their own code was common, but the modern "no, you just can't do that -- make new pages if you must" should let every instance of the executable share memory.

Of course, if you're on a chip too small to have such protection...
