LTO option not applied when linked with a library?

Hi Rustaceans!

I have refactored my learning project. It was originally two .rs files compiled and linked together. After refactoring, I have one library and several binaries. Everything is cleaner now, but I noticed much lower performance in my binaries. After digging, I traced it to this setting:

[profile.release]
lto = true

This flag previously gave me an almost 20% speedup, but now, with a library, I see no performance change with or without lto = true.
Here is a copy of my Cargo.toml file in case it helps:

[package]
name = "bigint"
version = "0.1.0"
authors = ["Bruno"]
edition = "2018"

[profile.release]
lto = true

[lib]
name = "bigint"
path = "src/libsrc/bigint.rs"

[[bin]]
name = "main"
path = "src/main.rs"

[[bin]]
name = "pollard_rho"
path = "src/bin/pollard_rho.rs"

[[bin]]
name = "pollard_rho_brent"
path = "src/bin/pollard_rho_brent.rs"

[[bin]]
name = "pollard_rho_p_minus_1"
path = "src/bin/pollard_rho_p_minus_1.rs"

[dependencies]

Is there a logical reason for this, such as 'LTO can't be applied to such libraries'?
Has anyone already encountered this behavior?

Regards

Libraries aren't linked when they are built. Linking is performed when building the final binary.

Yes, what I mean is: does the problem arise because I am no longer linking two object files together, but object files AND a library?

[I have modified the thread title to make it more explicit.]

Rust uses static linking, so libraries are just a bunch of un-linked object files.

[OFF TOPIC] I can confirm this: I looked at the .rlib, which is an ar archive containing several .o files. The Linux file command identifies these .o files as

'bigint-2e8e0b1cf6ce7032.bigint.e7igh6ql-cgu.7.rcgu.o: LLVM IR bitcode'

[/OFF TOPIC]
I also ran some tests: by including (duplicating) the library sources in the executable's code, I got the original performance back, and performance responded consistently to the lto flag.

I have the feeling that I'm missing something obvious!

Adding codegen-units=1 to the [profile.release] section, and #[inline] attributes on key public library functions, can sometimes help optimization a lot when code is split into multiple crates.

LTO would ideally make these things unnecessary, but in practice it seems to be less than perfectly effective.
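
As a minimal sketch of the #[inline] suggestion: the function names add_with_carry and add_limbs below are hypothetical and not taken from the bigint library in this thread. Marking a small, hot public function #[inline] makes its body available for inlining into other crates, independently of LTO. The sketch assumes codegen-units = 1 has also been added to [profile.release] in Cargo.toml.

```rust
// Hypothetical library code (for example in src/libsrc/bigint.rs).
// All names here are illustrative assumptions, not the poster's actual API.

/// Adds two limbs plus an incoming carry, returning (sum, carry_out).
/// `#[inline]` makes the body available for inlining in downstream crates.
#[inline]
pub fn add_with_carry(a: u64, b: u64, carry: bool) -> (u64, bool) {
    let (s1, c1) = a.overflowing_add(b);
    let (s2, c2) = s1.overflowing_add(carry as u64);
    (s2, c1 || c2)
}

/// Adds `rhs` into `lhs` limb by limb (least significant limb first) and
/// returns the final carry. If the slices differ in length, only the common
/// prefix is processed.
pub fn add_limbs(lhs: &mut [u64], rhs: &[u64]) -> bool {
    let mut carry = false;
    for (l, r) in lhs.iter_mut().zip(rhs) {
        let (sum, c) = add_with_carry(*l, *r, carry);
        *l = sum;
        carry = c;
    }
    carry
}

fn main() {
    let mut a = [u64::MAX, 1];
    let b = [1u64, 2];
    let carry = add_limbs(&mut a, &b);
    println!("{:?} carry={}", a, carry);
}
```

Whether this helps in a given project has to be measured; the attribute is a hint to the optimizer, not a guarantee.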

It greatly improves the situation, but it doesn't reach the 'library-less' efficiency (it is still 9% slower). I should also mention that the library and the binaries are in the same package. It looks like the compiler is better at discovering optimization opportunities in a single file than across several [I read somewhere that mod xxx is like including the module's source code in the current source]. Could that be true?
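
To illustrate the mod question above: a module declared with mod xxx; belongs to the same crate as the file that declares it, and the crate (not the individual file) is what rustc compiles as a unit. A library target, by contrast, is a separate crate from each binary. A minimal sketch, using the hypothetical module name limbs and written inline so that it fits in one file:

```rust
// An inline module. Declaring `mod limbs;` and putting the same items in a
// separate limbs.rs file would make them part of this crate in the same way.
// The module and function names are illustrative assumptions.
mod limbs {
    /// Doubles a limb, wrapping on overflow.
    pub fn double(x: u64) -> u64 {
        x.wrapping_mul(2)
    }
}

fn main() {
    // This call stays inside one crate, so no cross-crate inlining is needed.
    println!("{}", limbs::double(21));
}
```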

This loss of performance was not acceptable from my point of view, so I dug deeper. Here is what I found; I am writing it up here since it may be helpful for others.

When building the library, cargo passes -C linker-plugin-lto to rustc. When building the executable, cargo passes -C lto. This mix leads to poor performance in the generated executable.

By manually compiling both the library and the executable with the -C lto option, I got the original performance back (I used -C opt-level=3 in all build commands).

I have read up on what this option does, and I think I understand it.

Now my question is: how do I force cargo to build the library with the -C lto option?

Interesting! My understanding is that -C lto on the library should make no difference, since the generated machine code for the library is not even used when compiling an LTO executable. That doesn't seem to be the case in your project, but I'm not sure why. It might be worth filing an issue on the Cargo repo, especially if you can provide a test project that demonstrates the problem.

The thing here is that cargo uses -Clinker-plugin-lto for the library. This causes the rlib to contain object files with only bitcode. This is incompatible with regular -Clto performed by rustc.


After further testing, I managed to reproduce the same result using only the command line. So what I get from cargo is consistent with what I get from the command line.

Here is the OK version and its compilation commands. It runs in 35 seconds.

rustc --crate-type=lib ./bigint.rs --edition=2018 -C opt-level=3 -Clto
rustc --crate-name pollard_rho_brent --edition=2018 pollard_rho_brent.rs --extern bigint=/home/bruno/LTO-Pb/libbigint.rlib  -C opt-level=3 -C lto
 

./pollard_rho_brent 55725847542240675628248869327
Trying to factorize 55725847542240675628248869327 (96 bits).
55725847542240675628248869327 = 108531990310739 x 513450894825493  [35.181960572s]

And here is the version compiled with -Clinker-plugin-lto in place of -C lto:

rustc --crate-type=lib ./bigint.rs --edition=2018 -C opt-level=3 -Clinker-plugin-lto
rustc --crate-name pollard_rho_brent --edition=2018 pollard_rho_brent.rs --extern bigint=/home/bruno/LTO-Pb/libbigint.rlib  -C opt-level=3 -C lto

./pollard_rho_brent 55725847542240675628248869327
Trying to factorize 55725847542240675628248869327 (96 bits).
55725847542240675628248869327 = 108531990310739 x 513450894825493  [50.939024561s]

This gives an execution time of about 50 s, roughly 45% slower.

Since cargo forces the -Clinker-plugin-lto option, I get the 'slow version'. I don't think it's a cargo problem. My guess is that -Clinker-plugin-lto is forced only when building a library, and that -Clto is used when the library code is simply compiled together with the other .rs files.

The documentation says this about the option:

There are two main cases how linker plugin based LTO can be used:

    compiling a Rust staticlib that is used as a C ABI dependency
    compiling a Rust binary where rustc invokes the linker

In both cases the Rust code has to be compiled with -C linker-plugin-lto and the C/C++ code with -flto or -flto=thin so that object files are emitted as LLVM bitcode.

My conclusion is that when building a library, Cargo forces -Clinker-plugin-lto, which (in my case at least) leads to much less efficient code. It is a bit of a paradox that organizing your code nicely earns you a performance penalty!

For completeness: the solution given by @mbrubeck, adding codegen-units=1 to [profile.release] together with lto=true, gives a decent result, with a 38 s (+8%) execution time.