Using LTO for g++ compiled static library used in Rust

For a scientific computing project I make extensive use of quadruple precision floats (f128) and these computations are the hot parts of the code. In Rust, there is the f128 crate that provides a wrapper around the quadmath extension of gcc to provide partially hardware accelerated quadprec computations.

The low-level operations on f128 numbers in C (__addtf3, etc.) are wrapped using the Wrapper type in f128.c of the f128 crate. Calling the wrapper functions (f128_add, etc.) incurs an overhead of roughly a factor of 1.5 to 2, as can be seen from this flamegraph: [flamegraph image]
In C/C++, this substantial loss in performance can be mitigated by compiling the C library with lto:

gcc -O3 -flto -lgfortran -lquadmath -Bstatic -c f128.c
gcc-ar crf libf128.a f128.o
g++ -O3 test.c libf128.a -flto -lquadmath -o test

where test.c is a benchmark script:

#include <quadmath.h>
#include <stdio.h>
typedef __float128 f128;

typedef union _Wrapper {
  f128 value;
  unsigned __int128 dat;
  char dat_alt[16];
} __attribute__ ((aligned (16))) Wrapper;

  Wrapper f64_to_f128(double);
  void f128_to_str(Wrapper, int, char*, const char*);
  Wrapper f128_add(Wrapper*, Wrapper*);
  Wrapper f128_sub(Wrapper*, Wrapper*);
  Wrapper f128_mul(Wrapper*, Wrapper*);
  Wrapper f128_div(Wrapper*, Wrapper*);

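int main() {
    Wrapper a = f64_to_f128(2.);
    Wrapper b = f64_to_f128(3.);
    Wrapper c = f64_to_f128(4.);
    Wrapper d = f64_to_f128(5.);
    for (long int i = 0; i < 10000000L; i++) {
        a = f128_add(&a, &b);
        a = f128_sub(&a, &c);
        a = f128_mul(&a, &d);
        a = f128_div(&a, &c);
    }

    /* A Wrapper cannot be passed to "%f"; format the __float128 payload instead. */
    char buf[64];
    quadmath_snprintf(buf, sizeof buf, "%.20Qg", a.value);
    printf("%s\n", buf);

    return 0;
}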
I am trying to achieve a similar performance boost in Rust, but I am struggling and wondering if it's even possible since Rust compiles with LLVM and we need to use g++ instead of clang for the quadmath extension.
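To make the call shape concrete, here is a minimal sketch of how the Rust side sees such a wrapper. It is an illustration only, not the f128 crate's actual definition: the `Wrapper` union below is a hypothetical mirror of the C one, `u128` stands in for the payload because Rust 1.61 has no built-in f128 type, and `wrapped_add` is a made-up stand-in that uses `#[inline(never)]` to mimic a C function the Rust optimizer cannot see into.

```rust
use std::mem::{align_of, size_of};

// Hypothetical Rust mirror of the C `Wrapper` union from test.c.
// The `value` member is omitted (no f128 type here) and `u128` stands in
// for the 16-byte payload.
#[repr(C, align(16))]
#[derive(Clone, Copy)]
union Wrapper {
    dat: u128,
    dat_alt: [u8; 16],
}

// Stand-in for an out-of-line C function such as `f128_add`.
// `#[inline(never)]` mimics a call the optimizer cannot see through;
// the integer addition is a placeholder, not quad-precision arithmetic.
#[inline(never)]
fn wrapped_add(a: &Wrapper, b: &Wrapper) -> Wrapper {
    // SAFETY: both members span all 16 bytes and any bit pattern is a valid u128.
    let (x, y) = unsafe { (a.dat, b.dat) };
    Wrapper { dat: x.wrapping_add(y) }
}

fn main() {
    println!("size = {}, align = {}", size_of::<Wrapper>(), align_of::<Wrapper>());
    let a = Wrapper { dat: 2 };
    let b = Wrapper { dat: 3 };
    let c = wrapped_add(&a, &b);
    // SAFETY: `dat` is the member that was just written.
    println!("2 + 3 = {}", unsafe { c.dat });
}
```

Every operation passes two pointers, makes a real call, and returns 16 bytes by value. Without LTO across the C/Rust boundary, each f128_add, f128_sub, etc. in the hot loop pays that cost.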

I tried adding .flag("-flto") to the f128 crate build script, but that causes linking errors (presumably because the LLVM linker cannot read g++ LTO info). Adding .flag("-ffat-lto-objects") does restore compilation but only because LLVM can now opt to not use LTO.

Does anyone know a solution?

Rust is already capable of cross-language LTO. Adding

lto = "fat"

to your Cargo.toml should enable global lto. Rust also uses the system's default linker rather than llvm's on most targets, so assuming you're on Linux you are likely already using GCC's linker rather than lld.

It does not seem to work in this case (I already had lto="fat" in my Cargo.toml, and lto="thin" doesn't work either). If I change the build script of f128 to:

        .flag("-flto") // added!

I get:

error: linking with `cc` failed: exit status: 1
  = note: "cc" "-m64" "/home/ben/Sync/Research/f128perf/target/release/deps/f128perf-79fb3239f41abefe.f128perf.f907fced-cgu.0.rcgu.o" "-Wl,--as-needed" "-L" "/home/ben/Sync/Research/f128perf/target/release/deps" "-L" "/home/ben/Sync/Research/f128perf/target/release/build/f128-5b1444e14e56fd1e/out" "-L" "/home/ben/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib" "-Wl,-Bstatic" "/tmp/rustcN5EK4U/libf128-32e2970e2534be0b.rlib" "-Wl,--start-group" "-Wl,--end-group" "/home/ben/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/libcompiler_builtins-377835cfab8dae0d.rlib" "-Wl,-Bdynamic" "-lquadmath" "-lgcc_s" "-lutil" "-lrt" "-lpthread" "-lm" "-ldl" "-lc" "-Wl,--eh-frame-hdr" "-Wl,-znoexecstack" "-L" "/home/ben/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib" "-o" "/home/ben/Sync/Research/f128perf/target/release/deps/f128perf-79fb3239f41abefe" "-Wl,--gc-sections" "-pie" "-Wl,-zrelro,-znow" "-Wl,-O1" "-nodefaultlibs"
  = note: /usr/bin/ld: /home/ben/Sync/Research/f128perf/target/release/deps/f128perf-79fb3239f41abefe.f128perf.f907fced-cgu.0.rcgu.o: in function `f128perf::main':
          f128perf.f907fced-cgu.0:(.text._ZN8f128perf4main17ha2ccb8977479423bE+0x14): undefined reference to `f64_to_f128'
          /usr/bin/ld: f128perf.f907fced-cgu.0:(.text._ZN8f128perf4main17ha2ccb8977479423bE+0x75): undefined reference to `f128_add'
          /usr/bin/ld: f128perf.f907fced-cgu.0:(.text._ZN8f128perf4main17ha2ccb8977479423bE+0x95): undefined reference to `f128_sub'
          /usr/bin/ld: f128perf.f907fced-cgu.0:(.text._ZN8f128perf4main17ha2ccb8977479423bE+0xc7): undefined reference to `f128_mul'
          /usr/bin/ld: f128perf.f907fced-cgu.0:(.text._ZN8f128perf4main17ha2ccb8977479423bE+0xf9): undefined reference to `f128_div'
          /usr/bin/ld: /home/ben/Sync/Research/f128perf/target/release/deps/f128perf-79fb3239f41abefe.f128perf.f907fced-cgu.0.rcgu.o: in function `<f128::f128_t::f128 as core::fmt::Display>::fmt':
          f128perf.f907fced-cgu.0:(.text._ZN57_$LT$f128..f128_t..f128$u20$as$u20$core..fmt..Display$GT$3fmt17he3a0718c04a7fa61E+0xe2): undefined reference to `qtostr'
          collect2: error: ld returned 1 exit status

If I add flag("-ffat-lto-objects"), the code compiles, but the overhead is still there.

The problem is not cross-language, but cross-toolchain. OP already mentioned that fat LTO doesn't have the desired effect.
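One way to check which toolchain's LTO data an object file carries is to look at its bytes. The sketch below is a rough diagnostic under stated assumptions, not a robust parser: it relies on raw LLVM bitcode files starting with the magic bytes `BC` 0xC0 0xDE, and on GCC storing its LTO intermediate representation in ELF sections whose names start with `.gnu.lto_`. The names `ObjectKind` and `classify` are made up for this example. A likely explanation for the undefined references above is that an object built with plain -flto is a "slim" ELF file holding only GCC's LTO sections and no machine code, so the linker finds no definition of f128_add unless GCC's LTO plugin is involved.

```rust
use std::{env, fs};

/// Rough classification of an object file's contents.
#[derive(Debug, PartialEq)]
enum ObjectKind {
    /// Raw LLVM bitcode, e.g. from `clang -flto -c`.
    LlvmBitcode,
    /// ELF object that contains GCC LTO sections (`.gnu.lto_*`).
    ElfWithGccLto,
    /// ELF object with no sign of GCC LTO sections.
    ElfPlain,
    /// Anything else, including `ar` archives and rlibs.
    Unknown,
}

/// Naive byte-substring search; `needle` must be non-empty.
fn contains(haystack: &[u8], needle: &[u8]) -> bool {
    haystack.windows(needle.len()).any(|w| w == needle)
}

fn classify(bytes: &[u8]) -> ObjectKind {
    if bytes.starts_with(b"BC\xC0\xDE") {
        ObjectKind::LlvmBitcode
    } else if bytes.starts_with(b"\x7fELF") {
        // Section names live in the ELF string table, so a plain
        // substring search is enough for a quick check.
        if contains(bytes, b".gnu.lto_") {
            ObjectKind::ElfWithGccLto
        } else {
            ObjectKind::ElfPlain
        }
    } else {
        ObjectKind::Unknown
    }
}

fn main() {
    // Usage: pass one or more object files as arguments.
    for path in env::args().skip(1) {
        match fs::read(&path) {
            Ok(bytes) => println!("{}: {:?}", path, classify(&bytes)),
            Err(e) => eprintln!("{}: {}", path, e),
        }
    }
}
```

Archives such as libf128.a are reported as Unknown, so extract the member first (for example with `ar x libf128.a`) and run the check on the resulting .o file. An object built with -ffat-lto-objects should show up as ElfWithGccLto as well, since it carries both machine code and GCC's LTO sections; the linker then just uses the machine code, which matches the observation that the overhead remains.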

I tried to get f128 to compile with clang, making sure it finds quadmath:

CPATH=/usr/lib/gcc/x86_64-pc-linux-gnu/12.2.0/include/ clang f128.c -flto=thin -c -o ./f128clang.o -O2
ar crus libf128.a f128clang.o
CPATH=/usr/lib/gcc/x86_64-pc-linux-gnu/12.2.0/include/ clang test.c libf128.a -flto=thin -O2 -o test_clang

This works for my C test script test.c, and the call overhead is removed.

Changing the f128 crate build script to:

println!(r"clang src/f128.c -flto=full -lquadmath -c -o ./f128clang.o -O2");
println!(r"ar crus libf128.a f128clang.o");

and running the small project

name = "f128perf"
version = "0.1.0"
edition = "2021"

opt-level = 2
lto = "fat"

lto = "fat"

f128 = {path="../f128"}
num-traits = "*"


use num_traits::cast::FromPrimitive;

fn main() {
    let mut a = f128::f128::from_f64(2.).unwrap();
    let b = f128::f128::from_f64(3.).unwrap();

    for _ in 0..10000000 {
        a = a + b - b;
        a *= b;
        a = a / b;
    }

    println!("a={}", a);
}

 RUSTFLAGS="-Clinker-plugin-lto -Clinker=clang -Clink-arg=-fuse-ld=lld -L ../f128/src" cargo build --release

It does compile, but it crashes at runtime (signal SIGSEGV, address boundary error):

#0  0x0000555555595c49 in f64_to_f128 ()
#1  0x00005555555646b9 in f128::f128_t::{impl#7}::from_f64 (n=2) at /home/ben/Sync/Research/f128/src/
#2  f128perf::main () at /home/ben/Sync/Research/f128perf/src/

with rustc 1.61.0 and clang version 14.0.6 on x86_64-pc-linux-gnu.

This crash doesn't go away when I try -flto=thin.
