Is there a way to optimize C FFI with Rust?

Hi folks, I recently ran a benchmark with Rust and Zig derived from ffi_overhead. The benchmark compares the overhead of the C FFI across various programming languages.

In my benchmark, I made modifications to the C code to ensure compatibility with Windows. Here's the modified C code:

#include "plus.h"
#include <windows.h>

int plus(int x, int y) { return x + y; }

int plusone(int x) { return x + 1; }

long long current_timestamp() {
    FILETIME ft;
    ULARGE_INTEGER uli;
    GetSystemTimeAsFileTime(&ft);
    uli.LowPart = ft.dwLowDateTime;
    uli.HighPart = ft.dwHighDateTime;
    long long milliseconds = uli.QuadPart / 10000; /* FILETIME is in 100-ns units -> milliseconds */
    return milliseconds;
}

I ran the benchmark using Rust (1.79.0-nightly) with the following code:

extern "C" {
    fn plusone(x: std::ffi::c_int) -> std::ffi::c_int;
    fn current_timestamp() -> std::ffi::c_longlong;
}

use std::env;

fn run(count: std::ffi::c_int) {
    unsafe {
        // start immediately
        let start = current_timestamp();

        let mut x = 0;
        while x < count {
            x = plusone(x);
        }

        let end = current_timestamp();
        let elapsed = end - start;

        println!("{}", elapsed);
    }
}

fn main() {
    let args: Vec<String> = env::args().collect();
    if args.len() == 1 {
        println!("First arg (0 - 2000000000) is required.");
        return;
    }
    let count = args[1].parse::<i32>().unwrap();
    if count <= 0 || count > 2000000000 {
        println!("Must be a positive number not exceeding 2 billion.");
        return;
    }
    run(count as std::ffi::c_int);
}

I also used Zig (0.12.0-dev.1684+ea4a07701) to run the benchmark. Here's the Zig code:

const std = @import("std");
const os = std.os;
const io = std.io;
const allocator = std.heap.c_allocator;
const c = @cImport({
    @cDefine("_NO_CRT_STDIO_INLINE", "1");
    @cInclude("stdio.h");
    @cInclude("plus/plus.h");
});

pub fn main() anyerror!void {
    const stdout_file = io.getStdOut();
    const stdout = stdout_file;

    const args = try std.process.argsAlloc(allocator);

    if (args.len == 1) {
        try stdout.writeAll("First arg (0 - 2000000000) is required.\n");
        return;
    }

    const count = try std.fmt.parseInt(i32, args[1], 10);
    if (count <= 0 or count > 2000000000) {
        try stdout.writeAll("Must be a positive number not exceeding 2 billion.\n");
        return;
    }

    run(count);
}

fn run(count: i32) void {
    const start = c.current_timestamp();

    var x: i32 = 0;
    while (x < count) {
        x = c.plusone(x);
    }

    const elapsed = c.current_timestamp() - start;
    _ = c.printf("%d %lld\n", x, elapsed);
}

I built the Rust version with these options:

[profile.release]
codegen-units    = 1
debug-assertions = false
incremental      = false
lto              = true
opt-level        = 3
overflow-checks  = false
panic            = "abort"
rpath            = false
strip            = true

Zig with

-OReleaseFast

When I ran the benchmark with a small input value (e.g. 100,000), both Rust and Zig completed the task in 0 milliseconds.

However, when I used a larger input value (e.g. 2,000,000,000), Rust took approximately 2400 milliseconds, while Zig consistently completed it in 0 milliseconds.

This leads me to wonder whether Rust can achieve the same optimization as Zig to improve its performance in similar FFI scenarios.

So, in the back corner of my memory is a dusty note labelled "Zig compiles to C?". If that is the case, it could be that Zig is just inlining all your "FFI" code. Heck, it might even be replacing the loop with the final result directly.

This is something Rust cannot do. The linker might be able to do LTO across languages, but I very much doubt it'd be able to eliminate the loop entirely.

To be clear: I don't know enough about Zig to say if this is true or not, just something that might be true.
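Just to illustrate what "replacing the loop with the final result" would mean, here's a pure-Rust sketch (no FFI involved, so take it only as an analogy): once the optimizer can see the body of plusone, it can typically inline it and collapse the counted loop entirely.

fn plusone(x: i32) -> i32 {
    x + 1
}

fn run(count: i32) -> i32 {
    let mut x = 0;
    while x < count {
        x = plusone(x);
    }
    // In a release build LLVM usually folds this loop away, so for
    // count >= 0 this effectively returns count in constant time.
    x
}

fn main() {
    println!("{}", run(2_000_000_000));
}

That's the kind of transformation a compiler can only do when it sees through the call, which is exactly what cross-language inlining or LTO would have to provide here.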

(Also, just to confirm because this happens so often: I assume you are running with cargo build --release and/or cargo run --release.)


I found that Zig can translate C code to Zig code with

zig translate-c src.c

For the given example, I got this:

pub export fn plus(arg_x: c_int, arg_y: c_int) c_int {
    var x = arg_x;
    _ = &x;
    var y = arg_y;
    _ = &y;
    return x + y;
}
pub export fn plusone(arg_x: c_int) c_int {
    var x = arg_x;
    _ = &x;
    return x + @as(c_int, 1);
}
pub const DWORD = c_ulong;
pub const struct__FILETIME = extern struct {
    dwLowDateTime: DWORD = @import("std").mem.zeroes(DWORD),
    dwHighDateTime: DWORD = @import("std").mem.zeroes(DWORD),
};
pub const FILETIME = struct__FILETIME;
const struct_unnamed_1 = extern struct {
    LowPart: DWORD = @import("std").mem.zeroes(DWORD),
    HighPart: DWORD = @import("std").mem.zeroes(DWORD),
};
const struct_unnamed_2 = extern struct {
    LowPart: DWORD = @import("std").mem.zeroes(DWORD),
    HighPart: DWORD = @import("std").mem.zeroes(DWORD),
};
pub const ULONGLONG = c_ulonglong;
pub const union__ULARGE_INTEGER = extern union {
    unnamed_0: struct_unnamed_1,
    u: struct_unnamed_2,
    QuadPart: ULONGLONG,
};
pub const ULARGE_INTEGER = union__ULARGE_INTEGER;
pub export fn current_timestamp() c_longlong {
    var ft: FILETIME = undefined;
    _ = &ft;
    var uli: ULARGE_INTEGER = undefined;
    _ = &uli;
    GetSystemTimeAsFileTime(&ft);
    uli.unnamed_0.LowPart = ft.dwLowDateTime;
    uli.unnamed_0.HighPart = ft.dwHighDateTime;
    var milliseconds: c_longlong = @as(c_longlong, @bitCast(uli.QuadPart / @as(ULONGLONG, @bitCast(@as(c_longlong, @as(c_int, 10000))))));
    _ = &milliseconds;
    return milliseconds;
}

So I guess the Zig compiler internally does something similar, generating an AST or some sort of IR instead of linking against a prebuilt library, and then optimizes it.

Given that GetSystemTimeAsFileTime is a foreign function, you could just count the number of calls each language can make to it within a given time range. I suspect you'll find the difference to be vanishingly small.

If you decide to try that, be sure to use the windows-sys crate instead of the windows crate. With the latter, Microsoft is trying to build a Rusty layer, which may add some overhead.
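For example, here's a rough sketch of counting calls per second through windows-sys (assuming the Win32_Foundation and Win32_System_SystemInformation features are enabled in Cargo.toml; double-check the module paths against the crate docs):

use windows_sys::Win32::Foundation::FILETIME;
use windows_sys::Win32::System::SystemInformation::GetSystemTimeAsFileTime;

// Same conversion as the C helper: FILETIME counts 100-ns intervals,
// so dividing by 10_000 gives milliseconds.
fn current_timestamp_ms() -> i64 {
    let mut ft = FILETIME { dwLowDateTime: 0, dwHighDateTime: 0 };
    unsafe { GetSystemTimeAsFileTime(&mut ft) };
    let ticks = ((ft.dwHighDateTime as u64) << 32) | ft.dwLowDateTime as u64;
    (ticks / 10_000) as i64
}

fn main() {
    // Count how many timestamp calls fit into one second of wall time.
    let start = current_timestamp_ms();
    let mut calls: u64 = 0;
    while current_timestamp_ms() - start < 1_000 {
        calls += 1;
    }
    println!("~{} calls per second", calls);
}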

In theory it's all easily inlined metaprogramming, but better to be sure for benchmarking, yeah.

The most likely culprit is that it constructs an error type when the result is a failure, which may call back into the Windows FFI to fetch the error description, and I don't know whether the optimizer can throw that away if you never use it.

(To clarify, it's overwhelmingly automated based on the windows API metadata project, there's not any major effort to create safe, Rusty wrappers in the windows crate itself)

However, when I used a larger input value (e.g. 2,000,000,000), Rust took approximately 2400 milliseconds, while Zig consistently completed it in 0 milliseconds.

Given a clock resolution of 1 millisecond (implied by the integer timestamps) and assuming an average 2.5 GHz processor, Zig would have to be making on the order of a thousand function calls per clock cycle, which is nonsensical.
On the other hand, Rust takes approximately 3 cycles per call (2,000,000,000 calls in ~2400 ms at 2.5 GHz is about 6 billion cycles). Not sure that can be substantially improved.

Anyway, make sure you are actually measuring what you think you are measuring.


As I commented above, I guess Zig first translated the C into Zig and then optimized the loop down to just outputting the final value, which Rust can't do here, so Zig runs faster than Rust; this benchmark doesn't really say much about the CPU.

Looking at the linked repository, the build.zig file contains a call to addCSourceFiles. I'm not that familiar with Zig's build system, but that sounds suspiciously like static linking, which allows LTO. Meanwhile, most other languages are dynamically linked to the .so file, which allows no such optimization. Static and dynamic libraries have different use cases, so the two benchmarks are IMO incomparable. To make them comparable, the Rust benchmark (and the ones for the other languages too) should statically link the C library, which should allow for such optimizations.

Edit: actually, just linking a static library may not be enough for LTO to work.
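On the Rust side, statically linking the C code usually means compiling it as part of the build, e.g. with a build.rs using the cc crate. A sketch, assuming plus.c sits in the crate root and cc is listed under [build-dependencies]:

// build.rs: compile the C benchmark code into the Rust binary
// instead of calling into a separately built .dll/.so.
fn main() {
    cc::Build::new()
        .file("plus.c")
        .opt_level(3)
        .compile("plus");
    println!("cargo:rerun-if-changed=plus.c");
}

As the edit says, this alone doesn't give you LTO across the language boundary, but it is the prerequisite for it.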


For cross-language LTO to work, you have to enable LTO on the C compiler side, make sure the C side is compiled with a Clang version matching the LLVM version rustc uses, and pass -Clinker-plugin-lto on the Rust side.
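Roughly, and building on the cc-based build.rs above, that could look something like this (the exact flags depend on your Clang/LLVM versions, so treat it as a sketch):

// build.rs: have the C side emit LLVM bitcode so rustc's LTO can see it.
fn main() {
    cc::Build::new()
        .compiler("clang")   // should match the LLVM major version rustc uses
        .flag("-flto=thin")  // enable thin LTO on the C side
        .file("plus.c")
        .compile("plus");
}

// Then build the Rust side with linker-plugin LTO enabled, e.g.:
//   RUSTFLAGS="-Clinker-plugin-lto -Clinker=clang -Clink-arg=-fuse-ld=lld" cargo build --release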

