Is there a way to optimize C FFI with Rust?

Hi folks, I recently ran a benchmark with Rust and Zig derived from ffi_overhead. The benchmark compares the overhead of the C FFI across various programming languages.

In my benchmark, I made modifications to the C code to ensure compatibility with Windows. Here's the modified C code:

#include "plus.h"
#include <windows.h>

int plus(int x, int y) { return x + y; }

int plusone(int x) { return x + 1; }

long long current_timestamp() {
    FILETIME ft;
    ULARGE_INTEGER uli;
    GetSystemTimeAsFileTime(&ft);
    uli.LowPart = ft.dwLowDateTime;
    uli.HighPart = ft.dwHighDateTime;
    long long milliseconds = uli.QuadPart / 10000; /* FILETIME is in 100-ns units -> milliseconds */
    return milliseconds;
}

I ran the benchmark using Rust (1.79.0-nightly) with the following code:

extern "C" {
    fn plusone(x: std::ffi::c_int) -> std::ffi::c_int;
    fn current_timestamp() -> std::ffi::c_longlong;
}

use std::env;

fn run(count: std::ffi::c_int) {
    unsafe {
        // start immediately
        let start = current_timestamp();

        let mut x = 0;
        while x < count {
            x = plusone(x);
        }

        let end = current_timestamp();
        let elapsed = end - start;

        println!("{}", elapsed);
    }
}

fn main() {
    let args: Vec<String> = env::args().collect();
    if args.len() == 1 {
        println!("First arg (0 - 2000000000) is required.");
        return;
    }
    let count = args[1].parse::<i32>().unwrap();
    if count <= 0 || count > 2000000000 {
        println!("Must be a positive number not exceeding 2 billion.");
        return;
    }
    run(count as std::ffi::c_int);
}

I also used Zig (0.12.0-dev.1684+ea4a07701) to run the benchmark. Here's the Zig code:

const std = @import("std");
const os = std.os;
const io = std.io;
const allocator = std.heap.c_allocator;
const c = @cImport({
    @cDefine("_NO_CRT_STDIO_INLINE", "1");
    @cInclude("stdio.h");
    @cInclude("plus/plus.h");
});

pub fn main() anyerror!void {
    const stdout_file = io.getStdOut();
    const stdout = stdout_file;

    const args = try std.process.argsAlloc(allocator);

    if (args.len == 1) {
        try stdout.writeAll("First arg (0 - 2000000000) is required.\n");
        return;
    }

    const count = try std.fmt.parseInt(i32, args[1], 10);
    if (count <= 0 or count > 2000000000) {
        try stdout.writeAll("Must be a positive number not exceeding 2 billion.\n");
        return;
    }

    run(count);
}

fn run(count: i32) void {
    const start = c.current_timestamp();

    var x: i32 = 0;
    while (x < count) {
        x = c.plusone(x);
    }

    const elapsed = c.current_timestamp() - start;
    _ = c.printf("%d %lld\n", x, elapsed);
}

I built the Rust version with these options:

[profile.release]
codegen-units    = 1
debug-assertions = false
incremental      = false
lto              = true
opt-level        = 3
overflow-checks  = false
panic            = "abort"
rpath            = false
strip            = true

Zig with

-OReleaseFast

When I ran the benchmark with a small input value (e.g. 100,000), both Rust and Zig completed the task in 0 milliseconds.

However, when I used a larger input value (e.g. 2,000,000,000), Rust took approximately 2400 milliseconds, while Zig consistently completed it in 0 milliseconds.

This leads me to wonder whether Rust can achieve the same optimization as Zig to improve its performance in similar FFI scenarios.

So, in the back corner of my memory is a dusty note labelled "Zig compiles to C?". If that is the case, it could be that Zig is just inlining all your "FFI" code. Heck, it might even be replacing the loop with the final result directly.

This is something Rust cannot do. The linker might be able to do LTO across languages, but I very much doubt it'd be able to eliminate the loop entirely.

To be clear: I don't know enough about Zig to say if this is true or not, just something that might be true.
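Just to illustrate what "replacing the loop with the final result" would mean, here's a pure-Rust sketch (no FFI involved, so take it only as an analogy): once the optimizer can see the body of plusone, it can typically inline it and collapse the counted loop entirely.

fn plusone(x: i32) -> i32 {
    x + 1
}

fn run(count: i32) -> i32 {
    let mut x = 0;
    while x < count {
        x = plusone(x);
    }
    // In a release build LLVM usually folds this loop away, so for
    // count >= 0 this effectively returns count in constant time.
    x
}

fn main() {
    println!("{}", run(2_000_000_000));
}

That's the kind of transformation a compiler can only do when it sees through the call, which is exactly what cross-language inlining or LTO would have to provide here.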

(Also, just to confirm because this happens so often: I assume you are running with cargo build --release and/or cargo run --release.)


I found that Zig can translate C code to Zig code with

zig translate-c src.c

For the given example, I got this:

pub export fn plus(arg_x: c_int, arg_y: c_int) c_int {
    var x = arg_x;
    _ = &x;
    var y = arg_y;
    _ = &y;
    return x + y;
}
pub export fn plusone(arg_x: c_int) c_int {
    var x = arg_x;
    _ = &x;
    return x + @as(c_int, 1);
}
pub const DWORD = c_ulong;
pub const struct__FILETIME = extern struct {
    dwLowDateTime: DWORD = @import("std").mem.zeroes(DWORD),
    dwHighDateTime: DWORD = @import("std").mem.zeroes(DWORD),
};
pub const FILETIME = struct__FILETIME;
const struct_unnamed_1 = extern struct {
    LowPart: DWORD = @import("std").mem.zeroes(DWORD),
    HighPart: DWORD = @import("std").mem.zeroes(DWORD),
};
const struct_unnamed_2 = extern struct {
    LowPart: DWORD = @import("std").mem.zeroes(DWORD),
    HighPart: DWORD = @import("std").mem.zeroes(DWORD),
};
pub const ULONGLONG = c_ulonglong;
pub const union__ULARGE_INTEGER = extern union {
    unnamed_0: struct_unnamed_1,
    u: struct_unnamed_2,
    QuadPart: ULONGLONG,
};
pub const ULARGE_INTEGER = union__ULARGE_INTEGER;
pub export fn current_timestamp() c_longlong {
    var ft: FILETIME = undefined;
    _ = &ft;
    var uli: ULARGE_INTEGER = undefined;
    _ = &uli;
    GetSystemTimeAsFileTime(&ft);
    uli.unnamed_0.LowPart = ft.dwLowDateTime;
    uli.unnamed_0.HighPart = ft.dwHighDateTime;
    var milliseconds: c_longlong = @as(c_longlong, @bitCast(uli.QuadPart / @as(ULONGLONG, @bitCast(@as(c_longlong, @as(c_int, 10000))))));
    _ = &milliseconds;
    return milliseconds;
}

So I guess the Zig compiler internally does something similar, generating an AST or some sort of IR instead of linking against a prebuilt library, and then optimizes it.

Given that GetSystemTimeAsFileTime is a foreign function, you could just count the number of calls each language can make to it within a given time range. I suspect you'll find the difference to be vanishingly small.

If you decide to try that, be sure to use the windows-sys crate instead of the windows crate. With the latter, Microsoft is trying to build a Rusty layer, which may add some overhead.
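For example, here's a rough sketch of counting calls per second through windows-sys (assuming the Win32_Foundation and Win32_System_SystemInformation features are enabled in Cargo.toml; double-check the module paths against the crate docs):

use windows_sys::Win32::Foundation::FILETIME;
use windows_sys::Win32::System::SystemInformation::GetSystemTimeAsFileTime;

// Same conversion as the C helper: FILETIME counts 100-ns intervals,
// so dividing by 10_000 gives milliseconds.
fn current_timestamp_ms() -> i64 {
    let mut ft = FILETIME { dwLowDateTime: 0, dwHighDateTime: 0 };
    unsafe { GetSystemTimeAsFileTime(&mut ft) };
    let ticks = ((ft.dwHighDateTime as u64) << 32) | ft.dwLowDateTime as u64;
    (ticks / 10_000) as i64
}

fn main() {
    // Count how many timestamp calls fit into one second of wall time.
    let start = current_timestamp_ms();
    let mut calls: u64 = 0;
    while current_timestamp_ms() - start < 1_000 {
        calls += 1;
    }
    println!("~{} calls per second", calls);
}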

In theory it's all easily inlined metaprogramming, but better to be sure for benchmarking, yeah.

The most likely culprit is that it constructs an error type when the result is a failure, which may call back into the Windows FFI to fetch the error description, and I don't know whether the optimizer can throw that away if you never use it.

(To clarify, it's overwhelmingly automated based on the windows API metadata project, there's not any major effort to create safe, Rusty wrappers in the windows crate itself)

However, when I used a larger input value (e.g. 2,000,000,000), Rust took approximately 2400 milliseconds, while Zig consistently completed it in 0 milliseconds.

Given a clock resolution of 1 millisecond (implied by the integer timestamps) and assuming an average 2.5 GHz processor, Zig would have to be making on the order of a thousand function calls per clock cycle, which is nonsensical.
On the other hand, Rust takes approximately 3 cycles per call (2,000,000,000 calls in ~2400 ms at 2.5 GHz is about 6 billion cycles). Not sure that can be substantially improved.

Anyway, make sure you are actually measuring what you think you are measuring.


As I commented above, I guess Zig first translated the C into Zig and then optimized the loop down to just outputting the final value, which Rust can't do here, so Zig runs faster than Rust; this benchmark doesn't really say much about the CPU.

Looking at the linked repository, the build.zig file contains a call to addCSourceFiles. I'm not that familiar with Zig's build system, but that sounds suspiciously like static linking, which allows LTO. Meanwhile, most other languages are dynamically linked to the .so file, which allows no such optimization. Static and dynamic libraries have different use cases, so the two benchmarks are IMO incomparable. To make them comparable, the Rust benchmark (and the ones for the other languages too) should statically link the C library, which should allow for such optimizations.

Edit: actually, just linking a static library may not be enough for LTO to work.
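On the Rust side, statically linking the C code usually means compiling it as part of the build, e.g. with a build.rs using the cc crate. A sketch, assuming plus.c sits in the crate root and cc is listed under [build-dependencies]:

// build.rs: compile the C benchmark code into the Rust binary
// instead of calling into a separately built .dll/.so.
fn main() {
    cc::Build::new()
        .file("plus.c")
        .opt_level(3)
        .compile("plus");
    println!("cargo:rerun-if-changed=plus.c");
}

As the edit says, this alone doesn't give you LTO across the language boundary, but it is the prerequisite for it.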


For cross-language LTO to work, you have to enable LTO on the C compiler side, make sure the C side is compiled with a Clang version matching the LLVM version rustc uses, and pass -Clinker-plugin-lto on the Rust side.
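Roughly, and building on the cc-based build.rs above, that could look something like this (the exact flags depend on your Clang/LLVM versions, so treat it as a sketch):

// build.rs: have the C side emit LLVM bitcode so rustc's LTO can see it.
fn main() {
    cc::Build::new()
        .compiler("clang")   // should match the LLVM major version rustc uses
        .flag("-flto=thin")  // enable thin LTO on the C side
        .file("plus.c")
        .compile("plus");
}

// Then build the Rust side with linker-plugin LTO enabled, e.g.:
//   RUSTFLAGS="-Clinker-plugin-lto -Clinker=clang -Clink-arg=-fuse-ld=lld" cargo build --release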

