Need help with profiling/improving performance of my LD_PRELOAD library

My LD_PRELOAD library intercepts various GLIBC filesystem access calls and writes that data to a tracking file. I am trying to understand how I can reduce the overhead I am seeing.
Right now my builds are running at 3 times their normal run times.

My initial C version of the libwisktrack.so LD_PRELOAD library didn't have this much overhead; I would estimate it added 10-15% overhead.

I have a test program that invokes readlink() 1,000,000 times in a loop; run standalone, without any LD_PRELOAD, it takes 1.13 seconds:

bash-4.4$ tests/testprog64 readlink 1000000
testprog running......
Program runtime: 1.13 seconds

My total overhead with the committed code is 10 seconds, for a run time of 11.35 seconds.

Breaking things down.
Just a pass-through interception of readlink with no additional functionality:

hook! {
    unsafe fn readlink(path: *const libc::c_char, buf: *mut libc::c_char, bufsiz: libc::size_t) -> libc::ssize_t => (my_readlink,SYS_readlink, true) {
        real!(readlink)(path, buf, bufsiz)
    }
}

This costs an extra 0.11 seconds, roughly 10%:

+ env -i RUST_BACKTRACE=1 TERM=xterm-256color HOME=/users/sarvi 'LD_PRELOAD=/ws/sarvi-sjc/wisktrack/${LIB}/libwisktrack.so' PATH=/usr/share/Modules/bin:/usr/local/go/bin:/users/sarvi/.cargo/bin:/auto/binos-tools/bin:/router/bin:/usr/cisco/bin:/usr/atria/bin:/usr/bin:/usr/local/bin:/usr/local/etc:/bin:/usr/X11R6/bin:/usr/sbin:/sbin:/usr/bin:/auto/nova-env/Lab/Labpatch/bin USER=sarvi WISK_TRACE= WISK_TRACK= WISK_CONFIG= WISK_WSROOT=/ws/sarvi-sjc/wisktrack tests/testprog64 readlink 1000000
testprog running......
Program runtime: 1.24 seconds

real    0m1.330s
user    0m0.391s
sys     0m0.862s

The hook! {} macro above is defined in my fork of the redhook library:
https://github.com/sarvi/redhook/blob/master/src/ld_preload.rs

Adding the following code brings it up to 1.27 seconds. initialized() is defined here:
https://github.com/sarvi/redhook/blob/master/src/lib.rs

        if initialized() {
        }
        real!(readlink)(path, buf, bufsiz)

Adding the reporting code into the intercept

        if initialized() {
            TRACKER.reportreadlink(path);
        }
        real!(readlink)(path, buf, bufsiz)

With reportreadlink defined as follows, the run time goes to 3.49 seconds, adding 2.2 seconds of overhead, roughly 2-3x the run time.
Here self.report() is a dummy that returns immediately for now; it is the function that will eventually write the data to the dependency tracking file.

    pub unsafe fn reportreadlink(self: &Self, path: *const libc::c_char) {
        let args = (utils::ptr2str(path), );
        self.report("READLINK", &serde_json::to_string(&args).unwrap());
    }

pub unsafe fn ptr2str<'a>(ptr: *const c_char) -> &'a str {
    CStr::from_ptr(ptr).to_str().unwrap()
}

Focusing on the big costs here: serde_json::to_string(&args).unwrap() seems like an expensive operation, and all it is creating is
["/tmp/wisk_testlink"]
If self.report then writes to the file, the time goes up to 5.04 seconds; that is the overhead of writing to a file, which incurs a write syscall. I plan on buffering this so I can reduce the number of syscalls.

For now, the cost of taking a C char pointer containing a simple C string path and turning it into a JSON string ready to be written to a file seems expensive.

Is there a better, more efficient way of doing this?

I was also expecting access to TRACKER (a lazy_static instance of the Tracker class) via TRACKER.reportreadlink() to involve some mutex, but strace doesn't seem to show that happening. I am wondering if I should turn TRACKER into a thread_local! {} for 2 reasons:

  1. No contention when accessing TRACKER in a multi-threaded program.
  2. I want a thread-local buffer where I can accumulate the output generated by TRACKER.reportreadlink() and only call the write syscall once every 4K bytes or so, to keep the syscall overhead low (see the sketch after this list).
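A minimal sketch of that buffering idea, assuming a hypothetical TRACKING_FD for the already-open tracking file and a 4K flush threshold (not the actual Tracker code):

    use std::cell::RefCell;

    // Hypothetical fd; the real one would come from the Tracker's initialization.
    const TRACKING_FD: libc::c_int = 42;
    const FLUSH_THRESHOLD: usize = 4096;

    thread_local! {
        // One buffer per thread: no contention, no locking.
        static BUFFER: RefCell<Vec<u8>> = RefCell::new(Vec::with_capacity(FLUSH_THRESHOLD));
    }

    fn buffered_report(record: &[u8]) {
        BUFFER.with(|buf| {
            let mut buf = buf.borrow_mut();
            buf.extend_from_slice(record);
            // Pay for the write(2) syscall only once per ~4K of output.
            if buf.len() >= FLUSH_THRESHOLD {
                unsafe {
                    libc::write(TRACKING_FD, buf.as_ptr() as *const libc::c_void, buf.len());
                }
                buf.clear();
            }
        });
    }

One caveat with this approach: whatever is still buffered when a thread or the program exits needs an explicit flush (e.g. from an atexit handler), or those records are lost.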

The lazy_static crate uses std::sync::Once under the hood, so checking if it is initialized is just an atomic load with Ordering::Acquire.
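Illustratively, the fast path after initialization amounts to something like this (a sketch, not lazy_static's actual code):

    use std::sync::atomic::{AtomicBool, Ordering};

    static INITIALIZED: AtomicBool = AtomicBool::new(false);

    // After the one-time init has run, every later call is just a
    // single atomic load with Acquire ordering; no mutex, no syscall.
    fn initialized() -> bool {
        INITIALIZED.load(Ordering::Acquire)
    }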

To be put in a lazy_static, TRACKER's type needs to be Sync. That means it will be doing its own synchronisation internally in some form or another.

If you are just writing an array containing a string, couldn't you cut serde-json out and just use the write!() macro to generate correctly formatted JSON directly?
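For instance, a sketch of that (the match arms cover the escapes JSON requires for a path string):

    use std::fmt::Write;

    // Produce `["<path>"]` directly into a reusable String,
    // with no serde in the hot path.
    fn format_readlink_record(out: &mut String, path: &str) {
        out.clear();
        out.push_str("[\"");
        for c in path.chars() {
            match c {
                '"' => out.push_str("\\\""),
                '\\' => out.push_str("\\\\"),
                // Control characters must be \u00XX-escaped in JSON.
                c if (c as u32) < 0x20 => {
                    let _ = write!(out, "\\u{:04x}", c as u32);
                }
                c => out.push(c),
            }
        }
        out.push_str("\"]");
    }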

I would have thought the cost of the Once check would be a single load and compare for all operations after the first one. Is it more?

Tracker is my own structure/class and I have not implemented the Sync trait for it. I am not sure what the cost of the Sync trait is in a single- or multi-threaded environment; I am hoping there is no locking involved. The object and its API are written to execute in parallel and don't require locking.

  1. It looks like I have to write some API to keep the content in null-terminated C format, possibly as CStr and CString, and write directly into the write buffer/syscall.
  2. I am trying to write JSON-formatted output, which involves escaping some characters, quoting, etc. So it looks like I need to do that in C as I am writing it out to the destination buffer/syscall.

Performance profiling tooling for Rust?
Also, is there a performance profiler, like the Python profiler, that will help me visualize how the various parts of my code and library invocations are contributing to the run time?

The following seems to have addressed the last question on profiling:

RUSTFLAGS='-C force-frame-pointers=y' cargo build
env 'LD_PRELOAD=/ws/sarvi-sjc/wisktrack/${LIB}/libwisktrack.so'  perf record -g /ws/sarvi-sjc/wisktrack/tests/testprog64 readlink 1000000
perf report -g graph,0.5,caller

Rust compiles to native code, so you can use the same profiling tools that you'd use for other languages (e.g. C).

perf and flamegraph are quite popular, although you need to tell cargo to include debug symbols when compiling in release mode.

# Cargo.toml

[profile.release]
debug = true

You don't really need to touch CStr and CString.

If you are writing to a buffer (e.g. a [u8; 1024] on the stack or in a thread-local) then just set the null terminator yourself and pass the buffer pointer and length (+1 for the terminator) to libc::write().
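A minimal sketch of that, with a hypothetical fd for the already-open tracking file (JSON escaping left out for brevity; records longer than the buffer would be truncated):

    use std::io::Write;

    // Format the record into a fixed stack buffer and emit it with a
    // single write(2), no heap allocation.
    unsafe fn write_record(fd: libc::c_int, path: &str) {
        let mut buf = [0u8; 1024];
        let mut cursor = &mut buf[..];
        // `write!` on a &mut [u8] advances the slice as it fills it.
        let _ = write!(cursor, "READLINK [\"{}\"]\n", path);
        let written = 1024 - cursor.len();
        libc::write(fd, buf.as_ptr() as *const libc::c_void, written);
    }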

Why do you need to do it in C? Rust is perfectly capable of doing the formatting and invoking syscalls (see the syscall crate).

Yes, I could do that. At this point though, it makes me wonder: if I can't do the actual "work" I need to do, which is "write some tracking data in JSON format to my tracking file", using any Rust libraries, and have to resort to invoking syscalls and unsafe functions across the board, then I don't see any benefit to this rewrite of the C version.

To be honest though, all the work here is capturing 1 or 2 C strings from an intercepted libc API and writing them out to a tracking file. I am puzzled as to how something as simple as this could have so much overhead on the intercepted function.

I will continue to tweak it and see how far I can push this.

I meant as CStr/CString

Shouldn't you be compiling your code in release mode? It looks like your Makefile is only generating debug builds (all paths are target/debug/, no target/release/ directories or cargo build --release), so I'm not surprised you are having performance issues.

Like C++, a lot of Rust's performance comes from the optimiser inlining all the zero-cost abstractions and performing optimisations. There's no point trying to profile a debug build.


Thanks for the pointer.

bash-4.4$ tests/testprog64 readlink 1000000
testprog running......
Program runtime: 1.05 seconds

Vs

+ env -i RUST_BACKTRACE=1 TERM=xterm-256color HOME=/users/sarvi 'LD_PRELOAD=/ws/sarvi-sjc/wisktrack/${LIB}/libwisktrack.so' PATH=/usr/share/Modules/bin:/usr/local/go/bin:/users/sarvi/.cargo/bin:/auto/binos-tools/bin:/router/bin:/usr/cisco/bin:/usr/atria/bin:/usr/bin:/usr/local/bin:/usr/local/etc:/bin:/usr/X11R6/bin:/usr/sbin:/sbin:/usr/bin:/auto/nova-env/Lab/Labpatch/bin USER=sarvi WISK_TRACE= WISK_TRACK= WISK_CONFIG= WISK_WSROOT=/ws/sarvi-sjc/wisktrack tests/testprog64 readlink 1000000
testprog running......
Program runtime: 2.79 seconds

That's a big improvement from 11.35 seconds.

I didn't realize a debug build would make this big of a performance difference.

Are there further optimizations beyond --release that might be useful?

Depends on what you need and what you can afford to enable/disable:
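For instance, these are the usual knobs in Cargo.toml (each one a trade-off, worth benchmarking rather than assuming):

    # Cargo.toml

    [profile.release]
    lto = true           # cross-crate inlining at link time; slower builds
    codegen-units = 1    # better optimisation, less parallel codegen
    panic = "abort"      # drops unwinding machinery; panics kill the process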

It's not uncommon for unoptimised Rust to be 10-50x slower than a release build. Optimisations have a much bigger effect than in languages like C because the language puts a bigger emphasis on "zero-cost abstractions".

On one hand this is awesome because we get to write high-level code like iterators and closures, but it also means you need the optimiser to peel away all these abstractions and trivial functions in order to "see" enough context to generate good quality machine code.

If you want to improve performance further, you'll probably want to step through all the code that gets triggered in your hook and ask whether it is necessary. There's a good chance you are doing a lot more work than you need to because it's just so easy to add dependencies in Rust. That's in comparison with C where the only dependency you ever really use is libc, so everything else would have been written by you.
