My LD_PRELOAD library intercepts various glibc filesystem calls and writes tracking data to a file. I am trying to understand how I can reduce the overhead I am seeing.
Right now my builds are running at 3 times their normal run times.
My initial C version of the libwisktrack.so LD_PRELOAD library didn't have this much overhead; it probably added 10-15% overhead, I would think.
I have a test program that invokes readlink() 1,000,000 times in a loop and takes 1.13 seconds running standalone, without any LD_PRELOAD, as below:
bash-4.4$ tests/testprog64 readlink 1000000
testprog running......
Program runtime: 1.13 seconds
My total overhead in my committed code is 10 seconds, with a run time of 11.35 seconds.
Breaking things down: just a pass-through interception of readlink, with no additional functionality,
hook! {
    unsafe fn readlink(path: *const libc::c_char, buf: *mut libc::c_char, bufsiz: libc::size_t) -> libc::ssize_t => (my_readlink, SYS_readlink, true) {
        real!(readlink)(path, buf, bufsiz)
    }
}
costs an extra 0.11 seconds, roughly 10%:
+ env -i RUST_BACKTRACE=1 TERM=xterm-256color HOME=/users/sarvi 'LD_PRELOAD=/ws/sarvi-sjc/wisktrack/${LIB}/libwisktrack.so' PATH=/usr/share/Modules/bin:/usr/local/go/bin:/users/sarvi/.cargo/bin:/auto/binos-tools/bin:/router/bin:/usr/cisco/bin:/usr/atria/bin:/usr/bin:/usr/local/bin:/usr/local/etc:/bin:/usr/X11R6/bin:/usr/sbin:/sbin:/usr/bin:/auto/nova-env/Lab/Labpatch/bin USER=sarvi WISK_TRACE= WISK_TRACK= WISK_CONFIG= WISK_WSROOT=/ws/sarvi-sjc/wisktrack tests/testprog64 readlink 1000000
testprog running......
Program runtime: 1.24 seconds
real 0m1.330s
user 0m0.391s
sys 0m0.862s
The hook! {} macro above is defined in my fork of the redhook library:
https://github.com/sarvi/redhook/blob/master/src/ld_preload.rs
Adding the following code brings it up to 1.27 seconds, where initialized() is defined here:
https://github.com/sarvi/redhook/blob/master/src/lib.rs
if initialized() {
}
real!(readlink)(path, buf, bufsiz)
Adding the reporting code into the intercept:
if initialized() {
TRACKER.reportreadlink(path);
}
real!(readlink)(path, buf, bufsiz)
where reportreadlink is defined as follows, takes the run time to 3.49 seconds, adding about 2.2 seconds of overhead, i.e. roughly 2-3x the baseline run time.
Here self.report() is a dummy that returns immediately for now; it is the function that will eventually write the data to the dependency tracking file.
pub unsafe fn reportreadlink(self: &Self, path: *const libc::c_char) {
    let args = (utils::ptr2str(path), );
    self.report("READLINK", &serde_json::to_string(&args).unwrap());
}
pub unsafe fn ptr2str<'a>(ptr: *const c_char) -> &'a str {
    CStr::from_ptr(ptr).to_str().unwrap()
}
Focusing on the big cost here: serde_json::to_string(&args).unwrap()
seems like an expensive operation, and all it is creating is
["/tmp/wisk_testlink"]
If self.report() then writes to the file, the time goes up to 5.04 seconds: the overhead of actually writing, which incurs a write syscall. I plan to buffer this output so I can reduce the number of syscalls.
For now, the cost of taking a C char pointer containing a simple C string path and turning it into a JSON string ready to be written to a file seems expensive.
Is there a better, more efficient way of doing this?
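One alternative I'm experimenting with is skipping serde entirely and hand-writing the one-element JSON array into a reusable buffer. This is only a sketch (encode_report is my own hypothetical helper, not part of the committed code), assuming paths rarely contain characters that need escaping:

```rust
// Sketch: hand-rolled JSON encoding for the common case of a
// one-element string array, avoiding serde_json's generic machinery.
// Escaping follows JSON rules for quotes, backslashes, and control
// characters; everything else is copied through unchanged.
fn encode_report(out: &mut String, path: &str) {
    out.push_str("[\"");
    for c in path.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            c if (c as u32) < 0x20 => {
                // Rare control characters: fall back to a \u escape.
                out.push_str(&format!("\\u{:04x}", c as u32));
            }
            c => out.push(c),
        }
    }
    out.push_str("\"]");
}

fn main() {
    // Reusing one preallocated String per call site avoids the
    // per-call allocation that serde_json::to_string incurs.
    let mut buf = String::with_capacity(64);
    encode_report(&mut buf, "/tmp/wisk_testlink");
    assert_eq!(buf, "[\"/tmp/wisk_testlink\"]");
    println!("{}", buf);
}
```

The idea is that clearing and refilling one buffer per thread replaces an allocation plus a full serializer dispatch with a single pass over the path bytes.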
I was also expecting access to TRACKER.reportreadlink(), where TRACKER is a lazy_static instance of the Tracker struct, to involve some mutex, but strace doesn't show that happening. I am wondering if I should turn TRACKER into a thread_local! for two reasons:
- no contention when accessing TRACKER in a multi-threaded program;
- I want a thread-local buffer where I can accumulate the output generated by TRACKER.reportreadlink() and call the write syscall only once every ~4K bytes, to keep the syscall overhead low.
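The thread-local buffering idea might look roughly like this. It is only a sketch: REPORT_BUF, buffered_report, and flush_to_file are hypothetical names, and flush_to_file stands in for the real write(2) to the tracking file:

```rust
// Sketch of per-thread buffering for tracker records. Each thread
// appends to its own buffer with no locking; a write syscall happens
// only when the buffer crosses the ~4K threshold.
use std::cell::RefCell;

const FLUSH_THRESHOLD: usize = 4096;

thread_local! {
    // Per-thread buffer: appending never contends with other threads.
    static REPORT_BUF: RefCell<String> =
        RefCell::new(String::with_capacity(2 * FLUSH_THRESHOLD));
}

// Placeholder for the real write to the dependency tracking file
// (one write(2) per ~4K of records instead of one per intercepted call).
fn flush_to_file(data: &str) {
    let _ = data;
}

fn buffered_report(record: &str) {
    REPORT_BUF.with(|b| {
        let mut buf = b.borrow_mut();
        buf.push_str(record);
        buf.push('\n');
        if buf.len() >= FLUSH_THRESHOLD {
            flush_to_file(&buf);
            buf.clear();
        }
    });
}

fn main() {
    buffered_report("READLINK [\"/tmp/wisk_testlink\"]");
    REPORT_BUF.with(|b| assert!(b.borrow().ends_with("\n")));
}
```

Two things I would still need to handle: flushing partial buffers at thread and process exit, and guarding against re-entrancy if TLS initialization itself triggers an intercepted libc call inside the LD_PRELOAD hooks.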