How to track down a scaling problem?


#1

I’m currently trying to diagnose a scaling problem I have observed while benchmarking my code. Specifically, I have a test case that should be O(N), whose run time is going up faster than that. Based on the slope on a log-log plot, I know that somewhere in my code is something that is probably O(N^2). The trouble is that this behavior only shows up when the computation is taking about two minutes, and even then the nonlinear contribution is a pretty small fraction of the run time. I’ve attempted the obvious approach of profiling a large-N run, but nothing popped out.

It has occurred to me that maybe microbenchmarks would be more to the point. If I can benchmark separate functions, maybe I can more easily and cheaply identify which are scaling as I expect and which are not. I’ve not done any of this, mostly because I want to stick with stable Rust. Presumably I could create benchmarks that are only used when building with the nightly compiler?
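One crude approach that works on stable Rust, with no bench harness at all, is to time the suspect function at a few doubling sizes using std::time::Instant and estimate the exponent from the log-log slope. A minimal sketch (`workload` here is a made-up stand-in, deliberately quadratic for illustration; substitute the real function under test):

```rust
use std::time::Instant;

/// Hypothetical stand-in for the function under test.
/// Deliberately O(N^2) here for illustration.
fn workload(n: usize) -> u64 {
    let mut acc = 0u64;
    for i in 0..n {
        for j in 0..n {
            acc = acc.wrapping_add((i ^ j) as u64);
        }
    }
    acc
}

fn main() {
    let sizes = [1_000usize, 2_000, 4_000, 8_000];
    let mut samples = Vec::new();
    for &n in &sizes {
        let start = Instant::now();
        let result = workload(n);
        let secs = start.elapsed().as_secs_f64();
        // Print the result so the optimizer can't discard the loop.
        println!("n = {:>6}  time = {:.4}s  (checksum {})", n, secs, result);
        samples.push((n as f64, secs));
    }
    // Estimate the exponent from the last two points:
    // slope = ln(t2/t1) / ln(n2/n1); ~1 means linear, ~2 quadratic.
    let (n1, t1) = samples[samples.len() - 2];
    let (n2, t2) = samples[samples.len() - 1];
    let slope = (t2 / t1).ln() / (n2 / n1).ln();
    println!("estimated exponent: {:.2}", slope);
}
```

An exponent near 1 means the function is scaling linearly; near 2 points at the quadratic culprit. Doubling N a few times per function keeps each run cheap compared to a full large-N profiling session.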

Any other suggestions for what to do? I’m guessing somewhere in my code is a Vec::contains or similar (introduced to fix a bug, I’m sure), but it’s pretty scary code, and I feel daunted at trying to find it.

Any suggestions would be welcome!


#2

If you put benchmarks in the benches/ directory, only the nightly compiler will find them, and they won’t break stable builds.


#3

It looks like the problem is in calls to /bigbro/doc/bigbro/struct.Command.html#method.spawn_and_hook, which really should be O(1). Tests clearly show that it is not, which is challenging for me to understand. All I can think is that something is leaking in the process of spawning and tearing down a couple of hundred thousand threads and channels, but I can’t tell what. :slightly_frowning_face:

My first thought was that threads were not being disposed of, but /proc/xxx/status shows only three threads at a time. Is it possible that std::thread::spawn is not generating detached threads?
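For what it’s worth, std::thread::spawn returns a JoinHandle, and dropping that handle is what detaches the thread; a detached thread’s stack and bookkeeping are still reclaimed when it exits, so detachment by itself shouldn’t leak. A small sketch (the atomic-flag busy-wait is just for the example; a real program would use a channel):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Spawn a detached thread and wait for it to finish via a shared flag.
fn run_detached() -> bool {
    let done = Arc::new(AtomicBool::new(false));
    let flag = Arc::clone(&done);
    // Dropping the JoinHandle detaches the thread; its resources are
    // freed by the runtime when it exits, with no join() required.
    drop(thread::spawn(move || {
        flag.store(true, Ordering::SeqCst);
    }));
    // Crude wait so main() doesn't exit before the thread runs.
    while !done.load(Ordering::SeqCst) {
        thread::sleep(Duration::from_millis(1));
    }
    done.load(Ordering::SeqCst)
}

fn main() {
    assert!(run_detached());
    println!("detached thread finished and was reaped");
}
```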


#4

File descriptors or memory most likely. Also, check any calls to poll() or select().
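One cheap way to watch for a descriptor leak on Linux is to count the entries in /proc/self/fd between iterations; if each spawn/reap cycle leaves the count creeping upward, that’s the leak. A minimal sketch (Linux-only, since it relies on /proc):

```rust
use std::fs;
use std::io;

/// Count this process's open file descriptors (Linux-only: reads /proc).
/// The count includes the directory handle read_dir itself opens, but
/// that bias is the same for every call, so deltas are still meaningful.
fn open_fd_count() -> io::Result<usize> {
    Ok(fs::read_dir("/proc/self/fd")?.count())
}

fn main() -> io::Result<()> {
    let before = open_fd_count()?;
    let file = fs::File::open("/proc/self/status")?;
    let during = open_fd_count()?;
    drop(file);
    let after = open_fd_count()?;
    // In a leak hunt, `after` creeping above `before` across many
    // iterations is the smoking gun.
    println!("before={} during={} after={}", before, during, after);
    Ok(())
}
```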


#5

I checked file descriptors, and there were only five of them, so I don’t think that’s it.


#6

I’m starting to think that perhaps the total amount of memory is at issue. The 100k job takes 500M of memory. I’m still puzzled by where I’m seeing the slowdown happen. Do you have any idea if a fork would be affected by the total amount of memory in use?


#7

fork() is very much affected by total amount of memory in use – in fact, it is O(n) in the size of the address space of the forking process! If a big process needs to launch external programs, the correct system call is vfork(), but note that there is very little one can do between calls to vfork() and exec() without causing stack corruption (essentially, just system calls). In particular, one is not allowed to return from the function that called vfork() on the child side. A good implementation of posix_spawn() should use vfork() under the hood, but I am not sure if glibc’s implementation does so.

Typical use of vfork() goes like this (in C on Linux):

/* Requires <unistd.h>, <fcntl.h>, <errno.h>, <sys/types.h>, <sys/wait.h> */
int buf[2];
if (pipe2(buf, O_CLOEXEC) < 0)
   return -1;
pid_t x = vfork();
if (x < 0) {
   close(buf[0]);
   close(buf[1]);
   return -1;
} else if (x == 0) {
   /* Child process: only (roughly) system calls are safe here */
   execve(prog, args, env);
   /* Still alive??? execve failed; send errno to the parent */
   int q = errno;
   write(buf[1], &q, sizeof q);
   /* no error handling */
   _exit(127);
} else {
   /* Parent: close the write end, or read() below never sees EOF */
   close(buf[1]);
   int err;
   ssize_t n = read(buf[0], &err, sizeof err);
   close(buf[0]);
   if (n == (ssize_t) sizeof err) {
      /* exec failed in the child; reap it and report its errno */
      int dummy;
      waitpid(x, &dummy, 0);
      errno = err;
      return -1;
   } else {
      /* EOF: O_CLOEXEC closed the pipe, so execve succeeded */
      errno = 0;
      return x;
   }
}

That said, if Rust doesn’t use vfork() in the stdlib then that is a bug in Rust.


#8

In my case, I am using ptrace on the child process. This requires the child process to stop itself before calling execve, which doesn’t work so well with vfork. I suppose I could create a new thread before calling vfork, but that sounds like it could actually slow things down in the case where memory use is small, since it would create both a temporary thread and a process.

Also, vfork just scares me. Currently, one version of my code allocates memory after forking, which would of course cause bad things to happen under vfork. I have now switched that to an array on the stack, since even fork doesn’t make malloc always safe in the forked process. There is an fprintf that probably wouldn’t be vfork-safe, which reports an error that can show up when seccomp is used. I am currently switching that to a call to write, because why not make it more efficient anyhow? But knowing just a little of what can go wrong with vfork makes me frightened to use it.

I agree that in the simple case, vfork could be a reasonable optimization, although it would seem tempting to use a posix_spawn variant instead, which pushes the vfork-correctness obligation onto someone else…
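For what it’s worth, std::process::Command on Linux/glibc can already take the posix_spawn path internally for simple spawns (an implementation detail of std, not a documented guarantee), so the plain-spawn case may dodge fork()’s O(address-space) cost without any unsafe code:

```rust
use std::process::Command;

fn main() {
    // On glibc, std may route this through posix_spawn() when no
    // pre_exec hooks or unusual settings are involved, avoiding a
    // full fork() of a large address space. (Implementation detail,
    // not a guarantee.)
    let status = Command::new("true")
        .status()
        .expect("failed to spawn `true`");
    assert!(status.success());
    println!("spawned and reaped a child via Command");
}
```

That said, the ptrace setup described above needs a pre_exec hook, which (as far as I can tell) forces std back onto the plain fork path, so this only helps the unhooked case.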

In my example, reducing the memory use made things much faster, and that is certainly a good outcome overall!


#9

If you (or anyone) would like to audit for vfork safety, the relevant code is below:

Although there remains the complexity and cost of creating a new thread so we can restart the child after attaching to it with ptrace.