How to track down a scaling problem?


#1

I’m currently trying to diagnose a scaling problem I have observed while benchmarking my code. Specifically, I have a test case that should be O(N), whose run time is going up faster than that. Based on the slope on a log-log plot, I know that somewhere in my code is something that is probably O(N^2). The trouble is that this behavior only shows up when the computation is taking about two minutes, and even then the nonlinear contribution is a pretty small fraction of the run time. I’ve attempted the obvious approach of profiling a large-N run, but nothing popped out.

It has occurred to me that maybe microbenchmarks would be more to the point. If I can benchmark separate functions, maybe I can more easily and cheaply identify which are scaling as I expect and which are not. I’ve not done any of this, mostly because I want to stick with stable Rust. Presumably I could create benchmarks that are only used when building with the nightly compiler?
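One crude approach that works on stable Rust, with no bench harness at all, is to time the suspect function at a few doubling sizes using std::time::Instant and estimate the exponent from the log-log slope. A minimal sketch (`workload` here is a made-up stand-in, deliberately quadratic for illustration; substitute the real function under test):

```rust
use std::time::Instant;

/// Hypothetical stand-in for the function under test.
/// Deliberately O(N^2) here for illustration.
fn workload(n: usize) -> u64 {
    let mut acc = 0u64;
    for i in 0..n {
        for j in 0..n {
            acc = acc.wrapping_add((i ^ j) as u64);
        }
    }
    acc
}

fn main() {
    let sizes = [1_000usize, 2_000, 4_000, 8_000];
    let mut samples = Vec::new();
    for &n in &sizes {
        let start = Instant::now();
        let result = workload(n);
        let secs = start.elapsed().as_secs_f64();
        // Print the result so the optimizer can't discard the loop.
        println!("n = {:>6}  time = {:.4}s  (checksum {})", n, secs, result);
        samples.push((n as f64, secs));
    }
    // Estimate the exponent from the last two points:
    // slope = ln(t2/t1) / ln(n2/n1); ~1 means linear, ~2 quadratic.
    let (n1, t1) = samples[samples.len() - 2];
    let (n2, t2) = samples[samples.len() - 1];
    let slope = (t2 / t1).ln() / (n2 / n1).ln();
    println!("estimated exponent: {:.2}", slope);
}
```

An exponent near 1 means the function is scaling linearly; near 2 points at the quadratic culprit. Doubling N a few times per function keeps each run cheap compared to a full large-N profiling session.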

Any other suggestions for what to do? I’m guessing somewhere in my code is a Vec::contains or similar (introduced to fix a bug, I’m sure), but it’s pretty scary code, and I feel daunted at trying to find it.

Any suggestions would be welcome!


#2

If you put benchmarks in the benches/ directory, only the nightly compiler will find them, and they won’t break stable builds.


#3

It looks like the problem is in calls to /bigbro/doc/bigbro/struct.Command.html#method.spawn_and_hook, which really should be O(1). Tests clearly show that it is not, which is challenging for me to understand. All I can think is that something is leaking in the process of spawning and tearing down a couple of hundred thousand threads and channels, but I can’t tell what. :slightly_frowning_face:

My first thought was that threads were not being disposed of, but /proc/xxx/status shows only three threads at a time. Is it possible that std::thread::spawn is not generating detached threads?
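For what it’s worth, std::thread::spawn returns a JoinHandle, and dropping that handle is what detaches the thread; a detached thread’s stack and bookkeeping are still reclaimed when it exits, so detachment by itself shouldn’t leak. A small sketch (the atomic-flag busy-wait is just for the example; a real program would use a channel):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Spawn a detached thread and wait for it to finish via a shared flag.
fn run_detached() -> bool {
    let done = Arc::new(AtomicBool::new(false));
    let flag = Arc::clone(&done);
    // Dropping the JoinHandle detaches the thread; its resources are
    // freed by the runtime when it exits, with no join() required.
    drop(thread::spawn(move || {
        flag.store(true, Ordering::SeqCst);
    }));
    // Crude wait so main() doesn't exit before the thread runs.
    while !done.load(Ordering::SeqCst) {
        thread::sleep(Duration::from_millis(1));
    }
    done.load(Ordering::SeqCst)
}

fn main() {
    assert!(run_detached());
    println!("detached thread finished and was reaped");
}
```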


#4

File descriptors or memory most likely. Also, check any calls to poll() or select().
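One cheap way to watch for a descriptor leak on Linux is to count the entries in /proc/self/fd between iterations; if each spawn/reap cycle leaves the count creeping upward, that’s the leak. A minimal sketch (Linux-only, since it relies on /proc):

```rust
use std::fs;
use std::io;

/// Count this process's open file descriptors (Linux-only: reads /proc).
/// The count includes the directory handle read_dir itself opens, but
/// that bias is the same for every call, so deltas are still meaningful.
fn open_fd_count() -> io::Result<usize> {
    Ok(fs::read_dir("/proc/self/fd")?.count())
}

fn main() -> io::Result<()> {
    let before = open_fd_count()?;
    let file = fs::File::open("/proc/self/status")?;
    let during = open_fd_count()?;
    drop(file);
    let after = open_fd_count()?;
    // In a leak hunt, `after` creeping above `before` across many
    // iterations is the smoking gun.
    println!("before={} during={} after={}", before, during, after);
    Ok(())
}
```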


#5

I checked file descriptors, and there were only five of them, so I don’t think that’s it.


#6

I’m starting to think that perhaps the total amount of memory is at issue. The 100k job takes 500M of memory. I’m still puzzled by where I’m seeing the slowdown happen. Do you have any idea if a fork would be affected by the total amount of memory in use?


#7

fork() is very much affected by total amount of memory in use – in fact, it is O(n) in the size of the address space of the forking process! If a big process needs to launch external programs, the correct system call is vfork(), but note that there is very little one can do between calls to vfork() and exec() without causing stack corruption (essentially, just system calls). In particular, one is not allowed to return from the function that called vfork() on the child side. A good implementation of posix_spawn() should use vfork() under the hood, but I am not sure if glibc’s implementation does so.

Typical use of vfork() goes like this (in C on Linux):

/* Requires <unistd.h>, <fcntl.h>, <errno.h>, <sys/types.h>, <sys/wait.h> */
int buf[2];
if (pipe2(buf, O_CLOEXEC) < 0)
   return -1;
pid_t x = vfork();
if (x < 0) {
   close(buf[0]);
   close(buf[1]);
   return -1;
} else if (x == 0) {
   /* Child process: only (roughly) system calls are safe here */
   execve(prog, args, env);
   /* Still alive??? execve failed; send errno to the parent */
   int q = errno;
   write(buf[1], &q, sizeof q);
   /* no error handling */
   _exit(127);
} else {
   /* Parent: close the write end, or read() below never sees EOF */
   close(buf[1]);
   int err;
   ssize_t n = read(buf[0], &err, sizeof err);
   close(buf[0]);
   if (n == (ssize_t) sizeof err) {
      /* exec failed in the child; reap it and report its errno */
      int dummy;
      waitpid(x, &dummy, 0);
      errno = err;
      return -1;
   } else {
      /* EOF: O_CLOEXEC closed the pipe, so execve succeeded */
      errno = 0;
      return x;
   }
}

That said, if Rust doesn’t use vfork() in the stdlib then that is a bug in Rust.


#8

In my case, I am using ptrace on the child process. This requires the child process to stop itself before calling execve, which doesn’t work so well with vfork. I suppose I could create a new thread before calling vfork, but that sounds like it could actually slow things down in the case where memory use is small, since it would create both a temporary thread and a process.

Also, vfork just scares me. Currently, one version of my code allocates memory after forking, which would of course cause bad things to happen under vfork. I have now switched that to an array on the stack, since even fork doesn’t make malloc always safe in the forked process. There is an fprintf that probably wouldn’t be vfork-safe, which reports an error that can show up when seccomp is used. I am currently switching that to a call to write, because why not make it more efficient anyhow? But knowing just a little of what can go wrong with vfork makes me frightened to use it.

I agree that in the simple case, vfork could be a reasonable optimization, although it would seem tempting to use a posix_spawn variant instead, which pushes the vfork-correctness obligation onto someone else…
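For what it’s worth, std::process::Command on Linux/glibc can already take the posix_spawn path internally for simple spawns (an implementation detail of std, not a documented guarantee), so the plain-spawn case may dodge fork()’s O(address-space) cost without any unsafe code:

```rust
use std::process::Command;

fn main() {
    // On glibc, std may route this through posix_spawn() when no
    // pre_exec hooks or unusual settings are involved, avoiding a
    // full fork() of a large address space. (Implementation detail,
    // not a guarantee.)
    let status = Command::new("true")
        .status()
        .expect("failed to spawn `true`");
    assert!(status.success());
    println!("spawned and reaped a child via Command");
}
```

That said, the ptrace setup described above needs a pre_exec hook, which (as far as I can tell) forces std back onto the plain fork path, so this only helps the unhooked case.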

In my example, reducing the memory use made things much faster, and that is certainly a good outcome overall!


#9

If you (or anyone) would like to audit for vfork safety, the relevant code is below:

Although there remains the complexity and cost of creating a new thread so we can restart the child after attaching to it with ptrace.