Thoughts about profiling Rust/Tokio applications

Profiling Rust/Tokio Applications

The normal way people write server applications in Rust is with Tokio, an asynchronous runtime. Generally every request gets its own async Tokio "task", and tasks are scheduled across threads. Threads are normally fully async and only block in epoll_wait(2). There are two Tokio schedulers: the "multi-threaded runtime", where tasks can be rescheduled between threads at await points, and the "single-threaded runtime", which has lower task-switch latencies but pins each task to the thread it started on. Applications use whichever scheduler is right for them.
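
As a rough illustration (the thread count and closure bodies here are arbitrary, and this assumes the full tokio feature set), choosing between the two schedulers is just a matter of how the runtime is built:

use tokio::runtime::Builder;

fn main() {
    // Multi-threaded scheduler: tasks can be rescheduled onto other worker
    // threads at .await points (work stealing).
    let multi_thread = Builder::new_multi_thread()
        .worker_threads(4)
        .enable_all()
        .build()
        .unwrap();

    // Current-thread scheduler: lower task-switch latency, but every task
    // stays on the thread that drives the runtime.
    let current_thread = Builder::new_current_thread()
        .enable_all()
        .build()
        .unwrap();

    multi_thread.block_on(async { /* serve requests ... */ });
    current_thread.block_on(async { /* serve requests ... */ });
}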

Rust programs are generally fast, and the time to answer a request is normally in the single-digit milliseconds if not faster. However, people generally want their programs to be faster still, and they often use profiling to find the places where their programs are slow.

There are two questions that profiling tries to answer. One is "why is my program taking so long to service requests?". The second is "why is my CPU usage stuck at a high percentage that limits my throughput?".

The second question is generally the easier one to answer. Good sampling profilers like async-profiler or perf give you the information directly - the CPU time they report corresponds directly to the CPU time your program spends in each code path.

The first question is currently harder to answer.

The time a Rust program takes to answer a request can be divided into 3 parts:

  1. The CPU time for processing the request itself
  2. The time the task spends blocked on a dependency, waiting to be woken up
  3. The time the task waits for unrelated CPU-bound/thread-blocking work in order to be scheduled

Since the Tokio "reactor" that is responsible for calling epoll_wait(2) and putting tasks on the run queue is itself a job that needs to be scheduled (even in the multi-threaded executor, in some circumstances), the response from a dependency can have already arrived while the reactor has not yet run; you then have a task that is not on the run queue but is actually in state (3).

Debugging CPU time for processing the request itself

Normally, ordinary profiling tools are good enough for this. However, for some applications there are cases where it's hard to tell what the code is doing given just a stack trace - for example, in one large Amazon project, a single process can serve multiple "tenants", different "tenants" can have wildly different data shapes, and knowing the shape of the data is generally a big aid in figuring out why something is slow. Other use-cases have similar needs - for example, How Our Cloud SIEM Team Uses Context Attributes With Continuous Profiler to Get Crucial Performance Insights | Datadog describes a non-Amazon service where one would like to see which "monitoring rule" is taking the CPU time.

For that, what is generally believed to help is some way for application programmers to attach key/value context to their stack traces, in the ordinary context-aware profiling fashion (for example, a TENANT=this-tenant context, and some way in the UI to see which tenant is taking all the samples).
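
There is no standard Rust API for this today, so the following is only a hypothetical sketch of the application side: a per-thread context slot set by an RAII guard, which a context-aware sampling profiler could read whenever it captures a stack trace on that thread. The names (PROFILING_CONTEXT, TenantGuard) are made up, and with the multi-threaded runtime you would want task-local rather than thread-local storage.

use std::cell::RefCell;

thread_local! {
    // Hypothetical slot a context-aware sampler would read alongside the
    // stack trace it captures on this thread.
    static PROFILING_CONTEXT: RefCell<Option<String>> = RefCell::new(None);
}

// RAII guard so the context is cleared even on early return or panic.
struct TenantGuard;

impl TenantGuard {
    fn set(tenant: &str) -> TenantGuard {
        PROFILING_CONTEXT.with(|c| *c.borrow_mut() = Some(format!("TENANT={tenant}")));
        TenantGuard
    }
}

impl Drop for TenantGuard {
    fn drop(&mut self) {
        PROFILING_CONTEXT.with(|c| *c.borrow_mut() = None);
    }
}

fn handle_request(tenant: &str) {
    let _guard = TenantGuard::set(tenant);
    // ... request processing: any samples taken here would carry TENANT=<tenant> ...
}

fn main() {
    handle_request("example-tenant");
}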

Debugging time spent blocking on dependencies

For "application" usecases, the best way of detecting time spent blocking on dependencies is either via tracing, or by emitting metrics that . Rust has decent enough tracing and metrics support. Maybe there needs to be some output integration with X-Ray to allow easier visualization.

For data-plane use-cases, however, which generally operate at a high number of requests per core (>1,000 TPS) and which are a major share of Rust's use-cases, tracing tends to have the problem that tracing every little thing has too high a performance impact to be practical, which in my experience tends to end up with accidental performance bottlenecks happening in exactly the "little things" that were ignored.

One way to mitigate this, which I believe we should pursue, is to have some way to emit task dumps when a task that is supposed to finish quickly has been taking too long (multiple milliseconds) to finish (Tokio already has task dump functionality: Handle in tokio::runtime - Rust).
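
A very rough sketch of that idea using Tokio's existing task-dump API (it is unstable: it needs the tokio_unstable and tokio_taskdump cfg flags and is only supported on a few platforms; the dump_if_slow watchdog wrapper itself is my own invention):

use std::future::Future;
use std::time::Duration;

// Hypothetical watchdog: if `fut` exceeds its budget, dump the traces of all
// live tasks, then keep waiting for it.
async fn dump_if_slow<F: Future>(budget: Duration, fut: F) -> F::Output {
    tokio::pin!(fut);
    tokio::select! {
        out = &mut fut => out,
        _ = tokio::time::sleep(budget) => {
            // Requires RUSTFLAGS="--cfg tokio_unstable --cfg tokio_taskdump".
            let dump = tokio::runtime::Handle::current().dump().await;
            for (i, task) in dump.tasks().iter().enumerate() {
                eprintln!("task {i}:\n{}", task.trace());
            }
            fut.await
        }
    }
}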

It might be good to integrate the task dumps with the async-profiler backend.

Debugging time waiting for unrelated CPU-bound/thread-blocking work

Because Tokio schedules tasks cooperatively, a task can wait in the run queue until other jobs finish being polled. Even when using a multi-threaded executor, in some cases reactor polling can wait for tasks to finish, so high poll times can cause latency for unrelated tasks. Therefore, it is important to detect the causes of O(millisecond) poll times, even if they are not a high percentage of the CPU time over a timescale of seconds.

I don't know of good prior art for debugging high poll times, but one approach that I managed to make work in some cases is to keep a timestamp of the last time the executor went idle and attach the elapsed time since then to stack traces. That way, stack traces captured a long time after the last executor idle can be easily spotted and fixed.
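
Here is a minimal sketch of the bookkeeping half of that idea (the thread-local name is made up, and hooking the value into an actual profiler's samples is left out): Tokio's runtime builder lets you run a callback whenever a worker thread starts or unparks, i.e. stops being idle:

use std::cell::Cell;
use std::time::Instant;

thread_local! {
    // When this worker last stopped being idle. A sampling profiler's signal
    // handler (or any task code) can read it and attach the elapsed time to
    // the captured stack trace.
    static LAST_IDLE_END: Cell<Option<Instant>> = Cell::new(None);
}

fn main() {
    let rt = tokio::runtime::Builder::new_multi_thread()
        .enable_all()
        .on_thread_start(|| LAST_IDLE_END.with(|t| t.set(Some(Instant::now()))))
        .on_thread_unpark(|| LAST_IDLE_END.with(|t| t.set(Some(Instant::now()))))
        .build()
        .unwrap();

    rt.block_on(async {
        tokio::spawn(async {
            // ... task work ...
            let busy_for = LAST_IDLE_END.with(|t| t.get()).map(|t| t.elapsed());
            println!("time since this worker was last idle: {busy_for:?}");
        })
        .await
        .unwrap();
    });
}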

Appendix A: What is Perf

perf(1) is a command-line tool that uses perf_event_open(2) to collect its information. See perf_event_open(2) - Linux manual page for the documentation.

perf_events triggers a "perf event" every time a "performance counter" overflows. The usual performance counter counts CPU cycles or wall-clock time, which gives you periodic sampling, but other things can be counted as well (e.g. cache misses, or retired branches if you know Mozilla rr).

It is possible to use perf events with POSIX async IO; in that case, when a perf event is triggered, a signal is generated.

On a perf event, one of these generally happens:
mode 1: the kernel captures a frame-pointer-based user+kernel-mode stack trace (PERF_SAMPLE_CALLCHAIN)
mode 2: the kernel copies N bytes of the user stack into the sample, for later stack unwinding in userspace (PERF_SAMPLE_STACK_USER)
mode 3: a signal is emitted, and the signal handler can do whatever it wants (e.g. take a backtrace) (async IO mode)

1 Like

I think I am going to implement the "timestamp of the last executor idle attached to stack traces" and "emit task dumps when a task that is supposed to finish quickly has been taking too long" solutions, since I have already encountered a few cases where they would have made some debugging much simpler, but I'm wondering whether anyone else is interested in this area.

I'm interested in this topic (instrumenting async Rust applications) and would love to hear how you get on.

I'm very slowly looking into how far we can get with tracing and perhaps reducing the overhead. That certainly gives you the hooks to measure what you call state (3). In Tokio Console (I'm a maintainer), we call your 3 states:

  1. Busy
  2. Idle
  3. Scheduled

Where the scheduled time is from when the waker is woken until that task is next polled.

Whether the overhead can be made low enough that tracing could stay on in production is application specific, and certainly not possible for some use cases (but it may be for many others).
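
(For anyone reading along: hooking an application up to the console is roughly the following, assuming the console-subscriber crate and a build with RUSTFLAGS="--cfg tokio_unstable"; the tokio-console CLI then connects to it and shows the per-task busy / idle / scheduled times.)

#[tokio::main]
async fn main() {
    // Install the instrumentation layer that the `tokio-console` CLI talks to.
    console_subscriber::init();

    // ... the rest of the application runs as usual ...
}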

That's quite a run-on sentence! :smiley: The issue you mention with tracing having its own performance impact (apart from being a fundamental limitation of nature in the Schrödinger's Cat kind of way) is a cruel truth that you just have to deal with if you want contextual information for tenancy reasons. In other words, tracing and context are one and the same, or at least definitely live in the same ballpark.

This kind of problem can often be abstracted into an I/O-bound (rather than CPU-bound) one, so that it can be analyzed with the same framework as profiling other I/O-bound workloads - for instance, by moving the CPU-bound workload to its own machine. Now the task in question is "just waiting on I/O" from the host performing the CPU work.

I'm aware this is not always possible: the workload may require a large amount of memory that would be too costly to transfer to another host, among other tradeoffs. But making the problem separable means that CPU resource contention may no longer be a concern (especially for high-tenancy applications).

The expensive part of tracing is not keeping track of the current tenant (though I admit that any system for keeping track of the current tenant and/or request can be called "tracing"), but rather managing the generic log messages that may need to be generated from very many points in the code.

If you know what is consuming CPU, there are many things you can do to avoid it slowing down your service. But that requires knowing what is consuming the CPU.

And in my experience, knowing that some part of my program is occasionally consuming a lot of CPU in production is not very helpful for finding out which part of my program that is, which is what profiling is for.

One of the biggest limitations with profilers is that they usually don't have any knowledge of user-land tasks (Future), and can only give you stack-based analysis (flamegraph and so forth). It feels like tokio-console wants to address this shortcoming.

The other big problem exists at the language level. I would love to have something in the language that is context-sensitive to "blocking" functions. There isn't a strong definition for "blocking" that will make it obvious where that would apply. Some candidates to start with would be Mutex::lock() and Receiver::recv(), but ideally not Mutex::try_lock() or Receiver::try_recv().

What makes this rather precarious is that Receiver::try_recv() does use an inner Mutex and will block [1] (however briefly) when it is under contention. With language support, I would expect this "blocking" property to bubble all the way out from that inner Mutex to my Receiver::try_recv() call in an async fn. Just give me a compiler error, even though the docs for try_recv() claim it will never block.

Assuming the "blocking" property could be applied thoroughly and false positives could be opted-out, that would resolve the biggest case of inappropriately consuming CPU time. It would include things like async spin locks that consume 10ms before yielding. And this gets into proposing a strong definition of "blocking". 10ms is eternity when most CPU operations are expected to be in the 100's of microseconds range.

I don't have any solutions. Just a dream that these bugs can be prevented in the first place.


  1. I re-discovered this issue by manually tracing code to help answer a question on the forum. The thread is linked in the GitHub issue if you're curious. But the frustrating thing is that finding these problems still comes down to manual audits, since profiling and other analysis tools are of almost no use for it at this time. ↩︎

2 Likes

Is there a difference from profiling C or any other language? I just use the same old perf, sometimes VTune, in basically the same way.

Assembly dumps are slightly harder to read in Rust, but I am slowly getting used to it, and with cargo asm it's actually nice to read.

1 Like

When you are looking into why your CPU utilization is high, then any System V ABI-compatible profiler will indeed work. This is less helpful for telling you why you have latency spikes.

What's the sort of goal you would like "knowledge of user-land tasks" for?

Non-blocking is more of a system property than a function property. Even the allocator can block for a long time; it's just that modern allocators are designed to avoid that.

Specifically to understand performance issues that are observable at the task level. The example of a multi-tenant server is good for understanding the requirements: If a high-value customer is observing requests that take longer than they find acceptable, you want to know why that is. No amount of stack-based analysis or machine code analysis will be helpful for this.

Receiver::try_recv() is also designed to avoid blocking for a long time, and yet its current implementation can cause WASM in a web browser to panic (uncaught exception when attempting to block the main thread on WASM atomics). "... block for a long time" is not nearly specified enough to define the issue at hand.

Rayon also has this exact issue. Trying to profile a Rayon program is very difficult, I have found. Oftentimes the same code can be reached via different stack traces through Rayon (not sure why; maybe "I was assigned this" vs "I stole this"?).

It is probably a generic issue for all sorts of work schedulers.

1 Like

This recent podcast explores the idea of async allocators.

It is obviously wildly incompatible with any existing code, but the idea is interesting. It would be possible to make an entirely async system, where everything that isn't CPU bound is async. In fact this has been explored in some niche/research OSes. The most well known is probably Microsoft Project Midori Research OS (which ran only managed .NET-like code), see Joe Duffy - Asynchronous Everything

So it can be done, and has some intriguing results. But it would take some massive redesigning of the existing computing landscape to get there.

I am going to use tools like perf, samply, flamegraph, Valgrind, DHAT, and Coz to profile and identify low-level bottlenecks. Our project may include many dependencies to upstream or rewrite. It's critical to find the best tools for our case (Rust & Tokio friendly, high precision for low-latency code, etc.). Can you share insights into which of these tools (or others) would work best for our case?

This is unfortunately not quite the right conclusion to derive from allocators taking arbitrary time to return. Blocking is defined as a Future that does not periodically yield Poll::Pending. The precise definition of periodicity or duration is debatable.

We can explore a simple example to show that Poll::Pending is what makes a Future non-blocking:

#[tokio::main(flavor = "current_thread")]
async fn main() {
    let _ = tokio::join!(
        tokio::task::spawn(yielding_fut()),
        tokio::task::spawn(blocking_fut()),
    );
}

async fn blocking_fut() {
    println!("This future is blocking...");

    loop {
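        // std::future::ready(()) is already Ready, so this .await completes
        // immediately and never returns Poll::Pending to the scheduler.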
        std::future::ready(()).await;
    }
}

async fn yielding_fut() {
    tokio::task::yield_now().await;

    println!("... which prevents this future from making any progress!");
}

The second line will never be printed, even though there is clearly an await keyword in the blocking future! Why is it blocking? Because it never returns Poll::Pending. If it is modified to periodically return Poll::Pending, as shown in this example: Strange tokio select behavior - #3 by parasyte, you will eventually see the second task make progress (and run to completion).
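
For completeness, here is my own minimal variant of such a fix (not necessarily identical to the linked example): inserting an explicit yield point into the loop makes the future cooperate again:

async fn no_longer_blocking_fut() {
    println!("This future now cooperates...");

    loop {
        // yield_now() returns Poll::Pending once and immediately wakes itself,
        // giving the scheduler a chance to poll other tasks on this thread.
        tokio::task::yield_now().await;
    }
}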

The fact that some functions take "some time" to return is undesirable and contributes to the periodic property of blocking. But it is not the definition of what blocks a cooperative scheduler.

Therefore, non-blocking is a property of the function, not the system.

1 Like

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.