Profiling Rust/Tokio Applications
The normal way to write server applications in Rust is with Tokio, an asynchronous runtime. Generally, every request gets its own async Tokio "task", and tasks are scheduled across threads. Threads are normally fully async and only block on epoll_wait(2). There are two Tokio schedulers: the "multi-threaded runtime", where tasks can be rescheduled between threads at await points, and the "current-thread" (single-threaded) runtime, which has lower task-switch latency but keeps each task pinned to the thread it started on. Applications use whichever scheduler is right for them.
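As a point of reference, here is a minimal sketch of the two runtime flavors; the handle_request function and the spawning loop are placeholders, not part of any real service.

```rust
use tokio::runtime::Builder;

// Placeholder request handler; a real service would parse the request,
// await downstream dependencies, and write a response.
async fn handle_request(id: u64) {
    println!("handled request {id}");
}

fn main() -> std::io::Result<()> {
    // Multi-threaded runtime: tasks may migrate between worker threads
    // at await points.
    let multi = Builder::new_multi_thread().enable_all().build()?;
    multi.block_on(async {
        let mut handles = Vec::new();
        for id in 0..4 {
            // One task per "request", scheduled across the worker threads.
            handles.push(tokio::spawn(handle_request(id)));
        }
        for h in handles {
            h.await.unwrap();
        }
    });

    // Current-thread runtime: lower task-switch latency, but every task
    // stays on the thread that drives the runtime.
    let single = Builder::new_current_thread().enable_all().build()?;
    single.block_on(handle_request(99));
    Ok(())
}
```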
Rust programs are generally fast, and the time to answer a request is normally in the single-digit milliseconds, if not faster. However, people generally want their programs to be even faster, and they often use profiling to find the places where their programs are slow.
There are two questions that profiling tries to answer. One is "why is my program taking long to service requests?". The second is "why is my CPU stuck at a high utilization that limits my throughput?".
The second question is generally the easier one to answer. Good sampling profilers like async-profiler or perf give you the information directly: the CPU time they report for a function is directly proportional to the CPU time your program actually spends in it.
The first question is currently harder to answer.
The time a Rust program takes to answer a request can be divided into 3 parts:
- The CPU time for processing the request itself
- The time the task spends blocked on a dependency, waiting to be woken up
- The time the task waits for unrelated CPU-bound/thread-blocking work in order to be scheduled
Since the Tokio "reactor" that is responsible for calling epoll_wait(2) and putting tasks on the run queue is itself a job that needs to be scheduled (even in the multi-threaded executor, in some circumstances), if the response from the dependency has already arrived but the reactor has not yet run, you can have a task that is not on the run queue but is actually in the third state above, waiting for unrelated work.
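To make the split concrete, here is a hedged sketch of what a request handler can and cannot observe from the inside; call_dependency is a hypothetical stand-in for a real downstream client.

```rust
use std::time::Instant;

// Hypothetical downstream call; stands in for DynamoDB, another service, etc.
async fn call_dependency(request: &str) -> String {
    format!("response to {request}")
}

async fn handle_request(request: &str) -> String {
    let total = Instant::now();

    let before_await = Instant::now();
    let response = call_dependency(request).await;
    // This number mixes the second and third parts above: time truly spent
    // waiting on the dependency, plus any time the woken task sat around
    // because the reactor or other CPU-bound work delayed its scheduling.
    let await_time = before_await.elapsed();

    // Roughly the first part: CPU time spent processing the request itself.
    let own_work_time = total.elapsed() - await_time;
    println!("await: {await_time:?}, own work: {own_work_time:?}");
    response
}
```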
Debugging CPU time for processing the request itself
Normally, ordinary profiling tools are good enough for this. However, for some applications there are cases where it's hard to tell what the code is doing given just a stack trace. For example, in one large Amazon project, a single process can serve multiple "tenants", different "tenants" can have wildly different data shapes, and knowing the shape of the data is generally a big aid in figuring out why something is slow. Other use cases have similar needs - for example, "How Our Cloud SIEM Team Uses Context Attributes With Continuous Profiler to Get Crucial Performance Insights | Datadog" describes a non-Amazon service where one would like to see which "monitoring rule" is taking the CPU time.
For that, what is generally believed to help is some way for application programmers to add key/value context to their stack traces, in the ordinary context-aware profiling fashion (for example, a TENANT=this-tenant context, and some way in the UI to see which tenant is taking all the samples).
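A minimal sketch of how such a context could be carried, assuming a Tokio task-local key; how a context-aware profiler would actually read TENANT and attach it to samples is left open, since no standard Rust mechanism for this exists.

```rust
tokio::task_local! {
    // Hypothetical key/value context a context-aware profiler could attach
    // to every sample taken while this task is running.
    static TENANT: String;
}

async fn handle_request(tenant: String, request: String) -> String {
    // All samples taken inside this scope would be labeled TENANT=<tenant>.
    TENANT
        .scope(tenant, async move {
            // ... actual request processing; a profiler hook could read the
            // current value with TENANT.with(|t| t.clone()) ...
            format!("processed {request}")
        })
        .await
}
```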
Debugging time spent blocking on dependencies
For "application" usecases, the best way of detecting time spent blocking on dependencies is either via tracing, or by emitting metrics that . Rust has decent enough tracing and metrics support. Maybe there needs to be some output integration with X-Ray to allow easier visualization.
However, for data-plane use cases, which generally operate at a high rate of requests per core (>1,000 TPS) and which are a major fraction of Rust's use cases, tracing every little thing has too high a performance impact to be practical. In my experience this tends to end up with accidental performance bottlenecks hiding in exactly the "little things" that the tracing ignored.
One way to mitigate this, which I believe we should pursue, is to have some way to emit task dumps when a task that is supposed to finish quickly has been taking too long (multiple milliseconds) to finish (Tokio already has task dump functionality: Handle in tokio::runtime - Rust).
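A hedged sketch of what triggering such a dump could look like, using Tokio's existing (unstable) task-dump API; note that task dumps currently require building with RUSTFLAGS="--cfg tokio_unstable --cfg tokio_taskdump" and only work on some platforms, and the watchdog wiring below is an assumption, not an existing Tokio feature.

```rust
use std::time::Duration;
use tokio::runtime::Handle;

// Hypothetical watchdog: if `work` has not finished within `budget`,
// capture a task dump of the whole runtime before continuing to wait.
async fn run_with_dump_on_overrun<F: std::future::Future>(
    budget: Duration,
    work: F,
) -> F::Output {
    tokio::pin!(work);
    match tokio::time::timeout(budget, &mut work).await {
        Ok(output) => output,
        Err(_elapsed) => {
            // The task overran its budget; dump every task's async backtrace.
            // Bound the dump itself with a timeout, since capturing it can
            // take a while.
            if let Ok(dump) =
                tokio::time::timeout(Duration::from_secs(2), Handle::current().dump()).await
            {
                for (i, task) in dump.tasks().iter().enumerate() {
                    eprintln!("task {i} trace:\n{}", task.trace());
                }
            }
            // Keep waiting for the slow task to actually finish.
            work.await
        }
    }
}
```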
It might be good to integrate the task dumps with the async-profiler backend.
Debugging time waiting for unrelated CPU-bound/thread-blocking work
Because of Tokio's massively-async style (many tasks multiplexed over few threads), tasks can wait in the run queue until other jobs are processed. Even when using a multi-threaded executor, in some cases the reactor polling can wait for tasks to finish, so high poll times can cause latency for unrelated tasks. Therefore, it is important to detect the causes of O(millisecond) poll times, even if they are not a high percentage of the CPU time over a timescale of seconds.
I don't know of good prior art for debugging high poll times, but one approach I managed to make work in some cases is to keep a running counter since the last time the executor was idle and attach that time to stack traces. This way, stack traces captured a long time after the last executor idle can be easily monitored and fixed.
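A minimal sketch of how the "time since last executor idle" counter could be tracked, assuming Tokio's on_thread_unpark hook is a good enough proxy for "the worker stopped being idle"; how the value then gets attached to profiler stack traces is left out, since that depends entirely on the profiler backend.

```rust
use std::cell::Cell;
use std::time::{Duration, Instant};

thread_local! {
    // Per-worker timestamp of the last time this thread came back from idle.
    static LAST_IDLE_END: Cell<Option<Instant>> = Cell::new(None);
}

// A profiler integration (e.g. a sampling signal handler) could call this
// and attach the returned duration to the captured stack trace.
fn time_since_last_idle() -> Option<Duration> {
    LAST_IDLE_END.with(|t| t.get().map(|start| start.elapsed()))
}

fn main() -> std::io::Result<()> {
    let runtime = tokio::runtime::Builder::new_multi_thread()
        .enable_all()
        // The worker was parked (idle) and is now resuming work:
        // restart the counter.
        .on_thread_unpark(|| LAST_IDLE_END.with(|t| t.set(Some(Instant::now()))))
        .build()?;

    runtime.block_on(async {
        // ... application ...
        let _ = time_since_last_idle();
    });
    Ok(())
}
```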
Appendix A: What is Perf
perf(1) is a command-line tool that uses perf_event_open(2) to get its information. See "perf_event_open(2) - Linux manual page" for the documentation about perf.
perf_events triggers a "perf event" every time a "performance counter" overflows. The performance counter normally counts CPU cycles or wall-clock time, which yields periodic sampling, but other things can be counted as well (e.g. cache misses, or retired branches if you know Mozilla's rr).
When a "perf event" happens.
It is possible to use perf events with POSIX asynchronous (signal-driven) I/O. In that case, when a perf event is triggered, a signal is delivered to the process.
On a perf event, one of these things generally happens, depending on how the event was configured:
mode 1: the kernel captures a frame-pointer-based user+kernel-mode stack trace (PERF_SAMPLE_CALLCHAIN)
mode 2: the kernel copies N bytes of the user stack into the sample buffer, for use in later stack unwinding (PERF_SAMPLE_STACK_USER)
mode 3: a signal is delivered, and the signal handler can do whatever it wants (e.g. take a backtrace) (async I/O mode)