I'm developing a system built on tonic (the tokio-based gRPC framework); it's a high-performance service to store Spark shuffle data. However, when writing data with highly concurrent gRPC requests, a single gRPC request can take a very long time, up to 30s+. This confuses me.
After investigating the tokio side and dumping the await-tree and the CPU profile, I found the slowness occurs on tokio::sync::Mutex. So I replaced it with std::sync::Mutex, and this looks fine.
But the root cause is not solved. I still find that an .await in tokio sometimes costs 1s (I logged the internal execution, which only costs 1ms), and I suspect a tokio scheduling problem.
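(For reference, the change is roughly of this shape; the names and the stored structure below are illustrative, not my actual code.)

```rust
use std::collections::HashMap;
use std::sync::Mutex; // was tokio::sync::Mutex

// Illustrative shared state for buffered shuffle blocks.
#[derive(Default)]
struct ShuffleStore {
    // The std Mutex works here because the critical section is short
    // and the guard is never held across an .await point.
    blocks: Mutex<HashMap<u64, Vec<u8>>>,
}

impl ShuffleStore {
    fn append(&self, partition: u64, data: &[u8]) {
        let mut blocks = self.blocks.lock().unwrap();
        blocks.entry(partition).or_default().extend_from_slice(data);
        // guard dropped here, before the caller hits any .await
    }
}
```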
I can't offer any specific help with this, but I can do the standard check: are you sure you're testing an optimised --release build? By default, cargo produces unoptimised debug builds.
Although the Tokio mutex is slower than the std mutex, the difference is not that big. If you're measuring whole seconds, then the problem is something else.
The tokio Mutex is actually just unscheduling the current task until another task releases the lock: the delay until the lock succeeds is due either to another task simply holding it that long (often because it locks over too long a section of code), or to you simply being out of worker threads to schedule the task onto so it can continue after the .lock().await.
You should in theory be able to test which of these it is by increasing the tokio worker pool thread count (often by using something like #[tokio::main(worker_threads = 20)] - it defaults to your CPU count) or by also instrumenting the time from lock to unlock. For the first, you may need to ensure that you don't let any tasks run too long between .awaits, so tokio can service all your tasks quickly.
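Something along these lines (just a sketch; the thread count and the Vec<u8> state are placeholders) would let you separate the time spent waiting for the lock from the time it is actually held:

```rust
use std::sync::Arc;
use std::time::Instant;
use tokio::sync::Mutex;

#[tokio::main(worker_threads = 20)] // default is the number of CPU cores
async fn main() {
    let state = Arc::new(Mutex::new(Vec::<u8>::new()));

    let waited = Instant::now();
    let mut guard = state.lock().await;
    println!("waited {:?} to acquire the lock", waited.elapsed());

    let held = Instant::now();
    guard.extend_from_slice(b"shuffle data"); // the critical section
    drop(guard);
    println!("held the lock for {:?}", held.elapsed());
}
```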
I'd start with the general theory that the tokio threads aren't running my task because they're doing something else. There are a bunch of tools I might reach for to see what that something else is, including:
- a wall clock profiler (e.g. a profiler configured to get a backtrace of every thread named tokio-* every 1/100th of a second).
- custom code that gets a single snapshot of all these threads right when I know the thing is broken, e.g. a timer that I start right before .lock().await and cancel right after (see the sketch after this list). I've written instrumentation like this but unfortunately it was in C++ and is buried in a previous employer's proprietary codebase. But it can be super valuable.
- a CPU profiler might do, if the problematic thing is CPU-bound, but that's not guaranteed.
- syscall-level stuff with eBPF or strace or the like, if the problem is something doing blocking syscalls from tokio threads.
- tracing, although it relies on the problematic thing having been annotated with e.g. #[tracing::instrument].
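Here's a minimal sketch of the second item in Rust (the 100ms threshold and the arm_watchdog name are just placeholders); if the watchdog fires, that's the moment to grab a thread or await-tree dump:

```rust
use std::time::Duration;
use tokio::sync::oneshot;

// If `cancel` isn't signalled within `limit`, log (or snapshot the
// runtime threads) so the stall is caught while it is happening.
fn arm_watchdog(limit: Duration) -> oneshot::Sender<()> {
    let (cancel, cancelled) = oneshot::channel::<()>();
    tokio::spawn(async move {
        tokio::select! {
            _ = cancelled => {} // lock acquired in time, nothing to do
            _ = tokio::time::sleep(limit) => {
                // Trigger whatever diagnostics you have here.
                eprintln!("lock acquisition exceeded {:?}", limit);
            }
        }
    });
    cancel
}

// Usage around the suspect lock:
// let watchdog = arm_watchdog(Duration::from_millis(100));
// let guard = shared.lock().await;
// let _ = watchdog.send(()); // cancel the watchdog
```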
When a task has a busy time that keeps increasing, but the number of polls is not increasing, then you are blocking the thread. The link explains what that means.
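If that's what you're seeing, the usual remedy is to push the blocking work onto the blocking pool; a rough sketch (write_to_disk stands in for whatever synchronous work is hogging the worker thread):

```rust
// Instead of doing blocking I/O directly inside the async task:
async fn store_block(data: Vec<u8>) -> std::io::Result<()> {
    // spawn_blocking moves the synchronous work to tokio's blocking pool,
    // so the worker thread stays free to poll other tasks.
    tokio::task::spawn_blocking(move || write_to_disk(&data))
        .await
        .expect("blocking task panicked")
}

// Stand-in for the actual blocking operation.
fn write_to_disk(data: &[u8]) -> std::io::Result<()> {
    std::fs::write("/tmp/shuffle.block", data)
}
```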
After using await-tree and tokio-console, I finally replaced tokio::sync::Mutex with the std mutex, and it looks really fast. So for my test case, the performance loss from tokio::sync::Mutex is huge under highly concurrent gRPC requests.