I have an RPC service that spawns a lot of tasks (about 100) to do work whenever a request comes in (about 2000 QPS). I profiled it and found that a lot of CPU time is spent in tokio::runtime::task::list::OwnedTasks<S>::bind_inner and tokio::runtime::scheduler::multi_thread::worker::<impl tokio::runtime::task::Schedule for alloc::sync::Arc<tokio::runtime::scheduler::multi_thread::handle::Handle>>::release.
Any ideas on how to optimize this? The only thing I can think of is a task pool, so that tasks are reused.
Clarification: You are spawning 100 tasks per request?
It sounds like you have too many tasks, so that the overhead of task management is greater than the work each task is doing. You should try to make fewer tasks which each do more of the work. If it's not clear how to do that, show us your code.
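To make "fewer tasks which each do more of the work" concrete, here is a minimal sketch. It is not your code: `do_job` is a hypothetical stand-in for the small synchronous job, and it uses `std::thread::scope` instead of tokio tasks only to stay dependency-free. The same chunking idea should carry over to spawned tasks: one unit of work per chunk rather than one per item.

```rust
use std::thread;

/// Hypothetical stand-in for the small, mostly synchronous job (assumption).
fn do_job(x: u64) -> u64 {
    x * x
}

/// Runs `do_job` over `items` with at most `workers` threads, each handling
/// one contiguous chunk, and returns the sum of all results.
fn process_in_chunks(items: &[u64], workers: usize) -> u64 {
    if items.is_empty() {
        return 0;
    }
    let workers = workers.max(1);
    // Ceiling division, so at most `workers` chunks are produced.
    let chunk_len = (items.len() + workers - 1) / workers;
    thread::scope(|s| {
        // One spawn per chunk instead of one spawn per item.
        let handles: Vec<_> = items
            .chunks(chunk_len)
            .map(|chunk| s.spawn(move || chunk.iter().map(|&x| do_job(x)).sum::<u64>()))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().unwrap())
            .sum::<u64>()
    })
}
```

With 100 items and `workers = 4`, this spawns 4 units of work instead of 100, so the per-spawn bookkeeping is paid 4 times per request rather than 100.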
I think you are right. Each spawned task does very little work, and most of it is synchronous code. But I want maximum parallelism. Should I use a rayon thread pool for this? If so, I need to find a way to communicate between the sync and async code.
Is this program intended to run on a CPU with ~100 cores? Also, is it important to keep all the cores busy even for a single request?
Keep in mind that synchronization itself can be quite costly, often more costly than the actual logic. Sometimes making things parallel slows the job down in wall-clock time, because it adds more synchronization work than it gains from parallelization.
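A rough way to see the synchronization cost for yourself is to compare a version that touches shared state once per item with one that accumulates locally and synchronizes once per thread. The sketch below is illustrative only: the function names are made up, and summing integers stands in for the real work.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

/// Splits `items` into at most `workers` chunks and hands each chunk, plus the
/// shared counter, to `per_chunk` on its own thread.
fn run_chunks(items: &[u64], workers: usize, per_chunk: fn(&[u64], &AtomicU64)) -> u64 {
    let workers = workers.max(1);
    // Ceiling division, clamped to 1 so `chunks` never receives 0.
    let chunk_len = ((items.len() + workers - 1) / workers).max(1);
    let total = AtomicU64::new(0);
    thread::scope(|s| {
        for chunk in items.chunks(chunk_len) {
            let total = &total;
            s.spawn(move || per_chunk(chunk, total));
        }
    });
    // All threads are joined when the scope ends, so the final value is visible.
    total.load(Ordering::Relaxed)
}

/// Heavy synchronization: one atomic update per item.
fn sum_shared(items: &[u64], workers: usize) -> u64 {
    run_chunks(items, workers, |chunk, total| {
        for &x in chunk {
            total.fetch_add(x, Ordering::Relaxed);
        }
    })
}

/// Light synchronization: sum locally, then one atomic update per thread.
fn sum_local(items: &[u64], workers: usize) -> u64 {
    run_chunks(items, workers, |chunk, total| {
        let local: u64 = chunk.iter().sum();
        total.fetch_add(local, Ordering::Relaxed);
    })
}

/// Baseline with no threads and no synchronization at all.
fn sum_sequential(items: &[u64]) -> u64 {
    items.iter().sum()
}
```

All three return the same total. Timing each of them with `std::time::Instant` on your own hardware and input sizes is the only reliable way to tell which is faster; for small inputs the sequential version may well win.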