Tokio is faster than async-std, but can I reduce overhead?

I am developing an interpreter in Rust to execute Rholang, which is a truly parallel language.
So there are a lot of small chunks of code to be executed in parallel, and they rely heavily on Rust's futures and await.

I tried both the tokio and async-std runtimes.

Source code based on tokio is here. Its flame graph shows 60.58% of the time is spent in tokio-runtime.

Source code based on async-std is here. Its flame graph shows 79.44% of the time is spent in async-std/runtime.

The two versions are almost identical, except that each uses its own runtime's task::spawn.

Tokio is faster in my case. But can I reduce its overhead? Maybe the old-style futures-rs combinators would be better?

Thank you in advance.

It seems worth mentioning that just because a frame labeled "Tokio" takes up 60% at the bottom doesn't mean that all of that 60% is overhead. For example, in the left part of the big Tokio block, you see a lot of frames labelled rho_something if you hover over the frames above it. That these have a Tokio runtime frame below them just means that these methods of yours were running inside the Tokio runtime.
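To make the inclusive-versus-self distinction concrete, here is a minimal sketch of how one could separate the two. It is not taken from your profile: the `Frame` type, the helper functions, and all sample counts are invented for illustration, and the `rho_eval` / `rho_send` names are hypothetical stand-ins for whatever rho_something frames your graph actually shows.

```rust
/// One box in a flame graph: a function name, the number of samples recorded
/// while it was anywhere on the stack (inclusive), and the frames it called.
struct Frame {
    name: &'static str,
    inclusive: u64,
    children: Vec<Frame>,
}

impl Frame {
    /// Samples spent in this frame itself, excluding everything it called.
    fn self_samples(&self) -> u64 {
        let in_children: u64 = self.children.iter().map(|c| c.inclusive).sum();
        self.inclusive.saturating_sub(in_children)
    }
}

/// Sum the self samples of every frame whose name starts with `prefix`.
fn self_samples_for(frame: &Frame, prefix: &str) -> u64 {
    let own = if frame.name.starts_with(prefix) {
        frame.self_samples()
    } else {
        0
    };
    own + frame
        .children
        .iter()
        .map(|c| self_samples_for(c, prefix))
        .sum::<u64>()
}

/// Hypothetical profile; every number here is made up for the example.
fn example_profile() -> Frame {
    Frame {
        name: "tokio-runtime",
        inclusive: 600,
        children: vec![
            Frame { name: "rho_eval", inclusive: 450, children: vec![] },
            Frame { name: "rho_send", inclusive: 100, children: vec![] },
        ],
    }
}

fn main() {
    let profile = example_profile();
    // The wide box at the bottom: everything that ran under the runtime.
    println!("tokio inclusive: {}", profile.inclusive);
    // What the runtime itself cost, once the code it ran is subtracted.
    println!("tokio self:      {}", self_samples_for(&profile, "tokio"));
    // Time spent in the interpreter's own functions.
    println!("rho_ self:       {}", self_samples_for(&profile, "rho_"));
}
```

In this made-up profile the runtime frame is 600 samples wide, but only 50 of those are the runtime's own work; the other 550 belong to the interpreter functions stacked on top of it. The same subtraction applied to the real graph is what tells you how much there is to gain by reducing runtime overhead.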

I would not expect the old futures-rs combinators to be faster.