Weird hotspot in tokio runtime

Hello, I'm using tokio to run some IO work (tokio v0.2.24, tonic v0.3.1, sqlx v0.3.5), but I suddenly hit a high CPU usage problem. I didn't capture enough runtime information when it happened; I only have Linux perf output, which shows the top hotspots as:

dashmap::lock::RwLock$LT$T$GT$::try_read::h215958eb8505e138
core::sync::atomic::atomic::sub::h572213ee1ba22802
dashmap::lock::RwLock$LT$T$GT$::read::h80eeeaee7796694d
core::sync::atomic::AtomicUsize::fetch_sub::hd962c365ed939d6

I suspect I'm posting too many jobs to tokio (or maybe there is a bug in some library?), but since they run asynchronously, how can I inspect the call trace from such a profile? Thanks for any advice.

Without further context, it seems most of the CPU time is spent in the dashmap crate, which is not a dependency of tokio. Maybe some other code is touching a synchronized map too often?

Yes, I do use dashmap to store some shared state, and I use it in tokio async jobs. Here is a stripped-down version of my logic:


pub struct IoContext {
    // built with threaded_scheduler().enable_all().core_threads(8)
    runtime: tokio::runtime::Runtime,

    // map key is name of corresponding grpc client name
    // map value 1st field is the grpc server address url
    //                   2nd field is the grpc client object,  None if not connected/disconnected
    grpc_client_map: Arc<DashMap<String, (String, Option<GrpcServiceClient<Channel>>)>>,
}

// IoContext is created in Rust and handed to C by 'leaking' it with Box::into_raw

// the following function can be called from any thread on the C side
fn io_context_call_grpc_rust(
    io_context: *mut IoContext,
    grpc_client_key_cstr: *const c_char
) -> anyhow::Result<()> {
    let grpc_client_key = cstr_to_owned(grpc_client_key_cstr)?;
    let obj: Box<IoContext> = unsafe { Box::from_raw(io_context as *mut IoContext) };
    let leaked: &'static mut IoContext = Box::leak(obj);

    let op = CallGrpcOp {
        grpc_client_key: grpc_client_key.to_owned()
    };

    let grpc_client_map = leaked.grpc_client_map.clone();
    leaked.runtime.spawn(async move {
        let mut client = grpc_client_map.get_mut(&op.grpc_client_key).unwrap();
        if client.1.is_none() {
            // try to reconnect
            match reconnect_to(client.0.clone()).await.unwrap() { // ------------> yield point reached here
                Some(new_one) => {
                    client.1.replace(new_one); // is the dashmap entry still safe to use here?
                }
                None => {
                    // still not connectable, give up
                    return;
                }
            };
        }
        let call_res = client.1.as_mut().unwrap().real_grpc_call().await; // is the dashmap entry still safe to use here?
    });

    Ok(())
}

I have a question about using dashmap together with tokio: I'm holding a dashmap entry across the yield point. Is that safe?

-----edit----
BTW, I'm using dashmap v4.0.1, and I found a closed dashmap issue which says v4.0 can be used safely within tokio (Deadlock when working within async threads · Issue #79 · xacrimon/dashmap).

No, this blocks the thread.

Thanks for the information. Even though the author claimed that the version I use should work within a tokio context, I tried the test code in my environment and found that it deadlocks.

It will work as long as you don't keep it locked across a yield point.

Does 'keep it locked' mean holding the write lock (the guard returned by get_mut)? It seems that using dashmap in such a situation leads to 'silent' bugs. If I replace it with Arc<tokio::sync::RwLock>, would it work as expected?

Yes, by "holding the lock" I mean keeping the guard returned by get or get_mut alive when you do an .await. And indeed, it can easily lead to silent bugs. The Tokio tutorial actually has a page on the same problem (the "Shared state" page), just with std::sync::Mutex. The difference is that if you keep a std::sync::Mutex guard alive across an .await in a spawned task, the code will typically fail to compile with an error about Send, as described on that page, whereas the dashmap crate produces no such error because its Ref and RefMut types are Send.
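To make the mechanism concrete, here is a minimal sketch that uses only the standard library. It assumes a std::sync::Mutex as a stand-in for the map's internal lock and polls the task by hand instead of using the Tokio runtime; YieldOnce, NoopWaker, hold_across_yield and lock_held_while_suspended are hypothetical names invented for this illustration.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical future that returns Pending once, then Ready.
/// It plays the role of any `.await` that actually suspends the task.
struct YieldOnce {
    yielded: bool,
}

impl Future for YieldOnce {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.yielded {
            Poll::Ready(())
        } else {
            self.yielded = true;
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

/// Waker that does nothing; enough for polling by hand.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// The problematic shape: the guard is alive on both sides of the `.await`.
async fn hold_across_yield(shared: Arc<Mutex<i32>>) {
    let mut guard = shared.lock().unwrap();
    *guard += 1;
    YieldOnce { yielded: false }.await; // the task is suspended here, still holding the lock
    *guard += 1;
}

/// Returns (lock was held while the task was suspended, lock was free after it finished).
pub fn lock_held_while_suspended() -> (bool, bool) {
    let shared = Arc::new(Mutex::new(0));
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut task = Box::pin(hold_across_yield(shared.clone()));

    // First poll: the task takes the lock and then suspends at the yield point.
    let first = task.as_mut().poll(&mut cx);
    let held_while_suspended = first.is_pending() && shared.try_lock().is_err();

    // Second poll: the task resumes, finishes, and drops the guard.
    let second = task.as_mut().poll(&mut cx);
    let free_after_completion = second.is_ready() && shared.try_lock().is_ok();

    (held_while_suspended, free_after_completion)
}
```

While the task is suspended, nobody else can take the lock. The sketch uses try_lock so that it can observe this without hanging; a blocking lock call made at that point by another task on the same worker thread would block that thread, and the suspended task could not be resumed there to release the lock.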

Generally, I tend to recommend the pattern that the article mentions under the heading "Restructure your code to not hold the lock across an .await". The principle behind this pattern is:

If you only ever lock it in non-async methods, then you cannot possibly keep it across an .await accidentally, as non-async methods have no .awaits.

So the idea behind the pattern is to define a wrapper struct around the map and only ever lock it in non-async methods defined on that struct. For instance:

use std::sync::Mutex;

struct CanIncrement {
    mutex: Mutex<i32>,
}
impl CanIncrement {
    // This function is not marked async.
    fn increment(&self) {
        let mut lock = self.mutex.lock().unwrap();
        *lock += 1;
    }
}

async fn increment_and_do_stuff(can_incr: &CanIncrement) {
    can_incr.increment();
    do_something_async().await;
}

By using this pattern, you are guaranteed to not have such bugs.
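Applied to a map of clients like the one in the question, the same pattern might look roughly like the sketch below. It is only an illustration under stated assumptions: a std::sync::RwLock around a HashMap stands in for the DashMap so the example needs no external crates, Client is a hypothetical stand-in for a cheaply cloneable gRPC client handle, and ClientRegistry, register, lookup, store_client, client_for and this simplified reconnect_to are made-up names.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Hypothetical stand-in for a cheaply cloneable gRPC client handle.
#[derive(Clone, Debug, PartialEq)]
pub struct Client {
    pub address: String,
}

/// Wrapper that owns the map. Every method is non-async, so no lock guard
/// can outlive a method call, let alone an `.await`.
#[derive(Default)]
pub struct ClientRegistry {
    // key: client name, value: (server address, client if connected)
    inner: RwLock<HashMap<String, (String, Option<Client>)>>,
}

impl ClientRegistry {
    /// Adds a not-yet-connected entry.
    pub fn register(&self, name: &str, address: &str) {
        let mut map = self.inner.write().unwrap();
        map.insert(name.to_owned(), (address.to_owned(), None));
    }

    /// Returns owned copies of the address and the client, if the entry exists.
    pub fn lookup(&self, name: &str) -> Option<(String, Option<Client>)> {
        let map = self.inner.read().unwrap();
        map.get(name).cloned()
    }

    /// Stores a connected client; returns false if the entry does not exist.
    pub fn store_client(&self, name: &str, client: Client) -> bool {
        let mut map = self.inner.write().unwrap();
        match map.get_mut(name) {
            Some(entry) => {
                entry.1 = Some(client);
                true
            }
            None => false,
        }
    }
}

/// Hypothetical reconnect helper; a real one would perform network IO.
async fn reconnect_to(address: String) -> Option<Client> {
    Some(Client { address })
}

/// Async code only ever sees owned values, never a guard.
pub async fn client_for(registry: &ClientRegistry, name: &str) -> Option<Client> {
    let (address, existing) = registry.lookup(name)?; // lock released before returning
    if let Some(client) = existing {
        return Some(client);
    }
    let fresh = reconnect_to(address).await?; // no lock is held across this .await
    registry.store_client(name, fresh.clone());
    Some(fresh)
}
```

One trade-off of this sketch: two tasks that find the entry disconnected at the same time may both reconnect, and the later store_client call wins. If that is not acceptable for your code, that is the kind of situation where holding an async lock for the whole operation makes sense.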


As for tokio::sync::RwLock: yes, it would avoid the bug. In fact, the entire reason that Tokio has its own locks is that they can be safely held across an .await, because waiting for such a lock yields to the runtime instead of blocking the thread. However, async locks are a lot more expensive than non-async ones, so if you are able to write your code using the pattern I outlined above instead, that would lead to much more efficient code.

Wow, thanks for such a detailed answer!

You're welcome. Generally, the purpose of the async locks in Tokio is for the case where correctness of your code requires that the resource is kept locked for the full duration of an .await. If you do not need this for your code to be correct, then it is generally better to avoid doing it.

Great, thanks, it helped a lot.