Possible issue with tokio::time::Interval

Hi, we have an issue which we suspect is with tokio::time::Interval. What we do is collect telemetry on a minute interval using tick(). We are seeing some strange behaviour where a bunch of our devices stop collecting telemetry at roughly the same time, around 20% of our total devices. Has anyone seen Interval fail silently after a while? We have seen this happen twice, and both times it happened at roughly the same point after the system was last restarted, i.e. the app uptime was the same on both occasions.

How are you using it? It sounds like you are dropping the future for the interval under certain conditions.
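
For example (just a sketch, not your code): if the future that owns the Interval is dropped, the ticks stop silently.

use std::time::Duration;
use tokio::time::interval;

#[tokio::main]
async fn main() {
    // If the future that owns the Interval is dropped -- for example the
    // task is aborted, or the loop future loses a race in an outer select! --
    // the ticks just stop, with no error anywhere.
    let handle = tokio::spawn(async {
        let mut ival = interval(Duration::from_secs(60));
        loop {
            ival.tick().await;
            // collect telemetry ...
        }
    });

    // "Under certain conditions", somewhere else:
    handle.abort(); // the loop and its Interval are dropped; no more ticks
}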

I am using it in a select loop where I also handle some messages as well as a watchdog. The issue only happens in one of our applications, but there it happens across all the select branches where we use the .tick() function. The ticks that fail silently all have the same duration. Could the issue lie deeper in the Tokio stack, or would it be related to the way we handle the polling?

You should investigate whether you are blocking the thread anywhere. One way to do this is with tokio-console.
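
Roughly, enabling it looks like this (a sketch assuming the console-subscriber crate; the binary also needs to be built with RUSTFLAGS="--cfg tokio_unstable"):

// Sketch: instrument the runtime so tokio-console can attach.
#[tokio::main]
async fn main() {
    // Starts the console server (default 127.0.0.1:6669) and installs a
    // tracing subscriber that records per-task poll and wake metrics.
    console_subscriber::init();

    // ... rest of the application ...
}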

Yeah, that is something we have thought of, and maybe that's the next step. Since this is currently happening on units where we don't have console enabled, we would have to re-deploy and wait for the issue to happen again. One interesting observation: if we set a new Interval (we have the ability to change it), the data starts reporting correctly again, which leads me to think that somewhere deep down the interval gets blocked, as you say, and re-initializing resolves it, if that makes sense?

With the exception of cases where the user was blocking the thread, there have never been any Tokio bug reports that sound even remotely similar to that. Not even any reports that were discovered to be due to something else.

Okay, interesting. Yes, it's probably something blocking in our code. The fact that it only happens in one of our apps and not all of them supports that theory as well. We also have extra complexity in these tasks: we use tokio-modbus to collect the data in 2 out of those 4 threads. These data collections take between 1 and 4 seconds, so maybe something builds up over time (we see this issue happen after around 2 weeks, and quite simultaneously across the affected units).

But one observation that would support "not blocking the thread" is that the other branches in the select loop still trigger. Or are we missing something?

Well, that rules out the simplest cases, but it could still be blocking the thread. For example, imagine that the IO/timer driver is associated with a thread that is blocked on a buggy blocking tokio-modbus operation. Then something happens externally that causes the IO/timer driver to get moved to another worker thread that is not blocked, and everything suddenly triggers at once.
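
To make that concrete, here is a minimal sketch (using a current_thread runtime for simplicity; on a multi-thread runtime it depends on which worker currently owns the driver):

use std::time::Duration;

#[tokio::main(flavor = "current_thread")]
async fn main() {
    tokio::spawn(async {
        let mut ival = tokio::time::interval(Duration::from_secs(1));
        loop {
            ival.tick().await;
            println!("tick at {:?}", tokio::time::Instant::now());
        }
    });

    // Let the interval task start ticking first.
    tokio::time::sleep(Duration::from_secs(3)).await;

    // A blocking call on the runtime thread also blocks the timer driver:
    // no ticks are printed for the next 10 seconds.
    std::thread::sleep(Duration::from_secs(10));

    // Once the thread is unblocked the interval resumes (and, with the
    // default MissedTickBehavior, fires the missed ticks in a burst).
    tokio::time::sleep(Duration::from_secs(5)).await;
}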

Interesting and good point. When you say "happens externally" could you elaborate a bit on what you would classify as externally?

Anything that happens on a thread other than the one that is blocked.

Okay, all our modbus operations seem to work fine when we call them directly over dbus using busctl. But the issue occurs when we call them over our dbus proxy. Could it be that we get locked out somewhere between the proxy and our select loop?

I would like to add more to this topic as we are still observing the issue and we don't really have any ideas on how to debug it further.

As suggested, we have set up tokio-console, but we have not yet been able to reproduce the issue on the units where it is enabled.

I will give some code context in response to the suggestion that we might be dropping the future for the tick.

With some simplification, this is our setup

pub struct TelemetryController {
    message_receiver: mpsc::Receiver<Message>,
    telemetry_poll_interval: SkipInterval,
}

impl TelemetryController {
    pub async fn spawn(
        message_receiver: mpsc::Receiver<Message>,
    ) -> Result<(), Error> {
        let telemetry_controller = Self {
            message_receiver,
            telemetry_poll_interval: SkipInterval::new(Duration::from_secs(INTERVAL_SECONDS)),
        };
        tokio::spawn(telemetry_controller.run());
        Ok(())
    }

    async fn run(mut self) {
        trace!("Running telemetry controller");

        let mut watchdog_request = Watchdog::subscribe("Telemetry controller").await;
        loop {
            tokio::select! {
                tick = self.telemetry_poll_interval.tick() => {
                    debug!("tick: {tick:?} now: {:?}", tokio::time::Instant::now());
                    self.collect_telemetry().await;
                }
                Some(notification_request) = watchdog_request.recv() => {
                    notification_request.notify();
                }
            }
        }
    }
}

The SkipInterval is just a thin wrapper around the Tokio Interval:

use tokio::time::{interval, Duration, Instant, Interval, MissedTickBehavior};

pub struct SkipInterval {
    interval: Interval,
}

impl SkipInterval {
    pub fn new(duration: Duration) -> Self {
        let mut interval = interval(duration);
        interval.set_missed_tick_behavior(MissedTickBehavior::Skip);
        Self { interval }
    }

    pub async fn tick(&mut self) -> Instant {
        self.interval.tick().await
    }
}

I don't expect the future to be dropped here.

With regards to blocking, our task should not be blocked:

The watchdog that we are using is set to notify Linux systemd at least once per minute, and systemd otherwise restarts the application, which is why I don't expect the task itself to be blocked.
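
(For context, the watchdog notification is roughly this shape; a sketch assuming the sd-notify crate and WatchdogSec= in the systemd unit, not our exact code.)

use sd_notify::NotifyState;

// Tells systemd the service is still alive; if this stops arriving within
// WatchdogSec, systemd restarts the application.
fn notify_watchdog() {
    let _ = sd_notify::notify(false, &[NotifyState::Watchdog]);
}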

The Tokio time task:

In order to check the condition of the Tokio time task, I have added a dbus interface to our application:

    fn get_time(&self) -> zbus::fdo::Result<String> {
        Ok(format!("{:?}", tokio::time::Instant::now()))
    }

When the tick is not firing, this instant is still increasing over time, which I interpret to mean that the time task is not blocked.
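
A possibly more direct check (a sketch with a hypothetical helper, not our current code) would be a heartbeat task driven by its own interval, so the dbus method reports the last time a timer actually fired rather than just the current clock reading:

use std::sync::{Arc, Mutex};
use tokio::time::{interval, Duration, Instant};

// Hypothetical helper: a 1-second heartbeat whose last-fired time could be
// exposed over the existing dbus interface. Unlike Instant::now(), this
// actually goes through the timer driver on every tick.
fn spawn_timer_heartbeat() -> Arc<Mutex<Instant>> {
    let last_tick = Arc::new(Mutex::new(Instant::now()));
    let shared = last_tick.clone();
    tokio::spawn(async move {
        let mut heartbeat = interval(Duration::from_secs(1));
        loop {
            heartbeat.tick().await;
            *shared.lock().unwrap() = Instant::now();
        }
    });
    last_tick
}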

The strange thing is that the application has multiple intervals, and we observe that some or all of them stop ticking; which intervals fail differs per system. The issue seems to be related to uptime, as it affects many (~10%) of our systems at roughly the same time after a reboot.

Any thoughts are welcome! Sorry for the long post!

EDIT: We used to use the default MissedTickBehavior (Burst) and observed the issue then as well. Switching to Skip was a reaction to these problems.
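
For reference, the difference between the two behaviours in a minimal sketch (1-second interval, 3-second stall):

use tokio::time::{interval, Duration, Instant, MissedTickBehavior};

#[tokio::main]
async fn main() {
    let mut ival = interval(Duration::from_secs(1));
    // With the default Burst, missed ticks fire back-to-back after a stall;
    // with Skip, they are dropped and ticking resumes on the schedule.
    ival.set_missed_tick_behavior(MissedTickBehavior::Skip);

    ival.tick().await; // the first tick always completes immediately
    std::thread::sleep(Duration::from_secs(3)); // simulate a stall
    println!("next tick: {:?}, now: {:?}", ival.tick().await, Instant::now());
}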

Also, to elaborate on what I mean by some or all of our intervals: here are some logs from when the problem occurs (we have multiple controllers similar to the code I included above).

Jan 04 21:07:40 DEBUG diagnostic_controller: tick: Instant { tv_sec: 1073653, ... } now: Instant { tv_sec: 1073653, ... }
Jan 04 21:07:40 DEBUG telemetry_controller: tick: Instant { tv_sec: 1073653, ... } now: Instant { tv_sec: 1073653, ... }
Jan 04 21:08:40 DEBUG diagnostic_controller: tick: Instant { tv_sec: 1073713, ... } now: Instant { tv_sec: 1073713, ... }
Jan 04 21:08:40 DEBUG telemetry_controller: tick: Instant { tv_sec: 1073713, ... } now: Instant { tv_sec: 1073713, ... }
Jan 04 21:09:40 DEBUG diagnostic_controller: tick: Instant { tv_sec: 1073773, ... } now: Instant { tv_sec: 1073773, ... }
Jan 04 21:10:40 DEBUG diagnostic_controller: tick: Instant { tv_sec: 1073833, ... } now: Instant { tv_sec: 1073833, ... }

Both intervals are ticking (two different controllers, once per minute), and then suddenly one of them stops producing ticks (only the diagnostic controller is still ticking once per minute).

We also have an API to change the frequency of data collection here.

Our select loop, again from above, also handles requests to change the frequency:

        loop {
            tokio::select! {
                tick = self.diagnostic_poll_interval.tick() => {
                    debug!("tick: {tick:?} now: {:?}", tokio::time::Instant::now());
                    self.collect_diagnostics().await;
                }
                Some(msg) = self.message_receiver.recv() => {
                    self.handle_message(msg).await;
                }
                Some(notification_request) = watchdog_request.recv() => {
                    notification_request.notify();
                }
            }
        }

The handle_message function can replace the interval

    async fn handle_message(&mut self, msg: Message) {
        match msg {
            Message::SetPollPeriod { period } => {
                self.diagnostic_poll_interval = SkipInterval::new(period);
            }
        }
    }

and replacing the interval like this makes it tick again.
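
(One detail that might matter here: a freshly created Interval completes its first tick() immediately, so swapping in a new SkipInterval produces a tick right away regardless of what state the old one was stuck in. A minimal sketch:)

use tokio::time::{interval, Duration};

#[tokio::main]
async fn main() {
    // The first tick() of a new Interval resolves immediately, before the
    // first full period has elapsed.
    let mut ival = interval(Duration::from_secs(60));
    let start = std::time::Instant::now();
    ival.tick().await;
    assert!(start.elapsed() < Duration::from_millis(100));
}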

Still investigating this; progress is slow as we only see it happening after about 2 weeks of uptime :slight_smile: