I've spent the past week tracking down a bug in production that I haven't been able to explain.
A tiny fraction of redis calls seem never to return.
The code in question looks roughly like
tracing::warn!("before");
let result = cmd.query_async(&mut *conn).await;
tracing::warn!("after");
and I see the before in my logs but not the after. Here cmd: &redis::Cmd and conn: redis::MultiplexedConnection. There is a 5-second timeout on the connection (quite high), and the largest latency we see on this call is about a third of a second (the call goes from Asia to a redis in Oregon, USA). If that timeout were triggered, we'd expect the call to return with result.is_err() being true.
There is also a 10-second timeout, far outside this code, on the entire request, applied with tokio::time::timeout. If that timeout were somehow triggered, it ought to produce another log message (which I don't see). In other cases (e.g. other endpoints) that timeout is sometimes triggered, and there I do see the log message.
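Schematically the outer timeout is something like this (do_work and handle_with_timeout are just placeholder names, not our real code):

```rust
use std::time::Duration;

// do_work stands in for the real handler body; the redis call lives somewhere inside it.
async fn do_work() -> String {
    "response".to_string()
}

async fn handle_with_timeout() -> String {
    match tokio::time::timeout(Duration::from_secs(10), do_work()).await {
        // The inner future finished within 10 seconds.
        Ok(response) => response,
        // The timeout fired: the inner future is dropped and this branch
        // leaves a log line behind.
        Err(_elapsed) => {
            tracing::warn!("request timed out");
            "timed out".to_string()
        }
    }
}
```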
Another explanation for the missing after would be the whole process exiting, but the process keeps logging things for hours after the missing after. The whole thing is an axum server running in a pod under Kubernetes, and we can confirm that the pod is still running (as I mentioned before). I don't believe the program could have restarted without the entire pod dying and being replaced by another with a different identifier in the logs.
Any ideas what could be going wrong, or how I could even debug this? I'm reduced to println-style debugging in production, which is extremely slow.
This sounds like cancellation. It can happen when you have a tokio::select! anywhere up the call stack: when one arm completes, the futures in the other arms are dropped, and a dropped future is simply abandoned at whatever .await point it is currently suspended on. If one of those dropped futures happens to be sitting on the await line in question, the result is exactly what you describe.
select! is the most common way to cancel futures, but there are variations on the theme. futures_lite has a race() function which does the same thing.
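Here's a minimal, self-contained sketch of the mechanism, assuming tokio, tracing, and tracing_subscriber: the slow branch logs before, parks at an .await, and gets dropped when the other branch wins, so after never appears and no error is ever returned.

```rust
use std::time::Duration;

// Logs "before", then suspends; if this future is dropped while suspended,
// "after" is never logged and no error is ever surfaced.
async fn slow() {
    tracing::warn!("before");
    tokio::time::sleep(Duration::from_secs(5)).await; // dropped while parked here
    tracing::warn!("after"); // never reached
}

#[tokio::main]
async fn main() {
    tracing_subscriber::fmt::init();
    tokio::select! {
        _ = slow() => {}
        _ = tokio::time::sleep(Duration::from_millis(10)) => {
            tracing::warn!("other branch won; slow() was cancelled");
        }
    }
}
```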
We do use select! in a couple of places in our code, but none of them should be a parent of this task. I expect tokio::time::timeout also cancels the future it wraps (whether or not it literally uses select! internally), and that is used by a parent task, but it shouldn't trigger without leaving a log message.
We also have axum in our stack, so I suppose it's possible that axum itself is cancelling my task? It just doesn't seem like it ought to be doing so.
Do you have any suggestions (beyond auditing our code for these sorts of functions and macros) for how one might track down cases where a future is cancelled?
You might be able to use a "drop guard" to catch cancellation in the act. A backtrace captured by the drop guard will show you who is responsible for the drop.
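Something along these lines, as a rough sketch (CancelGuard and the label are made up; std::backtrace::Backtrace needs Rust 1.65+). You arm the guard before the .await and disarm it with mem::forget once the call has completed; if the future is dropped at the .await instead, Drop runs and captures a backtrace:

```rust
use std::backtrace::Backtrace;

struct CancelGuard {
    label: String,
}

impl Drop for CancelGuard {
    fn drop(&mut self) {
        // Runs only if the future holding the guard was dropped before
        // mem::forget disarmed it.
        tracing::warn!("future dropped at '{}':\n{}", self.label, Backtrace::force_capture());
    }
}

async fn instrumented_call() {
    let guard = CancelGuard { label: "redis query_async".to_string() };
    // let result = cmd.query_async(&mut *conn).await;
    tokio::time::sleep(std::time::Duration::from_millis(100)).await; // stand-in for the real call
    std::mem::forget(guard); // disarm: Drop never runs on the success path
}
```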
Yeah, very similar to this! The mem::forget is a memory leak. You can use the struct's own state to track whether it was completed successfully (bool, Option<String>, or even just the string contents). Combine it with the backtrace and you have a pretty robust troubleshooting tool for cancellation:
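A sketch of that stateful version (names again illustrative):

```rust
use std::backtrace::Backtrace;

// Same guard without mem::forget: a completed flag in the struct itself
// decides whether Drop should report anything.
struct CancelGuard {
    label: String,
    completed: bool,
}

impl CancelGuard {
    fn new(label: impl Into<String>) -> Self {
        CancelGuard { label: label.into(), completed: false }
    }

    fn complete(mut self) {
        self.completed = true;
        // The guard is dropped here with completed == true, so Drop stays quiet.
    }
}

impl Drop for CancelGuard {
    fn drop(&mut self) {
        if !self.completed {
            tracing::warn!("future dropped at '{}':\n{}", self.label, Backtrace::force_capture());
        }
    }
}

async fn instrumented_call() {
    let guard = CancelGuard::new("redis query_async");
    // let result = cmd.query_async(&mut *conn).await;
    tokio::time::sleep(std::time::Duration::from_millis(100)).await; // stand-in for the real call
    guard.complete();
}
```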
The expected stack frame appears in the backtrace:
```
4: playground::main::{{closure}}
   at ./src/main.rs:45:10
```
My final conclusion is that the cancellation was due to broken connections: the client breaks the connection, and then hyper decides it's not worth finishing building the response. That makes sense normally as a way to avoid DoS, but in my particular case we had some work that needed to be done regardless, so I needed to tokio::spawn the task that does that work to keep it from dying when the connection goes away.
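Schematically the fix looks something like this, with placeholder names standing in for the real handler and the real work:

```rust
use axum::{routing::post, Router};

// record_results stands in for the work that must finish even if the client
// disconnects; the endpoint path is illustrative.
async fn record_results() {
    // ... the redis call and related bookkeeping
}

async fn handler() -> &'static str {
    // Spawn the must-finish work as its own task: if hyper drops this handler
    // future because the client went away, the spawned task keeps running.
    let work = tokio::spawn(record_results());
    // Awaiting the JoinHandle keeps the normal path unchanged; if the handler
    // is cancelled here, the task still runs to completion in the background.
    let _ = work.await;
    "ok"
}

fn app() -> Router {
    Router::new().route("/endpoint", post(handler))
}
```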