I'm wondering how I can determine how much time may have passed between a request first being received and my handler starting to run.
It seems like when we get a huge spike of requests there is an increase in latency, but when we measure the latency of our handler it is consistently low. So I suspect that these requests are sitting in a buffer somewhere, and if that's happening on my machine then I'd like to be able to detect, as reliably as I can, that I'm encountering overload.
There are a number of steps that take place before axum invokes your handler. Some of them take place entirely outside of your program:
1. The client sends a TCP connection request. Your OS either responds that it cannot accept the connection, or accepts the connection and puts it in the socket's accept() queue.
2. Your program - through tokio's TcpListener and axum's underlying HTTP network glue - accepts the connection and starts a task to read from the connection and parse an HTTP request. This task goes into a tokio queue, whose behaviour depends on which tokio runtime you're using.
3. Axum parses the request, building up the necessary data structures to dispatch it to your application. This usually doesn't involve much queuing, but it is an async process, so the task handling the request could be suspended while other tasks run. Personally, this is where I'd consider a request to have been received, but it does happen after some potentially important latency sources.
4. Axum calls into your application's service. Middleware in that service, both your own and axum's, may process the request in a multitude of ways - including handling it directly. This is also a linear but async process, so it usually doesn't involve more queuing but is sensitive to load.
5. Finally, axum calls your handler function.
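For reference, here's roughly what measuring at the middleware step (step 4 above) looks like - a minimal sketch, assuming axum 0.7-style APIs, that stamps each request with an Instant in a layer and reads it back in the handler:

```rust
// Minimal sketch, assuming axum 0.7-style APIs: stamp the request when the
// middleware first sees it, then read that stamp in the handler.
use std::time::Instant;

use axum::{
    extract::Request,
    middleware::{self, Next},
    response::Response,
    routing::get,
    Extension, Router,
};

// Newtype so the extension has an unambiguous type.
#[derive(Clone, Copy)]
struct SeenAt(Instant);

async fn stamp(mut req: Request, next: Next) -> Response {
    // Record the moment this middleware first sees the request.
    req.extensions_mut().insert(SeenAt(Instant::now()));
    next.run(req).await
}

async fn handler(Extension(SeenAt(seen_at)): Extension<SeenAt>) -> String {
    // How long the request spent between the middleware and the handler body.
    format!("delay before handler: {:?}", seen_at.elapsed())
}

#[tokio::main]
async fn main() {
    let app = Router::new()
        .route("/", get(handler))
        .layer(middleware::from_fn(stamp));

    let listener = tokio::net::TcpListener::bind("0.0.0.0:3000").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}
```

This only observes from step 4 onward, though - anything queued earlier (the accept queue, tokio's task queue, HTTP parsing) is invisible to it.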
You can insert measurements in a few different places; you'll get different results (and spend more or less effort) depending on where you put them. The most comprehensive but least detailed timing would be from the client - timing the total request time - possibly with some help from the server to timestamp noteworthy events, like the start of your handler.
However, that information will only help you characterize your service's latency; it's too general to help you narrow down why the number is what it is. If you want to investigate any of the dark corners of this, then you'll need to make some decisions about which ones - for example, if you suspect that requests may be sitting in the accept queue for a long time, you'd want to look at your OS' network statistics.
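If you want the earliest timestamp your own program can produce, one option is to drive the accept loop yourself and stamp each connection the moment accept() returns (step 2 above), before any HTTP parsing. A rough sketch of the shape of that idea, assuming axum 0.7 together with hyper 1.x and hyper-util (it follows the pattern of axum's serve-with-hyper example, and isn't production-ready):

```rust
// Rough sketch, assuming axum 0.7 + hyper 1.x + hyper-util: own the accept
// loop so the accept time can be recorded and attached to every request that
// arrives on that connection.
use std::time::Instant;

use axum::{routing::get, Extension, Router};
use hyper::body::Incoming;
use hyper_util::rt::TokioIo;
use tokio::net::TcpListener;
use tower::Service;

#[derive(Clone, Copy)]
struct AcceptedAt(Instant);

async fn handler(Extension(AcceptedAt(t)): Extension<AcceptedAt>) -> String {
    format!("connection accepted {:?} ago", t.elapsed())
}

#[tokio::main]
async fn main() {
    let app = Router::new().route("/", get(handler));
    let listener = TcpListener::bind("0.0.0.0:3000").await.unwrap();

    loop {
        let (stream, _peer) = listener.accept().await.unwrap();
        // The earliest point this program can observe the connection.
        let accepted_at = AcceptedAt(Instant::now());
        let app = app.clone();

        tokio::spawn(async move {
            let io = TokioIo::new(stream);
            // Bridge hyper's service interface to the axum Router, stamping
            // each request on this connection with the accept time.
            let svc = hyper::service::service_fn(move |mut req: hyper::Request<Incoming>| {
                req.extensions_mut().insert(accepted_at);
                // Router is always ready, so calling it directly is fine here.
                app.clone().call(req)
            });
            if let Err(err) = hyper::server::conn::http1::Builder::new()
                .serve_connection(io, svc)
                .await
            {
                eprintln!("connection error: {err:#}");
            }
        });
    }
}
```

One caveat: this stamps every request with its connection's accept time, so with keep-alive, later requests on a reused connection will look older than they really are. Anything earlier than accept() - time spent in the kernel's accept queue - you'd still only see in OS-level network statistics.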
The client is a different company entirely, and while they give us some information (which is why we know there is a problem), we can't do very much with it.
I've already tracked the latency in middleware near the beginning of the processing, which doesn't show signs of the large latencies observed by the client.
It's possible that the latency is entirely due to requests being sent to the wrong region, possibly across the world, and I'll try to investigate that possibility by looking at the contents of requests to guess where they might have come from. Or maybe the sender IP address?
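For the IP angle, I'm assuming something like axum's ConnectInfo extractor would at least get us the peer address (rough sketch below, assuming axum 0.7); if there's a proxy or load balancer in front, the real client address would presumably be in a forwarded header instead.

```rust
// Rough sketch, assuming axum 0.7: expose the peer address to handlers via
// ConnectInfo. If a proxy or load balancer sits in front, the peer address
// will be the proxy's, and the client address lives in a forwarded header.
use std::net::SocketAddr;

use axum::{extract::ConnectInfo, routing::get, Router};

async fn handler(ConnectInfo(peer): ConnectInfo<SocketAddr>) -> String {
    format!("peer address: {peer}")
}

#[tokio::main]
async fn main() {
    let app = Router::new().route("/", get(handler));
    let listener = tokio::net::TcpListener::bind("0.0.0.0:3000").await.unwrap();
    axum::serve(
        listener,
        app.into_make_service_with_connect_info::<SocketAddr>(),
    )
    .await
    .unwrap();
}
```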
But I'd also like to know whether perhaps the trouble is that my container is CPU-bound due to a huge surge of requests. (We know we get surges where the QPS jumps by more than a factor of 1000 from its baseline.) It seems possible that there is a long queue waiting, so each request starts being handled long after it arrives, but is then quickly responded to. I'm currently looking at the axum connection_count() value, which may be enough to give us a hint that this is going on, but it would be lovely if we could somehow extract a "received" timestamp from as early as possible, so we could perhaps ignore requests that are already stale and manage to catch up with the ones that might still matter.
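Roughly, what I imagine doing with such a timestamp is something like the sketch below (names are made up; it assumes something earlier in the stack has already stashed a ReceivedAt(Instant) into the request's extensions):

```rust
// Hypothetical sketch: shed requests that have already waited too long,
// assuming an earlier layer stored ReceivedAt(Instant) in the extensions.
use std::time::{Duration, Instant};

use axum::{
    extract::Request,
    http::StatusCode,
    middleware::{self, Next},
    response::{IntoResponse, Response},
    routing::get,
    Router,
};

#[derive(Clone, Copy)]
struct ReceivedAt(Instant);

// Made-up budget: requests older than this are presumed abandoned.
const STALENESS_BUDGET: Duration = Duration::from_millis(500);

async fn shed_stale(req: Request, next: Next) -> Response {
    let received_at = req.extensions().get::<ReceivedAt>().copied();
    match received_at {
        // Already stale: answer cheaply so we can catch up on fresher requests.
        Some(ReceivedAt(t)) if t.elapsed() > STALENESS_BUDGET => {
            StatusCode::SERVICE_UNAVAILABLE.into_response()
        }
        _ => next.run(req).await,
    }
}

#[tokio::main]
async fn main() {
    let app: Router = Router::new()
        .route("/", get(|| async { "ok" }))
        .layer(middleware::from_fn(shed_stale));

    let listener = tokio::net::TcpListener::bind("0.0.0.0:3000").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}
```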
Any suggestions for how we might do that would be welcome!
I see - I'm familiar with that approach (e.g. TraceLayer); it's functionally equivalent to what we already do, namely running some middleware that times the request. My problem is that this doesn't show the large latencies observed by our client, and I'm hoping there is some way to get an earlier timestamp, to see whether something has already wasted time before our middleware even gets a chance to run.
Unless perhaps there is some magic in that blog post regarding how to get a timestamp from before the middleware started running?