I have a single thread app (tokio worker_threads = 1) that calls a bunch of API endpoints using the curl crate. In a for loop, I iterate through the endpoints and call an async function send_requests(endpoint_ips), store the futures in a vector and then call join_all(futures).
Since the send_requests() is called with the list of IPs for a given endpoint, I use the curl's Multi to connect to all the given IPs simultaneously, as such:
while multi.perform().unwrap() > 0 {
tokio::task::yield_now().await;
}
After that I collect the metrics from the Easy2 handles which were added to the Multi handle. For instance, connect_time()[1].
The problem with this is that the timing reported by the methods like connect_time(), pretransfer_time(), etc, seem to be getting skewed by the number of async tasks I have. For example, if I double the number of endpoints that I'm calling, the timing reported by curl will increase significantly. What I suspect to be happening is that while tokio is busy scheduling and handling other tasks/futures, the clocks for the futures it's awaiting on are still running. But I'm new to Rust, so I may be completely wrong and not doing things the right way.
Have you folks run into similar issues? Any suggestions, ideas, thoughts?
But to be honest I don't think curl to be the issue here. If I have the same logic described in my post, but instead of calling send_requests() I call another async function called that uses Tokio's async DNS resolver AsyncResolver, I get inaccurate timing:
let resolver = AsyncResolver::tokio(conf, opts);
resolver.clear_cache();
let start = Instant::now();
let lookup = resolver.lookup_ip(endpoint).await;
let end = start.elapsed();
You can use tracing spans. Instrument your async functions with #[instrument]. The span is entered when the corresponding future is polled and exited when it returns from polling.
And then calling tokio's asyncresolver's lookup_ip() like this:
let span = span!(Level::INFO, "lookup_ip");
let f = resolver.lookup_ip(endpoint);
let lookup = f.instrument(span).await;
Log:
Jul 25 22:06:41.771 INFO lookup_ip: foo: close time.busy=490µs time.idle=34.7ms
Regardless of which endpoint I resolve, the time.busy is always in the microseconds, and I can confirm the dns lookup is not cached. So I don't think this value can be used as a way to knowing how much time the resolution itself took. According to the doc it means:
span’s busy time (the total time for which it was entered) and idle time (the total time that the span existed but was not entered).