Thank you so much for the detailed explanation.
So it is still quite hard to pinpoint the exact root cause right now; the most likely cause is resource exhaustion. Anyway, I have just made a new version of the code. I decided to drop atomics completely, since I need to store all the latency data for further statistical analysis (like median and mode). I am also working on categorizing latencies into ranges and counting how many fall into each one (0–50 ms, 50–100 ms, 100–150 ms, 150–200 ms, and 200+ ms), so I can see, for example, how many requests landed between 150 and 200 ms; that gives much more detailed information. On top of that, I plan to add the ability to test multiple endpoint URLs at the same time, e.g. hitting "/url_1" and "/url_2" simultaneously, so it can give a clear view of how much long CPU-bound code inside the async event loop threads (tokio worker threads, in tokio's case) degrades performance; there is a rough sketch of that at the end of this post. I also plan to add an easy-to-use perf wrapper function to track other stats like CPU usage, cache misses, branch misses, and RAM usage.
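The range counting itself would look roughly like this (a minimal sketch; bucket_latencies is a placeholder name, and the boundaries are half-open, so an exactly 50 ms sample lands in the 50–100 ms bucket):

use std::time::Duration;

// Sketch of the planned bucketing: count samples per fixed range
// (0-50 ms, 50-100 ms, 100-150 ms, 150-200 ms, 200+ ms).
fn bucket_latencies(samples: &[Duration]) -> [usize; 5] {
    let mut counts = [0usize; 5];
    for s in samples {
        let idx = match s.as_millis() {
            0..=49 => 0,
            50..=99 => 1,
            100..=149 => 2,
            150..=199 => 3,
            _ => 4,
        };
        counts[idx] += 1;
    }
    counts
}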
The reason I do not use wrk is that it counts every request, regardless of whether it got a 200 OK or failed. I noticed this a few months ago when testing a simple scenario. I wanted to validate whether wrk was reliable, because my custom HTTP framework seemed to outperform Axum, Actix, etc., and that made me wonder if wrk itself might be buggy. To check it, I created a simple Axum backend with a shared counter like this:
// Simplified from my original pseudocode; a static atomic stands in here
// for the shared counter just so the handler is runnable.
use std::sync::atomic::{AtomicU64, Ordering};

static TOTAL_REQ: AtomicU64 = AtomicU64::new(0);

async fn count_handler() -> String {
    let total_req = TOTAL_REQ.fetch_add(1, Ordering::Relaxed) + 1;
    format!("current total reqs: {}", total_req)
}
Then I ran wrk against this counter endpoint. Since the counter increments on every incoming request, I expected wrk's reported request count to match the counter value. After wrk finished, I visited the counter endpoint manually in the browser to read the number; at that point, the counter should equal wrk's result + 1 (because of my manual visit). But the numbers did not match: wrk's total was much higher, more than 2x the counter value. That is when I realized wrk is not reliable; it gives the illusion of extremely high throughput even when the actual server counter shows a very different number.
After that, I tried k6, and its results matched my backend's counter perfectly. That gave me confidence that k6 is the trustworthy tool. Then I noticed k6 is written in Go, which made me think that if I built my own tool in Rust, I should be able to push even more requests. And indeed, my tool reaches 390,000 requests compared to k6's 310,000 under the same duration and max concurrency.
But I am still investigating latency: my tool shows a higher max latency than k6 even though the tested backend is the same. The max can reach 200–300 ms, while k6 maxes out around 150 ms. I am not sure yet whether this is caused by the extra 80,000 requests or by something still unoptimized in my code. Do you spot any possible causes in the code below?
Sorry for any bad English grammar, I am writing this through a translator.
use std::collections::HashMap;
use std::sync::Arc;
use std::time::Instant;
use crossbeam_channel::unbounded;
use tokio::runtime::Runtime;
use tokio::sync::Semaphore;
use tokio::task::JoinSet;
use tokio::time::Duration;

// One message per finished request: a latency sample on success,
// or just a "request completed" marker otherwise.
struct Data {
    time: Option<Duration>,
    total_send: Option<u64>,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let (s, r) = unbounded::<Data>();
    let start = Instant::now();
    let runtime = Runtime::new().unwrap();

    runtime.block_on(async {
        let url = "http://127.0.0.1:8080/4";
        // Caps in-flight requests at 200.
        let semaphore = Arc::new(Semaphore::new(200));
        let client = Arc::new(
            reqwest::Client::builder()
                // Idle pool size matches the concurrency cap so connections
                // get reused instead of reopened.
                .pool_max_idle_per_host(200)
                .pool_idle_timeout(Duration::from_secs(40))
                .timeout(Duration::from_secs(5))
                .build()
                .unwrap(),
        );
        let mut join_set = JoinSet::new();
        let batch_size = 100;
        let max_concurrent = 200;
        let s_ref = s.clone();

        while start.elapsed().as_secs() < 30 {
            for _ in 0..batch_size {
                if start.elapsed().as_secs() >= 30 {
                    break;
                }
                let client_ref = client.clone();
                let sem_ref = semaphore.clone();
                let s_ref = s_ref.clone();
                join_set.spawn(async move {
                    // Acquired inside the task, so spawned tasks queue here
                    // until one of the 200 permits frees up.
                    let _permit = sem_ref.acquire().await.unwrap();
                    let request_start = Instant::now();
                    match client_ref.get(url).send().await {
                        Ok(resp) if resp.status().is_success() => {
                            s_ref
                                .send(Data {
                                    time: Some(request_start.elapsed()),
                                    total_send: None,
                                })
                                .unwrap();
                        }
                        _ => {}
                    }
                    // Count every attempt, successful or not.
                    s_ref
                        .send(Data {
                            time: None,
                            total_send: Some(1),
                        })
                        .unwrap();
                });
            }
            // Backpressure: keep the JoinSet from growing past max_concurrent.
            while join_set.len() > max_concurrent {
                if join_set.try_join_next().is_some() {
                    continue;
                }
                tokio::task::yield_now().await;
            }
        }
        // Drain the remaining in-flight tasks.
        while join_set.join_next().await.is_some() {}
    });
    let duration = start.elapsed();
    // Every task-side sender clone is gone once block_on returns; dropping
    // the last sender closes the channel so r.iter() below terminates.
    drop(s);

    let mut times = vec![];
    let mut total_send = 0;
    for val in r.iter() {
        if let Some(val) = val.time {
            times.push(val.as_nanos() as u64);
        } else if let Some(val) = val.total_send {
            total_send += val;
        }
    }
    times.sort_unstable();
    let success = times.len();
    let req_per_sec = success as f64 / duration.as_secs_f64();
    println!(
        "Success: {}/{} in {:.2} seconds",
        success,
        total_send,
        duration.as_secs_f64()
    );
    println!("Requests per second: {:.2}", req_per_sec);
    if total_send > 0 {
        println!(
            "Success rate: {:.2}%",
            (success as f64 / total_send as f64) * 100.0
        );
    } else {
        println!("Success rate: 0.00%");
    }
    // The stats below need at least one latency sample.
    if success == 0 {
        return Ok(());
    }
    let total_times = times.iter().sum::<u64>();
    // `times` is sorted, so the ends are min and max.
    let min_ms = times[0] as f64 / 1_000_000.0;
    let max_ms = times[success - 1] as f64 / 1_000_000.0;
    let avg_ms = (total_times as f64 / success as f64) / 1_000_000.0;
    // Median: average the two middle samples when the count is even.
    let median_ms = if success % 2 == 0 {
        let mid = success / 2;
        (times[mid - 1] + times[mid]) as f64 / 2.0 / 1_000_000.0
    } else {
        times[success / 2] as f64 / 1_000_000.0
    };
    // Mode: the most frequent raw nanosecond sample (exact nanosecond ties
    // are rare, so this gets more useful once samples are bucketed).
    let mut freq: HashMap<u64, usize> = HashMap::new();
    for &t in &times {
        *freq.entry(t).or_insert(0) += 1;
    }
    let (mode_val, _) = freq.into_iter().max_by_key(|&(_, count)| count).unwrap();
    let mode_ms = mode_val as f64 / 1_000_000.0;
    // Nearest-index percentiles on the sorted samples.
    let p90_idx = (0.90 * (success as f64 - 1.0)) as usize;
    let p99_idx = (0.99 * (success as f64 - 1.0)) as usize;
    let p90_ms = times[p90_idx] as f64 / 1_000_000.0;
    let p99_ms = times[p99_idx] as f64 / 1_000_000.0;
    println!("Min: {:.2} ms", min_ms);
    println!("Max: {:.2} ms", max_ms);
    println!("Avg: {:.2} ms", avg_ms);
    println!("Median: {:.2} ms", median_ms);
    println!("Mode: {:.2} ms", mode_ms);
    println!("p90: {:.2} ms", p90_ms);
    println!("p99: {:.2} ms", p99_ms);
    Ok(())
}
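And for the multi-endpoint plan mentioned at the top, this is roughly the shape I have in mind (just a sketch: bench_url is a placeholder name, and it fires requests sequentially per endpoint to keep it short, unlike the semaphore/JoinSet setup above):

use std::sync::Arc;
use std::time::Instant;
use tokio::time::Duration;

// Run a timed GET loop against one endpoint and collect success latencies.
async fn bench_url(client: Arc<reqwest::Client>, url: &str, secs: u64) -> Vec<Duration> {
    let start = Instant::now();
    let mut latencies = Vec::new();
    while start.elapsed() < Duration::from_secs(secs) {
        let request_start = Instant::now();
        if let Ok(resp) = client.get(url).send().await {
            if resp.status().is_success() {
                latencies.push(request_start.elapsed());
            }
        }
    }
    latencies
}

#[tokio::main]
async fn main() {
    let client = Arc::new(reqwest::Client::new());
    // Hitting both endpoints at once should show how a CPU-heavy handler on
    // one route drags down the other route's latencies when the server's
    // tokio worker threads are shared.
    let (a, b) = tokio::join!(
        bench_url(client.clone(), "http://127.0.0.1:8080/url_1", 30),
        bench_url(client.clone(), "http://127.0.0.1:8080/url_2", 30)
    );
    println!("url_1: {} ok, url_2: {} ok", a.len(), b.len());
}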