Why is a custom-made Tokio HTTP server slower than framework-based ones like Axum, Actix, Ntex, and so on?

So I just tried this

use tokio::net::{TcpListener, TcpStream};
use tokio::io::{AsyncReadExt, AsyncWriteExt};

const MAX_HEADERS: usize = 32;

#[derive(Debug)]
struct HttpRequest<'a> {
    method: &'a [u8],
    path: &'a [u8],
    version: &'a [u8],
    headers: [(&'a [u8], &'a [u8]); MAX_HEADERS],
    header_count: usize,
    body: &'a [u8],
}

fn parse_http_request<'a>(raw: &'a [u8]) -> Option<HttpRequest<'a>> {
    let mut i = 0;
    let len = raw.len();

    fn next_line(input: &[u8], start: usize) -> Option<(usize, usize)> {
        let mut pos = start;
        while pos + 1 < input.len() {
            if input[pos] == b'\r' && input[pos + 1] == b'\n' {
                return Some((start, pos));
            }
            pos += 1;
        }
        None
    }

    let (line_start, line_end) = next_line(raw, i)?;
    let line = &raw[line_start..line_end];
    i = line_end + 2;

    let mut part_start = 0;
    let mut parts: [&[u8]; 3] = [&[]; 3];
    let mut part_index = 0;
    for pos in 0..line.len() {
        if line[pos] == b' ' && part_index < 2 {
            parts[part_index] = &line[part_start..pos];
            part_index += 1;
            part_start = pos + 1;
        }
    }
    parts[part_index] = &line[part_start..];

    let method = parts[0];
    let path = parts[1];
    let version = parts[2];

    let mut headers: [(&[u8], &[u8]); MAX_HEADERS] = [(&[], &[]); MAX_HEADERS];
    let mut header_count = 0;

    while i + 1 < len {
        if raw[i] == b'\r' && raw[i + 1] == b'\n' {
            i += 2;
            break;
        }

        let (line_start, line_end) = next_line(raw, i)?;
        let line = &raw[line_start..line_end];
        i = line_end + 2;

        if let Some(colon_pos) = line.iter().position(|&b| b == b':') {
            let key = &line[..colon_pos];
            let mut val_start = colon_pos + 1;
            if val_start < line.len() && line[val_start] == b' ' {
                val_start += 1;
            }
            let value = &line[val_start..];
            if header_count < MAX_HEADERS {
                headers[header_count] = (key, value);
                header_count += 1;
            }
        }
    }

    let body = &raw[i..];

    Some(HttpRequest {
        method,
        path,
        version,
        headers,
        header_count,
        body,
    })
}

async fn handle_connection(mut stream: TcpStream) {
    
    let mut buffer = [0u8; 8192];

    match stream.read(&mut buffer).await {
        Ok(n) if n == 0 => return,
        Ok(n) => {
            let data = &buffer[..n];

            if let Some(req) = parse_http_request(data) {
                let response = match req.method {
                    b"GET" => b"HTTP/1.1 200 OK\r\nContent-Length: 13\r\n\r\nHello, world!" as &[u8],
                    _ => b"HTTP/1.1 405 Method Not Allowed\r\nContent-Length: 0\r\n\r\n",
                };

                let _ = stream.write_all(response).await;
            } else {
                let _ = stream.write_all(b"HTTP/1.1 400 Bad Request\r\nContent-Length: 0\r\n\r\n").await;
            }
        }
        Err(_) => {}
    }
}

#[tokio::main]
async fn main() {
    let listener = TcpListener::bind("127.0.0.1:8080").await.unwrap();
    println!("Server running on 127.0.0.1:8080");

    loop {
        let (socket, _) = listener.accept().await.unwrap();
        tokio::spawn(handle_connection(socket));
    }
}

It has lower performance than the framework-based ones:

[root@localhost ~]# wrk -c 250 -d 15 -t 8 http://127.0.0.1:8080
Running 15s test @ http://127.0.0.1:8080
  8 threads and 250 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    70.84ms   67.03ms 508.20ms   75.74%
    Req/Sec   109.73     68.04   330.00     69.81%
  12214 requests in 15.09s, 620.24KB read
  Socket errors: connect 0, read 12180, write 0, timeout 0
Requests/sec:    809.15
Transfer/sec:     41.09KB

With Axum on the same machine (my Android phone) I can get this result:

[root@localhost axum]# wrk -c 250 -d 15 -t 8 http://127.0.0.1:8080
Running 15s test @ http://127.0.0.1:8080
  8 threads and 250 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     4.50ms    2.05ms  36.43ms   70.96%
    Req/Sec     5.41k   569.34    11.44k    74.41%
  644716 requests in 15.09s, 79.93MB read
Requests/sec:  42723.13
Transfer/sec:      5.30MB

What are the reasons behind this slower result?

Without looking into the details, this is most likely due to the lack of connection reuse (keep-alive).

In your code, each connection serves only a single request. In other words, each request needs an accept syscall and a close syscall, in addition to the recv()/send() calls for the data transfer.

Connection setup and teardown can have significant overhead in this kind of benchmark.

What is wrk? I also wrote my own HTTP server and am curious how it compares with Axum, Actix, and others.

Do you know of example code showing how to do that, sir?

Most likely this.

wrk is an HTTP benchmarking tool for testing HTTP/1.0 and HTTP/1.1, as far as I know. It doesn't support HTTP/2 or HTTP/3.

Here is the link: GitHub - wg/wrk: Modern HTTP benchmarking tool
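
For reference, in the wrk commands above, -c 250 is the number of open HTTP connections, -d 15 is the test duration in seconds, and -t 8 is the number of wrk threads.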

Can anyone verify this result? :>

Because I just tried editing the handle_connection() function like this:

async fn handle_connection(mut stream: TcpStream) {
    let mut buffer = [0u8; 8192];

    loop {
        let n = match stream.read(&mut buffer).await {
            Ok(0) => break,
            Ok(n) => n,
            Err(_) => break,
        };

        let data = &buffer[..n];

        let req = match parse_http_request(data) {
            Some(req) => req,
            None => {
                let _ = stream.write_all(b"HTTP/1.1 400 Bad Request\r\nContent-Length: 0\r\n\r\n").await;
                break;
            }
        };

        let response: &[u8] = match req.method {
            b"GET" => b"HTTP/1.1 200 OK\r\nContent-Length: 13\r\nConnection: keep-alive\r\n\r\nHello, world!" as &[u8],
            _ => b"HTTP/1.1 405 Method Not Allowed\r\nContent-Length: 0\r\nConnection: close\r\n\r\n" as &[u8],
        };

        if stream.write_all(response).await.is_err() {
            break;
        }

        // Check whether the client asked to close the connection
        let mut connection_close = false;
        for i in 0..req.header_count {
            if req.headers[i].0.eq_ignore_ascii_case(b"connection") && req.headers[i].1.eq_ignore_ascii_case(b"close") {
                connection_close = true;
                break;
            }
        }

        if connection_close {
            break;
        }
    }
}

Then suddenly the performance is a bit higher than Axum in req/s:

[root@localhost ~]# wrk -c 250 -d 15 -t 8 http://127.0.0.1:8080
Running 15s test @ http://127.0.0.1:8080
  8 threads and 250 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.95ms    2.07ms  42.41ms   75.72%
    Req/Sec     6.06k   842.34     8.24k    72.07%
  719507 requests in 15.09s, 52.15MB read
Requests/sec:  47672.38
Transfer/sec:      3.46MB

I don't know how to verify whether it's true ><

A couple of observations:

  1. Have you tried comparing it to hyper directly? axum uses it "behind the scenes". Removing the extra "stuff" axum does can at least help pinpoint the issue (a minimal sketch follows after this list).
  2. What does the performance look like if you run the accept loop in a spawned task? listener.accept() and tokio::spawn are called within Runtime::block_on (indirectly) as opposed to within a spawned task, which would be faster. Granted, if your axum test was doing the same, then they both should be similarly affected. See the comments by Alice Ryhl, who is one of the maintainers of tokio. Additionally, tokio mentions this as well.
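
For point 1, a rough sketch of serving the same fixed response on hyper directly (this assumes hyper 1.x with the http1/server features, plus the hyper-util, http-body-util, and bytes crates alongside tokio; it is not code from this thread):

use std::convert::Infallible;

use bytes::Bytes;
use http_body_util::Full;
use hyper::server::conn::http1;
use hyper::service::service_fn;
use hyper::{body::Incoming, Request, Response};
use hyper_util::rt::TokioIo;
use tokio::net::TcpListener;

// Fixed response, analogous to the hand-rolled server's "Hello, world!".
async fn hello(_req: Request<Incoming>) -> Result<Response<Full<Bytes>>, Infallible> {
    Ok(Response::new(Full::new(Bytes::from_static(b"Hello, world!"))))
}

#[tokio::main]
async fn main() {
    let listener = TcpListener::bind("127.0.0.1:8080").await.unwrap();
    loop {
        let (stream, _) = listener.accept().await.unwrap();
        let io = TokioIo::new(stream);
        // One task per connection; hyper handles HTTP/1.1 keep-alive itself.
        tokio::spawn(async move {
            let _ = http1::Builder::new()
                .serve_connection(io, service_fn(hello))
                .await;
        });
    }
}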

Thanks guys, it sounds promising. However, having 400 concurrent connections would require 400+ running threads, which I haven't tested for my server yet. Certainly I need to look into non-blocking HTTP connections where one thread can service them all.

Not yet, but based on my past comparison between plain Hyper, Axum, and Actix, they are mostly comparable.

There is a mistake in my result above that I just remembered: I forgot that I made my Axum implementation single-threaded in yesterday's experiment :<

Setting it back to the multi-threaded runtime makes Axum's performance comparable now:

[root@localhost axum]# wrk -c 250 -d 15 -t 8 http://127.0.0.1:8080
Running 15s test @ http://127.0.0.1:8080
  8 threads and 250 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     4.50ms    2.05ms  36.43ms   70.96%
    Req/Sec     5.41k   569.34    11.44k    74.41%
  644716 requests in 15.09s, 79.93MB read
Requests/sec:  42723.13
Transfer/sec:      5.30MB

It has higher transfer/sec and more data read; what is the possible reason for these higher numbers?

Thank you, I'm about to try your number 2 suggestion to see how it turns out.

To clarify, you're saying the performance is better when using the current-thread runtime? Per Alice's comment, running the loop in a spawned task won't improve anything if you're using the current-thread runtime; it only helps if you're using the multi-threaded one.

No sir, I remembered that I had not set Axum back to the multi-threaded runtime :< because last time I was comparing the single-threaded runtime and the multi-threaded runtime.

My Axum jumps from 20k req/s to 42k req/s after setting it back to the multi-threaded runtime.
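
For reference, the difference between the two setups is just the runtime flavor picked by the attribute on main (a minimal sketch; which flavor wins depends on the workload):

// Multi-threaded runtime (the default): #[tokio::main] builds the runtime
// with Builder::new_multi_thread(), one worker thread per CPU core.
//
// The single-threaded experiment would instead use
// #[tokio::main(flavor = "current_thread")].
#[tokio::main]
async fn main() {
    println!("running on the multi-threaded Tokio runtime");
}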

Yeah sir, the number 2 technique improved the performance. Req/s increased and latency got faster:

[root@localhost]# wrk -c 250 -d 15 -t 8 http://127.0.0.1:8080
Running 15s test @ http://127.0.0.1:8080
  8 threads and 250 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.86ms    1.86ms  25.29ms   72.85%
    Req/Sec     6.28k   684.40     8.47k    73.84%
  746917 requests in 15.10s, 54.14MB read
Requests/sec:  49472.32
Transfer/sec:      3.59MB
[root@localhost]#

Although it still has lower data read and transfer/sec than Axum, do you know the possible cause of this?

My new code is like this:

#[tokio::main]
async fn main() {
    let listener = TcpListener::bind("127.0.0.1:8080").await.unwrap();
    println!("Server running on 127.0.0.1:8080");

    tokio::spawn(async move {
        loop {
            match listener.accept().await {
                Ok((socket, _)) => {
                    tokio::spawn(handle_connection(socket));
                }
                Err(e) => {
                    eprintln!("Failed to accept connection: {:?}", e);
                }
            }
        }
    });

    tokio::signal::ctrl_c().await.unwrap();
    println!("Shutting down.");
}

Is that what you were referring to?

I don't, and I don't care about these kinds of benchmarks so I'm unable and unwilling to assist further. I think your best bet is to iteratively add code to your implementation until it becomes what a hyper-based solution does to pinpoint where the "bottleneck" is. It might be how http parses headers and the like. It might be how hyper utilizes Bytes. I honestly don't know.
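
As one concrete step in that iterative approach, the hand-rolled parser could be swapped for the httparse crate, which hyper uses internally (a minimal sketch, not from this thread; the function name is made up for illustration):

use httparse::{Request, Status, EMPTY_HEADER};

// Parse a request head from a raw buffer; returns the method and the offset
// where the body starts, or None if the data is incomplete or malformed.
fn parse_with_httparse(raw: &[u8]) -> Option<(String, usize)> {
    let mut headers = [EMPTY_HEADER; 32];
    let mut req = Request::new(&mut headers);
    match req.parse(raw) {
        Ok(Status::Complete(body_offset)) => Some((req.method?.to_owned(), body_offset)),
        // Partial means "read more bytes and try again" -- something the
        // single-read-per-request server above never does.
        Ok(Status::Partial) | Err(_) => None,
    }
}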

Additionally, while this kind of thing can be useful for learning, it also shows how easy it is to incorrectly think a bespoke implementation of something will have better performance than a more general but well-tested library, because some optimizations are not obvious. This means that even for a pretty bare-bones application that doesn't need all the bells and whistles a "framework" like axum provides, there is a decent chance writing the code yourself will be premature optimization that further fails on the "optimization" part.

Yes.