Why is the complicated one faster?

I built a multithreaded web server for learning. When I tested its concurrency performance, I found it reasonably acceptable. However, when I tried to reproduce this in a simpler version, its performance dropped drastically, which is counter-intuitive to me.
The simplified version:

use std::net::{TcpListener, TcpStream};
use std::io::Write;
use threadpool::ThreadPool;

fn main() {
    let listener = TcpListener::bind("127.0.0.1:4231").unwrap();
    let pool = ThreadPool::new(10);
    for stream in listener.incoming() {
        let stream = stream.unwrap();
        // `move` is needed so the closure takes ownership of `stream`;
        // the pool's worker threads require a 'static closure.
        pool.execute(move || {
            handle_connection(stream);
        });
    }
}

fn handle_connection(mut stream: TcpStream) {
    // This string is never written out, so every response is header-only
    // (hence "Total data: 0 B" in the oha summaries below).
    let _contents = "{\"balance\": 0.00}";

    // Write a bare status line, then drop the connection without ever
    // reading the request.
    stream.write_all(b"HTTP/1.1 200 OK\r\n\r\n").unwrap();
}

And its performance (oha -m GET http://127.0.0.1:4231/ -n 4000):

Summary:
  Success rate: 7.40%
  Total:        0.2556 secs
  Slowest:      0.0176 secs
  Fastest:      0.0005 secs
  Average:      0.0019 secs
  Requests/sec: 15648.6258

  Total data:   0 B
  Size/request: 0 B
  Size/sec:     0 B

On the other hand, the result of the complicated version is:

Summary:
  Success rate: 99.98%
  Total:        0.1906 secs
  Slowest:      0.0266 secs
  Fastest:      0.0004 secs
  Average:      0.0023 secs
  Requests/sec: 20987.6472

  Total data:   0 B
  Size/request: 0 B
  Size/sec:     0 B

You haven't shown us the more complex version, so it's not possible to tell why this is.

There are other potential problems as well:

  • What is the "success rate"? It is drastically (more than 10-fold) lower in the simplified case; whatever is failing there may also be what limits throughput.
  • Are you building both test servers with optimizations enabled (cargo build --release)?
  • Are you actually measuring what you intended to measure? As it stands, this measures requests per second, which is influenced not only by raw throughput but also by latency whenever requests are serialized. Are you perhaps using async in the more complicated code, allowing the server to pipeline (partially parallelize) requests that are blocked while serving others, amortizing latency over several requests? Raw threading cannot do that once there are more simultaneous requests than physical CPU cores (see the sketch after this list).
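For illustration, an async variant of your simplified server might look roughly like the sketch below. This is only a sketch assuming the tokio crate (with its "full" feature set), not code from your repo:

use tokio::io::AsyncWriteExt;
use tokio::net::TcpListener;

#[tokio::main]
async fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:4231").await?;
    loop {
        let (mut socket, _) = listener.accept().await?;
        // Each connection becomes a lightweight task; the runtime can
        // interleave thousands of them across a handful of OS threads.
        tokio::spawn(async move {
            let _ = socket.write_all(b"HTTP/1.1 200 OK\r\n\r\n").await;
        });
    }
}

A blocked task yields its thread back to the runtime, which is what lets an async server amortize latency across many in-flight requests.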

Thank you for your timely reply.
The implementation of the server has been uploaded to GitHub, and a link to the repo is included in my original post; pasting the whole thing here would be redundant.
As for the measurement:
I used a tool called oha to make multiple requests from localhost. The success rate is calculated as the number of successful responses divided by the total number of requests, and both results were measured over 4000 requests. I've posted the full summary further down in this thread.

So far this implementation doesn't contain any async code, and I set the number of threads to exactly the number of my CPU cores.
I just started learning this language last week, so I apologize for any silly mistakes. XD

The simplified version has a pool of 10 threads; the complicated version has 4.
I suggest making them the same to see whether that accounts for the difference.
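If you'd rather not hard-code either number, the pool can be sized from the detected core count. A small sketch using only the standard library (std::thread::available_parallelism is stable since Rust 1.59):

use std::thread;
use threadpool::ThreadPool;

fn main() {
    // Query the number of logical cores, falling back to 4 if the
    // platform can't report it.
    let workers = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(4);
    let pool = ThreadPool::new(workers);
    // ... accept connections and call pool.execute(...) as before ...
    pool.join(); // wait for any queued jobs before exiting
}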

What tanks the success rate, i.e. what kinds of errors are you getting?
I see that the simple version is not reading the input at all, just spewing out a response and then dropping the connection.
I speculate that this might close the receiving end faster than the client can send its request, which would result in a client-side write error.
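If that's the cause, draining the request before replying should avoid it. An untested sketch of handle_connection along those lines, assuming GET requests with no body (so reading up to the blank line that ends the headers is enough):

use std::io::{BufRead, BufReader, Write};
use std::net::TcpStream;

fn handle_connection(mut stream: TcpStream) {
    // Read header lines until the empty line that terminates them, so
    // the client has finished sending before we reply and hang up.
    let reader = BufReader::new(&mut stream);
    for line in reader.lines() {
        match line {
            Ok(l) if l.is_empty() => break, // end of headers
            Ok(_) => continue,
            Err(_) => return, // client went away mid-request
        }
    }
    let _ = stream.write_all(b"HTTP/1.1 200 OK\r\n\r\n");
}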


I haven't had a chance to try this yet, but I believe it's quite a good point. The full summary is below, along with the error distribution:

oha http://127.0.0.1:4231
Summary:
  Success rate: 15.00%
  Total:        0.0328 secs
  Slowest:      0.0194 secs
  Fastest:      0.0005 secs
  Average:      0.0023 secs
  Requests/sec: 6096.7059

  Total data:   0 B
  Size/request: 0 B
  Size/sec:     0 B

Response time histogram:
  0.000 [1]  |■
  0.002 [25] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.004 [3]  |■■■
  0.006 [0]  |
  0.008 [0]  |
  0.010 [0]  |
  0.012 [0]  |
  0.014 [0]  |
  0.016 [0]  |
  0.018 [0]  |
  0.019 [1]  |■

Response time distribution:
  10.00% in 0.0010 secs
  25.00% in 0.0014 secs
  50.00% in 0.0018 secs
  75.00% in 0.0021 secs
  90.00% in 0.0029 secs
  95.00% in 0.0031 secs
  99.00% in 0.0194 secs
  99.90% in 0.0194 secs
  99.99% in 0.0194 secs


Details (average, fastest, slowest):
  DNS+dialup:   0.0017 secs, 0.0004 secs, 0.0193 secs
  DNS-lookup:   0.0000 secs, 0.0000 secs, 0.0001 secs

Status code distribution:
  [200] 30 responses

Error distribution:
  [79] connection error: An established connection was aborted by the software in your host machine. (os error 10053)
  [77] connection error: An existing connection was forcibly closed by the remote host. (os error 10054)
  [10] error reading a body from connection: An existing connection was forcibly closed by the remote host. (os error 10054)
  [2] error reading a body from connection: An established connection was aborted by the software in your host machine. (os error 10053)
  [1] An invalid argument was supplied. (os error 10022)
  [1] operation was canceled: received unexpected message from connection

Thanks a lot!

200 requests, only 30 responses 😥

Sure, it was 4 when I committed to GitHub, and I aligned both to 10 when testing. Thanks.