Getting tokio to match actix-web performance


#1

I wrote a terse HTTP webserver using tokio. It’s hard for it to be simpler (it cheats and doesn’t pay attention to request headers, and it doesn’t send a Date header back :stuck_out_tongue: ). And yet, actix-web is outperforming it by a decent margin in benchmarks. Here’s the whole code:
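(The code block itself didn’t survive here. As an illustrative stand-in for the kind of cheating handler described, the per-request logic might look roughly like this; the function name and exact shape are my guess, not the original code:)

```rust
// Sketch of a "cheating" handler: scan the buffer for the blank line that
// ends the request headers and, if found, emit a canned plaintext response
// without parsing anything (and without a Date header).
fn try_respond(buf: &[u8]) -> Option<Vec<u8>> {
    // Find the \r\n\r\n terminating the headers; None means "need more data".
    buf.windows(4).position(|w| w == b"\r\n\r\n")?;
    Some(
        b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\nContent-Length: 13\r\n\r\nHello, World!"
            .to_vec(),
    )
}
```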

The actix benchmark code from TechEmpower’s benchmark gets 50k req/s in my tests; my code gets 43k req/s. This is on cloud instances, 4096 keep-alive connections, wrk, 2 vCPUs per machine.

Can anyone think of where this difference could come from?

Because of the keep-alive connections, I don’t think any of the performance difference comes from how the listener is set up. So the fact that actix-web uses a different backlog setting, and possibly multiple listeners, doesn’t seem to matter (though I did try a few modifications of my code to test this and didn’t see much difference). So it must be something else. actix appears to mostly use mio directly rather than tokio’s runtime and reactor. Does using tokio have that much overhead?

Here’s the full text of the benchmark results:

## actix, no pipeline:

Running 15s test @ http://10.142.0.3:8080/plaintext
  2 threads and 4096 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    10.09ms  505.14us  23.99ms   70.69%
    Req/Sec    50.47k     2.10k   55.84k    65.33%
  Latency Distribution
     50%   10.15ms
     75%   10.46ms
     90%   10.63ms
     99%   11.09ms
  1506079 requests in 15.05s, 186.72MB read
  Socket errors: connect 3077, read 0, write 0, timeout 0
Requests/sec: 100059.44
Transfer/sec:     12.41MB


## tokio-raw, no pipeline:

Running 15s test @ http://10.142.0.3:8080/plaintext
  2 threads and 4096 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    11.52ms    7.57ms 237.87ms   63.74%
    Req/Sec    43.86k     3.55k   49.92k    62.67%
  Latency Distribution
     50%   11.40ms
     75%   17.27ms
     90%   21.29ms
     99%   23.96ms
  1309184 requests in 15.05s, 97.39MB read
  Socket errors: connect 3077, read 0, write 0, timeout 0
Requests/sec:  86962.04
Transfer/sec:      6.47MB

You can also see that the latency distribution for actix is flat, while there are a number of slower requests for my code. :frowning_face:

I’ve tested tokio-minihttp, hyper, and the HTTP example in the tokio repo. All perform even worse than my code, let alone actix.

TechEmpower’s benchmark confirms this: https://www.techempower.com/benchmarks/#section=data-r16&hw=cl&test=plaintext You can see that on Cloud hardware actix-raw outperforms tokio-minihttp. (NOTE: This is not the case in the Physical hardware benchmarks, but I suspect that’s because the hardware used in the Physical tests is more powerful, so the benchmarks hit the network bottleneck and you can’t see the difference in performance anymore.)

In practical terms, I’m not concerned about performance at these levels on artificial benchmarks. But I am curious as to why my exceedingly simple code is being outperformed by code that is doing an order of magnitude more work.

Thank you!


Combining work stealing and SO_REUSEPORT
#2

This is probably unrelated, but have you tried using a Vec<u8> as the buffer rather than an inline 4k array?


#3

You can try using the memchr crate to speed up the search for line terminators.
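For reference, the crate’s `memchr::memchr(needle, haystack)` is a SIMD-accelerated drop-in for the naive byte scan a hand-rolled parser typically does. A stdlib-only sketch of the scan it would replace (function name is mine):

```rust
// Naive scan for a line terminator; memchr::memchr(b'\n', buf) from the
// memchr crate is a drop-in, SIMD-accelerated replacement for position().
fn find_line_end(buf: &[u8]) -> Option<usize> {
    buf.iter().position(|&b| b == b'\n')
}
```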


#4

I did some basic benchmark analysis on tokio a few months ago. In short, I was unable to saturate all worker threads with h2load over localhost. Total RPS caps out at about 250K on my MacBook Pro, compared to actix-web at over 1 million RPS with fully saturated CPUs. I haven’t been able to explain the behavior. I suspect it’s a bottleneck in the listener/accept handler being unable to keep all workers busy.


#5

It looks like actix-web runs a separate single threaded reactor/executor on each http worker thread. Once an accepted connection is moved to a given worker thread (simple round robin), all I/O with that connection is done on that worker thread. This is the same model as nginx.

Tokio, on the other hand, by default runs futures on a threadpool, while I/O readiness is handled by a dedicated reactor thread. The problem with this setup is that with mostly I/O-bound workloads, as is the case in this thread, there’s a lot of cross-thread communication overhead. That’s likely the source of much of the perf difference.

I’d try to use an I/O model closer to what actix-web uses and then see how the results look.
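A stdlib-only sketch of that dispatch model, with accepted connections stubbed out as IDs so the shape is visible without real sockets (real code would hand each worker `TcpStream`s, and each worker would run its own single-threaded reactor):

```rust
use std::sync::mpsc;
use std::thread;

// Round-robin handoff of accepted "connections" (stubbed as IDs) to worker
// threads; each worker then owns its connections for their whole lifetime,
// which is the per-thread model actix-web (and nginx) use.
fn dispatch(conns: Vec<u32>, n_workers: usize) -> Vec<Vec<u32>> {
    let mut senders = Vec::new();
    let mut handles = Vec::new();
    for _ in 0..n_workers {
        let (tx, rx) = mpsc::channel::<u32>();
        senders.push(tx);
        // Each worker drains only its own queue; no cross-thread I/O handoff.
        handles.push(thread::spawn(move || rx.into_iter().collect::<Vec<_>>()));
    }
    for (i, conn) in conns.into_iter().enumerate() {
        senders[i % n_workers].send(conn).unwrap(); // simple round robin
    }
    drop(senders); // close the channels so the workers finish
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```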


#6

I haven’t tried Vec yet, but I swapped in BytesMut with various reserve sizes in the function that reads the socket, and that didn’t have any effect. Do you still think Vec might perform better? If so, what implementation are you thinking of (just initialize a fixed-size Vec and keep the code as-is, or extend and drain the Vec as lines come in and out)?

At least in my case both actix-web and my code pin the CPUs to 100%. But I’m using wrk to benchmark. It just opens a bunch of connections and keeps them alive for the duration of the test. So there’s no load on the listen/accept logic. Just request/response processing.

I’m still wrapping my head around Tokio (the code being constantly in flux and fragmented across many modules isn’t helping :stuck_out_tongue:). So if I understand correctly, with a threadpool of 2, Tokio would actually have 3 threads running: 2 workers and 1 handling I/O events and passing them to the workers? Man … that’s some nasty synchronization.

I did try an experiment where I spawned two separate Tokio runtimes with two separate listeners (SO_REUSEPORT, 1 thread each) and didn’t see any improvement. But your description would explain why: even in that setup, events still come in on a different thread from the workers. Not really how I expected Tokio to work…

By the way, if I/O-heavy workloads are pathological for Tokio … well, isn’t Tokio mostly going to be used for I/O-heavy workloads? What else would you need Tokio for?


#7

That was mostly just a side comment. One concern with an embedded/inline 4k array is that the cost of moving the host struct goes up, since every move has to memcpy all of it (assuming the optimizer doesn’t eliminate the move).
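The size difference behind that move cost is easy to see with `std::mem::size_of` (struct and field names here are mine, just for illustration):

```rust
use std::mem::size_of;

// A struct embedding the buffer inline is 4 KiB+, and every move of it
// memcpys the whole thing; with Vec<u8>, the struct holds only a small
// (ptr, cap, len) header and the heap allocation itself never moves.
#[allow(dead_code)]
struct InlineBuf {
    buf: [u8; 4096],
    len: usize,
}

#[allow(dead_code)]
struct HeapBuf {
    buf: Vec<u8>,
}
```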

Yeah, this landscape (along with the rest of async I/O) is very much still being painted with broad brushstrokes.

Essentially, yes. If you search for “tokio reform” you’ll see a bunch of discussions around the current default tokio setup.

They’re probably pathological for the default tokio setup, which uses a dedicated reactor thread plus a threadpool for execution. There’s a current_thread module in tokio where you can create a reactor + executor on the same thread (this was the only mode in the previous tokio version, tokio_core). Take a look at:

  1. https://docs.rs/tokio/0.1.7/tokio/runtime/current_thread/index.html
  2. https://docs.rs/tokio/0.1.7/tokio/executor/current_thread/index.html

The current_thread executor (#2 there) is what actix-web appears to use on its http worker threads: https://github.com/actix/actix-web/blob/master/src/server/worker.rs#L185. That spawned HttpChannel then does all subsequent I/O with the peer; you can browse the channel code in https://github.com/actix/actix-web/blob/master/src/server/channel.rs


#8

Really, how were you able to saturate your CPUs if the bottleneck is in a single thread handling I/O events?

Sorry for the confusion. I was also using TCP and HTTP keep-alives with h2load. I brought up multiple listen sockets because this load-balancing model is commonly configured so each thread gets its own I/O handler. The load-balancing topic is discussed in depth in this article: https://blog.cloudflare.com/the-sad-state-of-linux-socket-balancing/

Aside: the article points out that you’ll see higher worst-case latency for new connections with multiple accept queues, but better load distribution. You don’t want multiple listen sockets for the ability to handle more incoming connections, but for the better load balancing and therefore higher throughput.


#9

This sounds a lot like any HTTP server micro-benchmark comparison where the differences come down to threading and async vs. sync designs. For example, on Java: Tomcat vs. Jetty … (edit) though things have probably changed since I last cared about that. Typically the fully asynchronous/reduced-threads design wins on stability and memory utilization, but can lose out in certain micro-benchmarks on raw latency or throughput.


#10

That did the trick, @vitalyd! Thank you for the insights.

I’ve moved the code into a repo to track better: https://github.com/fpgaminer/rust-http-benchmarks

In that repo actix is the same code that TechEmpower uses, so it’s a good baseline for what actix-web can do (TechEmpower also has an actix-raw which performs better; I’ll have to bring that in later). tokio-1 is using default Tokio. tokio-2 is the new code that spawns two threads each with its own listener and CurrentThread tokio Runtime.

Running fresh benchmarks:

actix

./wrk -H 'Host: 10.142.0.3' -H "Accept: text/plain,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7" -H "Connection: keep-alive" --latency -d 15 -c 4096 --timeout 8 -t 2 http://10.142.0.3:8080/plaintext
Running 15s test @ http://10.142.0.3:8080/plaintext
  2 threads and 4096 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     7.55ms    1.28ms  13.26ms   85.59%
    Req/Sec    63.05k     2.47k   67.24k    63.00%
  Latency Distribution
     50%    7.91ms
     75%    8.07ms
     90%    8.28ms
     99%    9.38ms
  1886637 requests in 15.10s, 233.90MB read
  Socket errors: connect 3077, read 0, write 0, timeout 0
Requests/sec: 124935.35
Transfer/sec:     15.49MB

tokio-1: Using straightforward tokio

./wrk -H 'Host: 10.142.0.3' -H "Accept: text/plain,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7" -H "Connection: keep-alive" --latency -d 15 -c 4096 --timeout 8 -t 2 http://10.142.0.3:8080/plaintext
Running 15s test @ http://10.142.0.3:8080/plaintext
  2 threads and 4096 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     8.13ms    6.07ms  30.57ms   48.92%
    Req/Sec    60.03k     2.40k   64.43k    78.33%
  Latency Distribution
     50%    7.33ms
     75%   14.59ms
     90%   16.35ms
     99%   17.36ms
  1794082 requests in 15.07s, 133.46MB read
  Socket errors: connect 3077, read 0, write 0, timeout 0
Requests/sec: 119045.05
Transfer/sec:      8.86MB

tokio-2: Using two threads each with its own CurrentThread tokio Runtime

./wrk -H 'Host: 10.142.0.3' -H "Accept: text/plain,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7" -H "Connection: keep-alive" --latency -d 15 -c 4096 --timeout 8 -t 2 http://10.142.0.3:8080/plaintext
Running 15s test @ http://10.142.0.3:8080/plaintext
  2 threads and 4096 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     4.05ms    1.36ms  22.09ms   82.34%
    Req/Sec    68.96k     4.02k   73.78k    85.14%
  Latency Distribution
     50%    3.54ms
     75%    4.14ms
     90%    6.65ms
     99%    7.89ms
  2058064 requests in 15.05s, 153.09MB read
  Socket errors: connect 3077, read 0, write 0, timeout 0
Requests/sec: 136732.81
Transfer/sec:     10.17MB

So the CurrentThread approach gives us higher throughput and better latency than actix. Nice. (Of course, again, my code is cheating and doing no real parsing or response generation, but at least the benchmarks now reflect that cheating rather than my code being slower than actix-web.)


#11

This is a case where the benchmarks are actually harmful. Benchmarks are largely marketing: “look, this framework is fastest”. Unfortunately, the benchmarks aren’t realistic; they show off performance in an ideal world where everything is perfectly fair. Since all the connections are created at the start, and all of them produce exactly the same load, an execution strategy that simply divides the connections evenly among threads and pins them there will perform best, since no further synchronization is required.

In the real world, connections and requests should not be treated equally. Connections come and go every second. Some connections show up to submit one request that requires a lot of work (load a big list from the db, then filter and compute before returning it as JSON), others show up with one super fast request (like a heartbeat check), and others stick around longer, asking for a mix of heavy and light requests.

In these circumstances, a strategy that simply round-robins connections to threads can easily end up with some threads badly overloaded while others sit near idle. When a heartbeat request comes along and, because of round robin, is assigned to a thread that already has several heavy requests in progress, the heartbeat might take a few seconds to get a response, suggesting something is wrong even though other threads are near idle.

When using tokio’s default runtime, you instead have a work-stealing executor. As threads finish their lighter loads, they look at the overloaded threads’ queues and start stealing work, meaning your server makes better use of its resources.
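The effect is easy to sketch with a single shared queue standing in for tokio’s (more sophisticated) per-thread work-stealing deques: a worker that finishes its cheap tasks keeps pulling from the common pool instead of idling behind a pinned assignment. Everything below (task costs as spin counts, function names) is illustrative, not tokio’s actual machinery:

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};
use std::thread;

// Shared-queue approximation of work stealing: any idle worker pulls the
// next task, so one expensive task can't starve the cheap tasks queued
// behind it the way a fixed round-robin assignment can.
fn run_pool(tasks: Vec<u64>, n_workers: usize) -> u64 {
    let queue = Arc::new(Mutex::new(tasks.into_iter().collect::<VecDeque<u64>>()));
    let done = Arc::new(Mutex::new(0u64));
    let mut handles = Vec::new();
    for _ in 0..n_workers {
        let (queue, done) = (Arc::clone(&queue), Arc::clone(&done));
        handles.push(thread::spawn(move || loop {
            // Lock only long enough to grab the next task.
            let task = queue.lock().unwrap().pop_front();
            match task {
                Some(cost) => {
                    // Simulate a request costing `cost` units of work.
                    let work: u64 = (0..cost).sum();
                    let _ = work;
                    *done.lock().unwrap() += 1;
                }
                None => break, // queue drained; worker exits
            }
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    let n = *done.lock().unwrap();
    n
}
```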


#12

Of relevance, just noted today via @jonhoo’s twitter, the tokio_io_pool crate

https://docs.rs/tokio-io-pool/0.1.1/tokio_io_pool/


#13

I’m not sure this is a real world vs. benchmark situation; it’s just a YMMV case depending on your application. The good thing is tokio gives you the options to pick from accordingly.

It also shows the amount of overhead involved in the threadpool model (whether inherent or incidental to the implementation).