Posting here in case somebody can explain the finer nuances of Actix CPU efficiency and per-request memory usage as the number of concurrent requests rises (or share any other thoughts; I hope this is fine on URLO).
For the CPU efficiency question, is it possible CPU frequency scaling is at play? No clever ideas for the per-request memory usage behavior, but it's pretty interesting.
So much detail there I don't know what to make of it.
I notice though that your commentary and graphs etc go to 1024 simultaneous connections.
In my naivety this seems like small fry. Tackling 10,000 connections was already a thing in 1999, when we had much smaller, slower machines: https://en.wikipedia.org/wiki/C10k_problem
That C10k problem is the reason why Rust has invested so much effort into async programming.
So, in my naivety, it looks like you are two or three orders of magnitude short of pushing what we expect of modern machines to the limit.
Or what am I missing here?
Good point. CPU frequency scaling (or boosting) is indeed employed even on GCP:
N2D machine types run on AMD EPYC Rome processors with a base frequency of 2.25 GHz, an effective frequency of 2.7 GHz, and a max boost frequency of 3.3 GHz.
If I understand it correctly, frequency boosting should have the following effect: when the load is low (that is, 1 to 4 connections, where actix does not yet saturate a single CPU core), CPU time per request in milliseconds should be lower, because the CPU can boost its frequency and execute more cycles per millisecond.
However, on the graph we see the opposite effect: CPU time per request is highest for 1 parallel connection. A mystery to me.
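To put rough numbers on that expectation, here is a minimal sketch. The function name and the 300,000-cycle request cost are made up for illustration; only the 2.25 GHz and 3.3 GHz figures come from the quote above.

```rust
/// CPU time in milliseconds needed to execute `cycles` clock cycles
/// at a clock frequency of `ghz` GHz.
/// (Hypothetical helper; assumes a request costs a fixed number of cycles.)
fn cpu_time_ms(cycles: f64, ghz: f64) -> f64 {
    // 1 GHz = 1e9 cycles per second = 1e6 cycles per millisecond.
    cycles / (ghz * 1e6)
}
```

Under these assumptions, a request costing 300,000 cycles would take about 0.133 ms at the 2.25 GHz base frequency and about 0.091 ms at the 3.3 GHz max boost, so a lightly loaded core should show lower CPU time per request, not higher.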
The "C10k problem" is solved at a different level in the microservice architecture: the actual microservice instances are shielded by a load balancer (potentially running on dedicated hardware), which distributes incoming requests to an ever-changing number of microservice instances. Load balancers also usually aggregate many client connections into a single connection, or a few connections, to the microservice. When the load is higher, the number of instances can be raised (this is called scaling horizontally). The metric to optimise in microservices is thus requests per second, or rather requests per second per unit of available resources.
Also note that the tested service deliberately has access to only 1.5 CPU cores, which is much less than a typical modern machine offers. Actix TechEmpower benchmarks yield e.g. 650,000 requests/s (at 512 connections) per 28 hyperthreads = ~23,000 req/s per core (counting each hyperthread as a core), comparable to my benchmark's ~7,300 req/s per core.
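For anyone who wants to redo the normalisation, a small sketch of the arithmetic follows. The function name is illustrative, and treating one hyperthread as one core is the same simplification as in the figures above.

```rust
/// Requests per second per unit of CPU resource (core or hyperthread).
/// (Hypothetical helper for the back-of-the-envelope comparison.)
fn req_per_s_per_unit(total_req_per_s: f64, units: f64) -> f64 {
    total_req_per_s / units
}
```

With the TechEmpower figures, `req_per_s_per_unit(650_000.0, 28.0)` comes to roughly 23,200, which is a bit over three times the ~7,300 req/s per core measured here.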
The Pentium Dual-Core brand appeared in 2007, so in 1999 I'm pretty sure most of us lived with single-core machines. And they served about 10K concurrent connections pretty well on a single machine without a load balancer. The number of concurrent connections means the number of connections being processed at a single instant. Most HTTP connections don't last more than a second, so it would be far less than the req/s number.
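The relation between the two numbers is Little's law: average concurrency equals throughput times average time spent per connection. A minimal sketch, with a hypothetical function name and made-up example figures:

```rust
/// Little's law: average number of in-flight connections equals the
/// arrival rate (req/s) multiplied by the average time each connection
/// stays open (seconds). Assumes a steady state.
fn avg_concurrent_connections(req_per_s: f64, avg_duration_s: f64) -> f64 {
    req_per_s * avg_duration_s
}
```

For example, 10,000 req/s with connections lasting 50 ms on average gives roughly 500 concurrent connections. Concurrency only reaches the req/s number when connections stay open for a full second on average.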
Sorry yes, I have started to talk about something somewhat different but related to your theme.
To a first approximation, I interpret the C10k problem as asking: how do we even maintain thousands/millions of connections on a single machine and wait on input from them at all? Never mind doing any actual work to fulfill the requests. That's before we start measuring performance in requests/second or whatever.
Previously it was common to fire up threads or even processes to handle connections and have them waiting on blocking I/O. This does not scale to thousands/millions of connections what with all the memory consumed by even starting a thread and all the context switching going on.
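For illustration, here is roughly what that older thread-per-connection model looks like using only the Rust standard library. This is a sketch of the pattern, not code from the benchmark under discussion; the function names and the echo behaviour are made up for the example.

```rust
use std::io::{Read, Write};
use std::net::{SocketAddr, TcpListener, TcpStream};
use std::thread;

/// Echo bytes back to the client until it closes the connection.
fn handle_client(mut stream: TcpStream) -> std::io::Result<()> {
    let mut buf = [0u8; 1024];
    loop {
        // Blocking read: this thread sleeps here until data arrives.
        let n = stream.read(&mut buf)?;
        if n == 0 {
            return Ok(()); // peer closed the connection
        }
        stream.write_all(&buf[..n])?;
    }
}

/// Start an echo server on an OS-assigned loopback port and return its
/// address. Every accepted connection gets its own OS thread.
fn spawn_echo_server() -> std::io::Result<SocketAddr> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;
    thread::spawn(move || {
        for stream in listener.incoming() {
            if let Ok(stream) = stream {
                // One thread per connection: each one needs its own stack
                // and is scheduled by the kernel, which is what stops
                // scaling at thousands of mostly idle connections.
                thread::spawn(move || {
                    let _ = handle_client(stream);
                });
            }
        }
    });
    Ok(addr)
}
```

Every idle connection here ties up a whole thread parked in `read`. Rust's standard library currently reserves a 2 MiB stack per spawned thread by default (mostly virtual memory, but it adds up), whereas an async runtime keeps only a small state machine per connection and multiplexes them over a few threads.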