Combining work stealing and SO_REUSEPORT


#1

I was reading the thread about the difference in cpu-core saturation between Tokio and Actix implementations of a webserver here.
@vitalyd came with the following explanation.
A rough resume is that Actix’ workers each accept clients on the same bound address. Eventually this results in better distribution of work across cores at the cost of higher avg/max latency. Detailed explanation here, which was linked by @parasyte .

As i read that article i figured work-stealing might help reduce max latency for requests on blocked workers. Is this possible to build on top of current Tokio components?

At first i figured manually starting new threads with a current_thread reactor and -executor. Lacking only the work stealing component between these threads.
Tokio itself has a threadpool executor which implements work stealing, but it’s behaviour is not a perfect match. It’s desired that each reactor should always assign tasks to the same thread, not a random thread within an attached pool.
According to the Tokio runtime docs an executor instance also runs on top of a reactor. Given I use the existing threadpool implementation, this concept turns upside down. The intention, after all, would be to run a separate reactor for each thread in the pool.

So to make my question (see above) more concrete; Can the Tokio threadpool component be used underneath the Tokio reactor somehow?
Is this a correct approach? Am i overcomplicating this, because the desired functionality can actually be implemented using (most of) the current Tokio modules?


#2

I don’t see that in actix-web. What I did see was a single accept thread, which then moves the TcpStream to a worker thread; the worker thread does the I/O against that stream (i.e. read/write requests and responses). There’s no SO_REUSEPORT that I saw.

The accept loop, written using mio, is https://github.com/actix/actix-web/blob/master/src/server/srv.rs#L829-L877


#3

Ok, thanks for the clarification.
Seems like i created an entire story which isn’t grounded in truth inside my head, for which i am sorry.
This issue has become hypothetical, but i’m still interested in a possible outcome.

I’d like to explain the context for why i wanted to try this approach.
I’m about to build a multithreaded RPC server. Connections aren’t short lived and i’d gladly support more clients while sacrificing avg latency.
I’m also interested in using an actor model. In this sense the RPC server is similar to Actix-web;

  • Each connection is represented by an actor
  • Each actor has a state, which is only accessed by exactly one thread (the bound thread)

By stealing work from another thread the entire actor state is stolen as well. I’m not considering latency cost of copying this state because the system would balance itself (through an additional constraints of max accepted clients per thread).

I’m guessing the solution would eventually lie within the direction of a specifically tailored system, so building a crate out of this might be hard.
I’d like to know other peoples experiences, maybe there is already such a thing?


#4

Are you concerned that naive assignment of a connection to some worker thread may create a load imbalance? Or that load imbalance may arise as some worker thread ends up having heavier clients?

What type of work are your worker threads going to be doing? What type of load are you thinking of sustaining?

Moving load across threads is non-trivial to get right because it’s all reactive and heuristical - you notice a load imbalance after it’s already there, and attempt to redistribute. Right after you do that, the load may shift again. You can easily spiral out of control here and I suspect it’s quite hard, if not impossible, to get this perfectly right unless you happen to know something about your clients’ patterns.

Also, if your actors are movable across threads, then you’ll need to be Send which precludes use of some types that you may want/need. And this is even if you never end up needing to move them. But maybe this isn’t a concern.

My suggestion would be to not worry about this right now. I’d probably pick a design similar to actix-web (ie accept thread moving conns to workers that handle I/O and execution of requests) as it’s not too complex and is a good baseline setup.


#5

At this point i’m not concerned about anything, my intention is to experiment with this approach. The promise is to balance work more evenly across all available cpu-cores, depending on how the kernel alternates accept calls between workers. This should then result in effectively handling more clients on the same machine.

Edit: According to the next quote, my expectation is wrong; it’s possible to handle more clients at the same time. This is NOT the same as handling more clients in total. This leads to the possibility of handling more clients under specific circumstances.

There will be minimal blocking, mostly because of logging and limited shared state access.
Database reads/writes, which will block, are ‘offloaded’ to other threads to simulate async. I have actually no idea about the impact of additional threads for blocking database operations.

Correct, it’s a post incident approach to rebalancing.

Actor state is required to be Send + 'static while using Tokio::runtime (reactor thread + threadpool executor). Changing the underlying runtime into the described approach would not require changes to the upper implementation, i suspect.

I also remembered this post from Tokio, mentioning that moving the entire event loop to an other thread is often cheaper (i suppose lower latency). This is somewhat into the direction of what i’m trying to achieve, but it moves all other tasks onto another thread instead of redistributing them over the other workers. It’s also backwards in the sense that a blocking operation explicitly executes this behaviour instead of allowing the other workers to share in the load.
It peeked my interest anyway and decided to do a deep dive into the Tokio -executor, -reactor and -threadpool crates, to find out how this (it all) works in detail. The deep dive was inevitable, now that i think about it…
I think i can cook up something soon.

Thanks for the replies!


#6

FWIW, actix-web uses a dedicated arbiter (actor thread) for serializing database requests. This is the no-frills common actor setup for handling shared resources. You can see it at work in the Diesel example.


#7

I’m not sure SO_REUSEPORT will guarantee better load distribution. AFAIK, for tcp it uses the 4 tuple to hash the incoming connections to a particular listener.

My understanding is that this option is mostly an optimization for servers seeing a flood of new connections - that’s not a typical server; most servers are bottlenecked on servicing the connections, not accepting them.

The other potential benefit of it is if it works in conjunction with NIC level packet (flow) steering. If the connection hashing and packet steering are aligned, then it’s possible to create a setup where a new connection is steered to a particular listener on a given cpu, and then all subsequent packets are steered to it as well if the NIC is using the same hashing. This can ensure that a packet bound for a listener running cpu X is never seen by any other cpu from start to finish (assuming the listeners are also affinitized to cpus so kernel doesn’t migrate them around).

Anyway, if you’re seeing packet rates where things like the above really matter, then you may be at a place where considering kernel bypass networking is the right solution and then this is moot.

It’d be interesting to experiment with this though. IIRC, the seastar guys saw poor load distribution with SO_REUSEPORT and turned it off.

Send is not a requirement if you use the current_thread reactor + executor, however. But sure, this may not matter either way.