I am trying to build a connection-oriented protocol on top of UDP (similar to QUIC), using Tokio as the async runtime. After completing a handshake, I instantiate a `Connection`; this spawns `n` receive tasks via `spawn` calls. Each receive task gets its own copy of a `UdpSocket` with `SO_REUSEPORT` set, bound and connected. The receive task calls `socket.recv(&mut buf).await` and passes the bytes on via mpsc channels for further processing. For testing, I set up two client sessions connected to one server. Each client sends around ~10 MB of data before sleeping and resuming. Without building in release mode, I get around ~1 GB/s total sending over the loopback interface; with release mode, I get ~3 GB/s. I want better throughput and I'm stumped on how to achieve it.
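For reference, here is roughly how the receive path is wired up. This is a minimal sketch with illustrative names rather than my exact code, and it assumes the `socket2` crate is used to set `SO_REUSEPORT` before handing the socket to Tokio:

```rust
use std::net::SocketAddr;

use socket2::{Domain, Protocol, Socket, Type};
use tokio::net::UdpSocket;
use tokio::sync::mpsc;

// Create one SO_REUSEPORT socket, bound to the shared local address and
// connected to the peer, then wrap it in a tokio UdpSocket.
// Note: UdpSocket::from_std must be called from inside the runtime.
fn bind_reuseport(local: SocketAddr, peer: SocketAddr) -> std::io::Result<UdpSocket> {
    let sock = Socket::new(Domain::for_address(local), Type::DGRAM, Some(Protocol::UDP))?;
    sock.set_reuse_port(true)?;   // lets all n sockets bind the same port
    sock.set_nonblocking(true)?;  // required before handing it to tokio
    sock.bind(&local.into())?;
    sock.connect(&peer.into())?;
    UdpSocket::from_std(sock.into())
}

// Spawn n receive tasks; each one owns its own socket and a clone of the
// mpsc sender used to pass datagrams to the processing stage.
fn spawn_receivers(
    n: usize,
    local: SocketAddr,
    peer: SocketAddr,
    tx: mpsc::Sender<Vec<u8>>,
) -> std::io::Result<()> {
    for _ in 0..n {
        let socket = bind_reuseport(local, peer)?;
        let tx = tx.clone();
        tokio::spawn(async move {
            let mut buf = vec![0u8; 64 * 1024];
            loop {
                match socket.recv(&mut buf).await {
                    Ok(len) => {
                        // Copy the datagram out and forward it for processing.
                        if tx.send(buf[..len].to_vec()).await.is_err() {
                            break; // processing side dropped the channel
                        }
                    }
                    Err(_) => break,
                }
            }
        });
    }
    Ok(())
}
```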
I keep `n` less than or equal to the number of cores on my machine. To my surprise, increasing `n` does not give me an increase in throughput. In fact, if I set `n` to exactly the number of cores, the sockets appear unable to receive any more datagrams and hang. (I'll need to build a reliable protocol on top of this anyway.)
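For completeness, `n` is capped at the core count reported by the OS; a minimal sketch (the `requested` parameter name is illustrative):

```rust
// Sketch: cap the number of receive tasks at the number of cores the OS reports.
fn receive_task_count(requested: usize) -> usize {
    let cores = std::thread::available_parallelism()
        .map(|p| p.get())
        .unwrap_or(1);
    requested.min(cores)
}
```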
I profiled my run and the flamegraph is attached below. Notable mentions: 16% of the CPU was spent in the `recv` syscall and 10% in serialisation-related paths, but 31% was spent in `tokio::runtime::scheduler::multi_thread::worker::Context::park_timeout`. I'm not sure if this is normal, but it seems to be taking a lot of CPU cycles. What could contribute to this?
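For context, the flamegraph path above shows this is the multi-thread scheduler; the runtime is set up roughly like the explicit builder below (an illustrative sketch, not my exact configuration; the `worker_threads` value is an assumption):

```rust
// Illustrative sketch of the runtime setup, roughly what #[tokio::main] does.
// The worker_threads value is an assumption for illustration only.
fn main() -> std::io::Result<()> {
    let rt = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(8) // defaults to the number of cores when omitted
        .enable_all()      // I/O + time drivers, needed for UdpSocket and sleep
        .build()?;

    rt.block_on(async {
        // ... handshake, Connection setup, test clients ...
    });
    Ok(())
}
```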
(Run on an Apple silicon Mac.)