[Tokio] How do I improve throughput using UdpSocket?

I am trying to build a connection-oriented protocol on top of UDP (similar to QUIC), using Tokio as the async runtime. After completing a handshake, I instantiate a Connection, which starts n receive threads via spawn calls. Each receive thread gets its own UdpSocket with SO_REUSEPORT set, bound and connected. The receive thread calls socket.recv(&mut buf).await and forwards the bytes over mpsc channels for further processing.

For testing, I set up two client sessions connected to one server. Each client sends about 10 MB of data, sleeps, and then resumes. Over the loopback interface I get about 1 gb/s in total with a debug build and about 3 gb/s with a release build. I want better throughput and I'm stumped on how to achieve it.

I keep n less than or equal to the number of cores on my machine. To my surprise, increasing n does not increase throughput. In fact, if I set n to exactly the number of cores, the sockets appear unable to receive any more datagrams and hang. (I'll need to build a reliable protocol on top of this anyway.)

I profiled a run and the flamegraph is attached below. Notably, 16% of the CPU time was spent in the recv syscall and 10% in serialisation-related paths, but 31% was spent in tokio::runtime::scheduler::multi_thread::worker::Context::park_timeout. I'm not sure if this is normal, but it seems to be taking a lot of CPU cycles. What could contribute to this?

(Run on an Apple silicon Mac.)

This is not normal behavior and probably indicates some kind of bug in your program. I would ask "perhaps you are incorrectly using std::net::UdpSocket instead of tokio::net::UdpSocket?" but I can see from the flamegraph that that is not the problem.

The profiler cannot distinguish between a function that is actually computing and a function that has blocked. Judging by the name, park_timeout() is where Tokio worker threads block when they have no work to do (no async task is ready to run). Seeing the Tokio worker threads sitting in park_timeout() is not a sign of wasted cycles; it's a sign of idleness, of spare capacity.
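To illustrate the difference between time spent parked and time spent computing, here is a minimal sketch using only the standard library. It is not Tokio's internal parking code: std::thread::park_timeout stands in for it, and measure_parked_time is a hypothetical helper named for this example. The worker has nothing to do until a job arrives, so nearly all of its wall-clock time is spent blocked inside park_timeout, which is what a profiler that counts wall-clock samples would attribute to that function.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical helper: starts a worker that parks until a job arrives,
/// keeps it idle for `idle_for`, and returns how long the worker waited
/// (wall-clock time, almost all of it spent blocked in park_timeout).
fn measure_parked_time(idle_for: Duration) -> Duration {
    let (tx, rx) = mpsc::channel::<u32>();

    let worker = thread::spawn(move || {
        let start = Instant::now();
        loop {
            match rx.try_recv() {
                // A job arrived: stop waiting.
                Ok(_job) => break,
                // Nothing to do: block until unparked or the timeout expires.
                // park_timeout may also wake spuriously, hence the loop.
                Err(mpsc::TryRecvError::Empty) => {
                    thread::park_timeout(Duration::from_millis(500));
                }
                // The sender is gone: nothing will ever arrive.
                Err(mpsc::TryRecvError::Disconnected) => break,
            }
        }
        start.elapsed()
    });

    // Keep the worker idle, then hand it a job and wake it up.
    thread::sleep(idle_for);
    tx.send(1).expect("worker dropped the receiver");
    worker.thread().unpark();

    worker.join().expect("worker panicked")
}
```

Calling measure_parked_time(Duration::from_millis(200)) should return a duration close to 200 ms even though the worker did no useful work in that time. A large share of samples in a parking function therefore points to a thread waiting for work rather than burning cycles.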