Buffer pooling for async IO?

Hi, looking at the Tokio's AsyncRead trait, I'm wondering how to use it at scale.

I had to build a Websocket server and wondered how efficient it was. I remembered a golang talk about handling milions of websockets and how not having one buffer per connection tremendously reduces the memory footprint of the server.

From what I understand basic use of something implementing AyncWrite would look like this:

use tokio::io::AsyncReadExt;

async {
    let reader: impl AsynRead = ...;
    let mut buf = vec![0; buf_size];
    let written = reader.read(&mut buf).await?;
    // Do something with buf[..written]
}

The issue is that if you have a large number of such tasks (say n), you now have n*buf_size memory allocations. So either you make buf_size really small, which would be detrimental to performance when data arrives or you need a lot of ram.
However, given the way AsyncRead is designed, it should accept the buffer changing on each poll_read call, so we could very well have in AsyncReadExt a read_pooled<'a>(&'a mut self, &'a Pool) -> ReadPooled<a, Self>`.

ReadPooled would implement Future<Output = io::Result<Vec<u8>>> (or with one of bytes's structures) with something like this:

impl<S: AsyncRead> Future for ReadPooled<'_,S> {
    fn poll(self: Pin<&mut self>, cx: &mut Context<'_>) -> Poll<io::Result<Vec<u8>>> {
        // Borrow a reader from the Pool.
        // Depending on the pool, this gives some wrapper type around a form mutable reference to a buffer, maybe a `MutexGuard<Vec<u8>>` or something
        // On drop it is returned to the pool for reuse in the next `poll` call of this Future or any other that uses the same pool
        let buf = self.pool.borrow();
        match self.stream.poll_read(&mut buf) {
            Pending => return Pending, //The buffer is returned to the pool
            Ready(Err(e)) => return Ready(Err(e)),
            Ready(Ok(()) => return Ready(Ok(buf.take())) // buf.take takes ownership of the buffer away from the Pool (which will likely just reallocate a new buffer to replace it).
        }
    }
} 

There are some subtleties with Pin which I ignored but I don't think would be a problem. The idea is that you would use the same Pool in many tasks. The pool itself can do with only a few buffers (maybe as many as you have executor threads), given that very few tasks will be in poll at the same time.

In my case with websockets, tokio-tungstenite doesn't seem to do anything like this (from what I understand it uses 4096 bytes read buffers for each socket) so I expect memory usage to be quite high with a lot of websocket connections open.

I don't seem to find anything for this kind of pattern. You'd probably need a type (or trait) for Pool. While there had been mentions of pooling for the bytes crate, nothing seems to have come out of it. Is there something I'm missing on why would such a pattern not be used for cases where a lot of parallel connections are used concurrently?

I've seen people do this using Tokio's try_read family of functions, but pooling with poll_read usage would also be possible. I haven't seen any crates for something like this.

In Windows, async operations need the (same!) buffer available during the entire read, so IIRC tokio needs to internally buffer, and copy out on a Ready poll to satisfy AsyncRead being able to do this buffer-swapping on poll? It would be nice to just have a natively buffer returning API to skip that if so.