Inline vs box for fixed size packets?

pub struct A {
  len: usize,
  data: [u8; 1300]
}

pub struct B {
  len: usize,
  data: Box<[u8; 1300]>
}

When dealing with lots of packets, is either preferred? A appears to have the flaw of "probably lots of moving / copying around"; B appears to have the flaw of "lots of small Box's fragmenting the heap; cost of allocating / dropping them"

Moving around isn't all that bad, and you can often avoid copying. Can you give some more context as to where you're using it and what sort of operations would you be doing on it?

Do most UDP servers pick one over the other ?

Well, I don't know what most do. But let's look at it this way:

  • A UDP server receives packets (which I assume have a in-wire representation of either A or B).
  • You probably want to parse the packet - this doesn't require you to copy anything (pnet handles it quite well over any slice). So A will give better cache performance.
  • You may want to store the packet. B is better for that since you are effectively storing the data out-of-line, making iteration over the entire storage more cache-friendly
  • For replying, again, A is better since you won't need copying, and can create a packet on the stack, populate it properly (using pnet perhaps) and then send it off
    So my advice would be use A for manipulation, parsing and responding while B is better for storage/caching.
2 Likes

I do not know your experience with high performance UDP servers in Rust.
Mine is 0. This is my first serious attempt.

I suspect there is a solution out there that combines the strengths of both approaches. In a very hand wavy manner, it would do something like this:

  • use some type of slab allocator to provide a buffer of packets [should be as cheap as a malloc + zero]

  • provides Boxish, which is like a Box, except, when we allocate a Boxish<Packet>, it pops one from the pool, and when we drop it, it pushes it back to the pool

  • this has the advantage that instead of moving/copying 1300 bytes around, we just move/copy a Boxish ptr, which is 8-16 bytes; the cost of alloc/drop is also just a vec pop/push

If anyone has built something like this, I would love to hear about the battle scars.

One issue is that Rust isn't great at avoiding copies when moving values. For instance, consider this function, which passes A to a function pointer by value:

pub fn pass(f: fn(A), a: A) {
    f(a)
}

As seen in Godbolt, it generates a memcpy to move its argument:

example::pass:
        sub     rsp, 1320
        mov     qword ptr [rsp], rdi
        lea     rdi, [rsp + 8]
        mov     edx, 1312
        call    memcpy@PLT
        mov     rax, qword ptr [rsp]
        lea     rdi, [rsp + 8]
        call    rax
        add     rsp, 1320
        ret

You can easily solve this by pooling your boxes. It's probably even a good thing to limit the number of in-flight packets you're processing at once anyway.

The savings from not copying around 1308-byte objects is probably well worth it.

I can definitely bound the # of packets at compile time.

Can you point me at sample code / crate that does this well (and plays well with multi threading / async io) ?

Here's an outline for how you can pool your boxes:

  1. Create an ArrayQueue for storing boxes in the pool with a size equal to the bound on the number of packets.
  2. Also create a Semaphore to wait for boxes to be returned when the bound is reached.
  3. To acquire a box from the pool, you first acquire a permit from the semaphore and then immediately forget the permit. You then pop an item from the ArrayQueue, allocating a new box if it's empty.
  4. The boxes you return should be wrapped in a custom struct with a destructor that returns the box when its no longer needed. To do this, first push it to the ArrayQueue, then add a permit to the semaphore.

Since neither ArrayQueue or Semaphore will allocate memory after they are created, this lets you reuse boxes without touching the allocator. The creation of a new box in step 3 happens only when you need more boxes than you have ever needed before, and you could also write it to allocate all of them up front when constructing the pool.

6 Likes
  1. Are the contents of the Box contiguous in memory? I.e. suppose we have 1,000,000 Box<[u32; 1024]>, does this create 1,000,000 small objects on the heap or does it allocate 1GB of memory and chop it up into 1_000_000 parts ?

  2. This might sound entitled: Is there not an existing crate for this? This seems like the type of primitive that many high networking systems would depend on, and thus it seems like there should be a well established / debugged solution.

Could the pass example be fixed by making callees responsible for the memcpy, at least with Rust ABI? Then the generated code could use the value in place and only really move it if necessary.

The only real way to do that is to pass a Box or &mut reference. Due to the differences in copy/move semantics, Rust doesn't really have an equivalent to C++'s && rvalue references.

Each box is a separate heap allocation. To have them in one heap allocation, you could use e.g. BytesMut from the bytes crate instead.

As for whether there's a crate, I don't know of one.