BufferPool (fast pre-allocated buffers)

Hi,

I would like to have some feedback on BufferPool

I developed BufferPool to replace the system allocator in some particular use cases, from the
benches it seem to behave well being at max 4.3 time faster than a naive solution.
There are some unsafe parts in the code so the crate is fuzzy tested (I ran them for several hours
and everything seems ok)

I would appreciate some feedback in:

  1. unsafe parts, how dangerous they are and if there are solution to remove them.
  2. the overall approach.
  3. if there are similar crates, or crates that uses a different approach to achieve the same
    goal so that I can put them in the readme.

Below an extract of the readme:

Intro

BufferPool is useful whenever we need to work with buffers sequentially (fill a buffer, get
the filled buffer, fill a new buffer, get the filled buffer, and so on).

To fill a buffer BufferPool returns an &mut [u8] with the requested len (the filling part).
When the buffer is filled and the owner needs to be changed, BufferPool returns a Slice that
implements Send and AsMut<u8> (the get part).

BufferPool pre-allocates a user-defined capacity in the heap and use it to allocate the buffers,
when a Slice is dropped BufferPool acknowledge it and reuse the freed space in the pre-allocated
memory.

Implementation

The crate is [no_std] and lock-free, so, to synchronize the state of the pre-allocated memory
(taken or free) between different contexts an AtomicU8 is used.
Each bit of the u8 represent a memory slot, if the bit is 0 the memory slot is free if is
1 is taken. Whenever BufferPool creates a Slice a bit is set to 1 and whenever a Slice is
dropped a bit is set to 0.

Use case

BufferPool has been developed to be used in proxies with thousand of connection, each connection
must parse a particular data format via a decoder each decoder use 1 or 2 buffer for each received
message. With BufferPool each connection can be instantiated with its own BufferPool and reuse
the space freed by old messages for new ones.

Unsafe

There are 3 unsafes:
buffer_pool/mod.rs 593
slice.rs 8
slice.rs 27

Performance

To run the benchmarks cargo bench --features criterion.

To have an idea of the performance gains, BufferPool is benchmarked against
BufferFromSystemMemory and two control structures PPool and MaxEfficeincy.

PPool is a buffer pool implemented with a hashmap and MaxEfficeincy is a Buffer implemented in the fastest possible way so that the benches do not panic and the compiler does not complain. Btw they are both completely broken, useful only as references for the benchmarks.

The benchmarks are:

Single thread

for 0..1000:
  add random bytes to the buffer
  get the buffer
  add random bytes to the buffer
  get the buffer
  drop the 2 buffer

Multi threads (this is the most similar to the actual use case IMHO)

for 0..1000:
  add random bytes to the buffer
  get the buffer
  send the buffer to another thread   -> wait 1 ms and then drop it
  add random bytes to the buffer
  get the buffer
  send the buffer to another thread   -> wait 1 ms and then drop it

Multi threads 2

for 0..1000:
  add random bytes to the buffer
  get the buffer
  send the buffer to another thread   -> wait 1 ms and then drop it
  add random bytes to the buffer
  get the buffer
  send the buffer to another thread   -> wait 1 ms and then drop it
  wait for the 2 buffer to be dropped

Test

Some failing cases from fuzz.

From the benchmark in BENCHES.md executed for 2000 samples:

* single thread with  `BufferPool`: ---------------------------------- 7.5006 ms
* single thread with  `BufferFromSystemMemory`: ---------------------- 10.274 ms
* single thread with  `PPoll`: --------------------------------------- 32.593 ms
* single thread with  `MaxEfficeincy`: ------------------------------- 1.2618 ms
* multi-thread with   `BufferPool`: ---------------------------------- 34.660 ms
* multi-thread with   `BufferFromSystemMemory`: ---------------------- 142.23 ms
* multi-thread with   `PPoll`: --------------------------------------- 49.790 ms
* multi-thread with   `MaxEfficeincy`: ------------------------------- 18.201 ms
* multi-thread 2 with `BufferPool`: ---------------------------------- 80.869 ms
* multi-thread 2 with `BufferFromSystemMemory`: ---------------------- 192.24 ms
* multi-thread 2 with `PPoll`: --------------------------------------- 101.75 ms
* multi-thread 2 with `MaxEfficeincy`: ------------------------------- 66.972 ms

From the above numbers, it results that BufferPool always outperform the hashmap buffer pool and
the solution without a pool:

Single thread

If the buffer is not sent to another context BufferPool is 1.4 times faster than no pool and 4.3 time
faster than the hashmap pool and 5.7 times slower than max efficiency.

Multi thread

If the buffer is sent to other contexts BufferPool is 4 times faster than no pool, 0.6 times faster
than the hashmap pool and 1.8 times slower than max efficiency.

Fuzzy tests

Run them with cargo fuzz run slower -- -rss_limit_mb=5000000000 and
cargo fuzz run faster -- -rss_limit_mb=5000000000

BufferPool is fuzzy tested with cargo fuzzy. The test checks if slices created by BufferPool
still contain the same bytes contained at creation time after a random amount of time and after been
sent to other threads. There are 2 fuzzy test, the first (faster) it map a smaller input space to
test the most likely inputs. The second (slower) it have a bigger input space to pick "all" the
corner case. The slower also forces the buffer to be sent to different cores. I ran both for several
hours without crashes.

The test must be run with -rss_limit_mb=5000000000 cause they check BufferPool with capacities from 0 to 2^32.

From buffer_pool/mod.rs 225 (but also applicable to buffer_pool/mod.rs 275)

               let src_offset =
                   (&self.pool[self.raw_offset..self.raw_offset + self.raw_len]).as_ptr();
               let dest = &mut self.pool[0..self.raw_len];
               unsafe {
                   let src = core::slice::from_raw_parts(src_offset, self.raw_len);
                   dest.copy_from_slice(src);
               };

Even though it will probably work (for the reason you used to justify it's soundness), this is actually UB. Have you considered using a safe alternative like <[T]>::copy_within?

self.pool.copy_within(self.raw_offset..self.raw_offset+self.raw_len, 0);
1 Like

Ty! This seem to be a great improvement. I will run some benchmark with the modified version, but I think that the overhead of memmove over memcpy will be negligible.

@SkiFire13 I'm updating the code with your solution:

  1. it remove an UB
  2. the address sanitizer do not complain anymore
  3. the code is more clear
  4. test and fuzzy test are passing
  5. bench for 500 samples say that performances are very similar (-1.5 % in single thread, -0.1% in multi thread)
1 Like