Hi,
I would like to have some feedback on `BufferPool`. I developed `BufferPool` to replace the system allocator in some particular use cases; from the benches it seems to behave well, being at most 4.3 times faster than a naive solution. There are some unsafe parts in the code, so the crate is fuzz tested (I ran the fuzz tests for several hours and everything seems OK).
I would appreciate feedback on:
- the unsafe parts: how dangerous they are, and whether there are solutions to remove them.
- the overall approach.
- similar crates, or crates that use a different approach to achieve the same goal, so that I can mention them in the readme.
Below is an extract of the readme:
Intro
`BufferPool` is useful whenever we need to work with buffers sequentially (fill a buffer, get the filled buffer, fill a new buffer, get the filled buffer, and so on). To fill a buffer, `BufferPool` returns an `&mut [u8]` with the requested len (the filling part). When the buffer is filled and the owner needs to change, `BufferPool` returns a `Slice` that implements `Send` and `AsMut<u8>` (the get part).
`BufferPool` pre-allocates a user-defined capacity on the heap and uses it to allocate the buffers; when a `Slice` is dropped, `BufferPool` acknowledges it and reuses the freed space in the pre-allocated memory.
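To make the fill/get pattern concrete, here is a minimal self-contained sketch of the access pattern described above, using a plain `Vec<u8>` as a stand-in; `ToyPool`, `get_writable`, and `get_data_owned` are hypothetical names for illustration, not the crate's actual API.

```rust
// Sketch of the sequential fill/get pattern (hypothetical names,
// not BufferPool's actual API).
struct ToyPool {
    storage: Vec<u8>,
}

impl ToyPool {
    // The "filling part": hand out a writable slice of the requested len.
    fn get_writable(&mut self, len: usize) -> &mut [u8] {
        self.storage.resize(len, 0);
        &mut self.storage[..]
    }

    // The "get part": hand ownership of the filled bytes to the caller,
    // leaving the pool ready to fill a new buffer.
    fn get_data_owned(&mut self) -> Vec<u8> {
        std::mem::take(&mut self.storage)
    }
}

fn main() {
    let mut pool = ToyPool { storage: Vec::new() };
    pool.get_writable(4).copy_from_slice(&[1, 2, 3, 4]); // fill
    let owned = pool.get_data_owned();                   // get
    // The owned buffer is Send, so it can change owner across threads.
    let joined = std::thread::spawn(move || owned).join().unwrap();
    assert_eq!(joined, vec![1, 2, 3, 4]);
}
```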
Implementation
The crate is `[no_std]` and lock-free, so, to synchronize the state of the pre-allocated memory (taken or free) between different contexts, an `AtomicU8` is used. Each bit of the `u8` represents a memory slot: if the bit is 0 the memory slot is free, if it is 1 it is taken. Whenever `BufferPool` creates a `Slice` a bit is set to 1, and whenever a `Slice` is dropped the bit is set back to 0.
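My reading of that scheme, sketched as self-contained code (this is how such a lock-free bitmask allocator typically works, not the crate's actual implementation):

```rust
use std::sync::atomic::{AtomicU8, Ordering};

// Each bit of an AtomicU8 marks one memory slot: 0 = free, 1 = taken.
struct SlotMask {
    bits: AtomicU8,
}

impl SlotMask {
    // Try to claim the first free slot; returns its index on success.
    fn acquire(&self) -> Option<u8> {
        let mut current = self.bits.load(Ordering::Acquire);
        loop {
            let free = (!current).trailing_zeros(); // index of the first 0 bit
            if free >= 8 {
                return None; // all 8 slots taken
            }
            let claimed = current | (1 << free);
            match self.bits.compare_exchange_weak(
                current, claimed, Ordering::AcqRel, Ordering::Acquire,
            ) {
                Ok(_) => return Some(free as u8),
                Err(actual) => current = actual, // lost a race, retry
            }
        }
    }

    // Mark a slot free again (what Slice::drop would trigger).
    fn release(&self, slot: u8) {
        self.bits.fetch_and(!(1 << slot), Ordering::AcqRel);
    }
}

fn main() {
    let mask = SlotMask { bits: AtomicU8::new(0) };
    assert_eq!(mask.acquire(), Some(0));
    assert_eq!(mask.acquire(), Some(1));
    mask.release(0);
    assert_eq!(mask.acquire(), Some(0)); // freed slot is reused
}
```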
Use case
`BufferPool` has been developed to be used in proxies with thousands of connections. Each connection must parse a particular data format via a decoder, and each decoder uses 1 or 2 buffers for each received message. With `BufferPool`, each connection can be instantiated with its own `BufferPool` and reuse the space freed by old messages for new ones.
Unsafe
There are 3 unsafe blocks:
- buffer_pool/mod.rs 593
- slice.rs 8
- slice.rs 27
Performance
To run the benchmarks: `cargo bench --features criterion`.
To get an idea of the performance gains, `BufferPool` is benchmarked against `BufferFromSystemMemory` and two control structures, `PPool` and `MaxEfficeincy`. `PPool` is a buffer pool implemented with a hashmap, and `MaxEfficeincy` is a `Buffer` implemented in the fastest possible way such that the benches do not panic and the compiler does not complain. By the way, both are completely broken and useful only as references for the benchmarks.
The benchmarks are:
Single thread
for 0..1000:
- add random bytes to the buffer
- get the buffer
- add random bytes to the buffer
- get the buffer
- drop the 2 buffers
Multi threads (this is the most similar to the actual use case IMHO)
for 0..1000:
- add random bytes to the buffer
- get the buffer
- send the buffer to another thread -> wait 1 ms and then drop it
- add random bytes to the buffer
- get the buffer
- send the buffer to another thread -> wait 1 ms and then drop it
Multi threads 2
for 0..1000:
- add random bytes to the buffer
- get the buffer
- send the buffer to another thread -> wait 1 ms and then drop it
- add random bytes to the buffer
- get the buffer
- send the buffer to another thread -> wait 1 ms and then drop it
- wait for the 2 buffers to be dropped
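The shape of the multi-thread benchmark can be sketched with a channel and a plain `Vec<u8>` standing in for the pooled buffer (the real bench uses `BufferPool` and random bytes; the loop count is reduced here for brevity):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

fn main() {
    let (tx, rx) = mpsc::channel::<Vec<u8>>();
    // The "other thread": receive a buffer, wait 1 ms, then drop it.
    let consumer = thread::spawn(move || {
        let mut received = 0usize;
        for buf in rx {
            thread::sleep(Duration::from_millis(1)); // wait 1 ms
            drop(buf);                               // then drop it
            received += 1;
        }
        received
    });
    for i in 0..50u32 {
        // add bytes to the buffer, get it, send it to another thread
        let buf = i.to_le_bytes().to_vec();
        tx.send(buf).unwrap();
    }
    drop(tx); // close the channel so the consumer loop ends
    assert_eq!(consumer.join().unwrap(), 50);
}
```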
Test
The test suite includes some failing cases found by fuzzing.
From the benchmark in BENCHES.md executed for 2000 samples:
* single thread with `BufferPool`: ---------------------------------- 7.5006 ms
* single thread with `BufferFromSystemMemory`: ---------------------- 10.274 ms
* single thread with `PPool`: --------------------------------------- 32.593 ms
* single thread with `MaxEfficeincy`: ------------------------------- 1.2618 ms
* multi-thread with `BufferPool`: ---------------------------------- 34.660 ms
* multi-thread with `BufferFromSystemMemory`: ---------------------- 142.23 ms
* multi-thread with `PPool`: ---------------------------------------- 49.790 ms
* multi-thread with `MaxEfficeincy`: ------------------------------- 18.201 ms
* multi-thread 2 with `BufferPool`: ---------------------------------- 80.869 ms
* multi-thread 2 with `BufferFromSystemMemory`: ---------------------- 192.24 ms
* multi-thread 2 with `PPool`: -------------------------------------- 101.75 ms
* multi-thread 2 with `MaxEfficeincy`: ------------------------------- 66.972 ms
From the above numbers, it results that `BufferPool` always outperforms the hashmap buffer pool and the solution without a pool:
Single thread
If the buffer is not sent to another context, `BufferPool` is 1.4 times faster than no pool, 4.3 times faster than the hashmap pool, and 5.9 times slower than max efficiency.
Multi thread
If the buffer is sent to other contexts, `BufferPool` is 4.1 times faster than no pool, 1.4 times faster than the hashmap pool, and 1.9 times slower than max efficiency.
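These ratios follow directly from the table above; a quick sanity check:

```rust
// Recomputing the speedups from the 2000-sample table (times in ms).
fn ratio(a: f64, b: f64) -> String {
    format!("{:.1}", a / b)
}

fn main() {
    // Single thread: BufferPool (7.5006 ms) vs the others.
    assert_eq!(ratio(10.274, 7.5006), "1.4"); // vs BufferFromSystemMemory
    assert_eq!(ratio(32.593, 7.5006), "4.3"); // vs the hashmap pool
    assert_eq!(ratio(7.5006, 1.2618), "5.9"); // slower than MaxEfficeincy
    // Multi thread: BufferPool (34.660 ms) vs the others.
    assert_eq!(ratio(142.23, 34.660), "4.1"); // vs BufferFromSystemMemory
    assert_eq!(ratio(49.790, 34.660), "1.4"); // vs the hashmap pool
    assert_eq!(ratio(34.660, 18.201), "1.9"); // slower than MaxEfficeincy
}
```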
Fuzz tests
Run them with `cargo fuzz run slower -- -rss_limit_mb=5000000000` and `cargo fuzz run faster -- -rss_limit_mb=5000000000`.
`BufferPool` is fuzz tested with `cargo fuzz`. The test checks whether slices created by `BufferPool` still contain the same bytes they contained at creation time, after a random amount of time and after being sent to other threads. There are 2 fuzz tests: the first (faster) maps a smaller input space to test the most likely inputs; the second (slower) has a bigger input space to pick up "all" the corner cases. The slower one also forces the buffers to be sent to different cores. I ran both for several hours without crashes.
The tests must be run with `-rss_limit_mb=5000000000` because they check `BufferPool` with capacities from 0 to 2^32.
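The invariant the fuzz targets check, as I read it, can be sketched with a `Vec<u8>` standing in for the crate's `Slice`: bytes written at creation time are unchanged after the slice is sent to another thread.

```rust
use std::thread;

fn main() {
    // Fill a buffer with a known pattern at "creation time".
    let original: Vec<u8> = (0u8..=255).collect();
    let expected = original.clone();
    // Send it to another thread and get it back.
    let roundtripped = thread::spawn(move || original).join().unwrap();
    assert_eq!(roundtripped, expected); // same bytes after the thread hop
}
```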