Simple *and* fast channel? Too good to be true

Hi all!

I built a bespoke, simple channel implementation for use in a project, and I decided to benchmark it against other offerings.
I was somewhat surprised to find that it outperforms most of them, but also that flume/crossbeam do not perform as well against std_mpsc as I would have expected.

I believe that there must be something flawed in either my implementation or benching setup, and I would appreciate any explanations!

  • Why is bufchan performing so well?
  • Is my implementation flawed?
  • Is the benching setup flawed?
  • Why do crossbeam/flume not perform well?

Benchmarks:

bench            fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ int                         │               │               │               │         │
│  ├─ bufchan    5.854 ms      │ 12.56 ms      │ 6.145 ms      │ 6.619 ms      │ 100     │ 100
│  ├─ crossbeam  36.37 ms      │ 66.97 ms      │ 57.17 ms      │ 56.74 ms      │ 100     │ 100
│  ├─ flume      137.1 ms      │ 144.3 ms      │ 140.9 ms      │ 140.8 ms      │ 100     │ 100
│  ├─ kanal      16.6 ms       │ 31.93 ms      │ 23.38 ms      │ 24.34 ms      │ 100     │ 100
│  ╰─ std_mpsc   51.75 ms      │ 77.38 ms      │ 69.05 ms      │ 64.14 ms      │ 100     │ 100
╰─ non_copy                    │               │               │               │         │
   ├─ bufchan    51.47 ms      │ 172.2 ms      │ 93.13 ms      │ 96.92 ms      │ 100     │ 100
   ├─ crossbeam  89.72 ms      │ 154.1 ms      │ 120.8 ms      │ 128.7 ms      │ 100     │ 100
   ├─ flume      507.5 ms      │ 526.1 ms      │ 521.7 ms      │ 521.1 ms      │ 100     │ 100
   ├─ kanal      262.8 ms      │ 340.7 ms      │ 321.4 ms      │ 320.5 ms      │ 100     │ 100
   ╰─ std_mpsc   68.34 ms      │ 88.19 ms      │ 70.11 ms      │ 71.23 ms      │ 100     │ 100

Keep in mind that bufchan was built to provide high throughput for the results of a parallelised heavy computation; the idea was to avoid blocking the sender as much as possible to keep the CPUs going brrr.
bufchan is also incredibly simple: no unsafe, very easy to follow. That seems like a positive?
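For context, the sending side is roughly this shape (a minimal sketch of the approach, not the actual bufchan code; the names are mine, though lim and the try_lock trick come up later in the thread):

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

// Sketch of the batching sender: items accumulate in a sender-local Vec
// and are moved into the shared queue in one go, so the hot compute
// thread rarely touches the lock.
struct Sender<T> {
    shared: Arc<Mutex<VecDeque<T>>>,
    local: Vec<T>,
    lim: usize,
}

impl<T> Sender<T> {
    fn send(&mut self, item: T) {
        self.local.push(item);
        if self.local.len() > self.lim {
            // try_lock, so the sender never blocks: if the receiver holds
            // the lock, keep accumulating locally and retry on a later send.
            if let Ok(mut queue) = self.shared.try_lock() {
                queue.extend(self.local.drain(..));
            }
        }
    }
}
```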

Repo:

It looks like this will busy-loop when waiting for messages? That's pretty bad for real-world performance even if it looks good on benchmarks.


I'm using it in production, and it is noticeably faster than flume (around 25%). In the production use case I have computations (each taking a few hundred ms) sending results across 12 threads. With flume, it seemed that the send call was choking the compute threads.

It looks like I can add parking_lot's Condvar without adding too much complexity.
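The receive loop would become something like this (a sketch of what I have in mind, assuming the shared Mutex<VecDeque> state; the sender would call notify_one after flushing a batch, and disconnect handling is omitted):

```rust
use std::collections::VecDeque;
use parking_lot::{Condvar, Mutex};

// Shared state: the queue plus a condvar the receiver parks on instead
// of spinning.
struct Shared<T> {
    queue: Mutex<VecDeque<T>>,
    avail: Condvar,
}

fn recv<T>(shared: &Shared<T>) -> T {
    let mut queue = shared.queue.lock();
    loop {
        if let Some(item) = queue.pop_front() {
            return item;
        }
        // Releases the lock while parked; re-acquires it before returning.
        shared.avail.wait(&mut queue);
    }
}
```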

I altered the receive loop to use Condvar. This still gives similar performance. I wonder if there is another way to benchmark the throughput?

bench            fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ int                         │               │               │               │         │
│  ├─ bufchan    5.771 ms      │ 11.25 ms      │ 6.013 ms      │ 6.344 ms      │ 100     │ 100
│  ├─ crossbeam  41.33 ms      │ 66.28 ms      │ 58.08 ms      │ 58.42 ms      │ 100     │ 100
│  ├─ flume      137 ms        │ 145.5 ms      │ 140.4 ms      │ 140.4 ms      │ 100     │ 100
│  ├─ kanal      18.48 ms      │ 32.61 ms      │ 23.77 ms      │ 25.36 ms      │ 100     │ 100
│  ╰─ std_mpsc   51.74 ms      │ 71.98 ms      │ 67.66 ms      │ 63.28 ms      │ 100     │ 100
╰─ non_copy                    │               │               │               │         │
   ├─ bufchan    52.54 ms      │ 178.6 ms      │ 89.24 ms      │ 95.98 ms      │ 100     │ 100
   ├─ crossbeam  99.77 ms      │ 155.2 ms      │ 119.8 ms      │ 129.1 ms      │ 100     │ 100
   ├─ flume      474.6 ms      │ 529.8 ms      │ 521.9 ms      │ 521.2 ms      │ 100     │ 100
   ├─ kanal      254.5 ms      │ 338.5 ms      │ 321.5 ms      │ 319.2 ms      │ 100     │ 100
   ╰─ std_mpsc   67.78 ms      │ 77.58 ms      │ 69.56 ms      │ 69.99 ms      │ 100     │ 100

In your implementation, the sender sends the data in batches; it's just the classic latency vs. throughput tradeoff. I doubt your implementation would perform significantly differently if you turned the batch size down to 1.

I can't be sure at a glance, and it also depends on the intended use cases. But your implementation can only be used when the sender is known to produce a finite number of items and to eventually be dropped. I don't see a flush method or timer event, so if the producer does not finish and the number of items is not a multiple of the batch size, some of the data will sit in the local queue forever.

Again, whether it's flawed depends on the intended use case.

I think it's mostly because a general-purpose library cannot assume how the producer behaves, so it cannot trade latency for throughput unconditionally. These libraries may have specialized APIs for this, but I'm not familiar with them.

As I said, it's a tradeoff between throughput and latency. A specialized library can easily outperform a general-purpose implementation; it's all tradeoffs, after all.

If all other aspects are similar, then an implementation without unsafe is definitely a plus.


I would say it's because it batches send operations, which is nice for performance but very surprising, and it can introduce deadlocks: for example, consider two tasks, each waiting for a message from the other that is still sitting in the other side's buffer.
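Concretely, something like this (a contrived sketch with a toy buffering sender and timeouts so it doesn't hang forever; the point is that each message is stuck in the other side's local buffer):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

const LIM: usize = 8;

// Toy sender that buffers locally and only forwards to the real channel
// once LIM items have accumulated, mimicking the batching behaviour.
struct BatchingSender<T> {
    inner: mpsc::Sender<T>,
    buf: Vec<T>,
}

impl<T> BatchingSender<T> {
    fn send(&mut self, item: T) {
        self.buf.push(item);
        if self.buf.len() >= LIM {
            for x in self.buf.drain(..) {
                let _ = self.inner.send(x);
            }
        }
    }
}

fn main() {
    let (atx, arx) = mpsc::channel();
    let (btx, brx) = mpsc::channel();
    let mut atx = BatchingSender { inner: atx, buf: Vec::new() };
    let mut btx = BatchingSender { inner: btx, buf: Vec::new() };

    let t1 = thread::spawn(move || {
        atx.send(1); // fewer than LIM items: stays in the local buffer
        brx.recv_timeout(Duration::from_secs(1)) // never arrives
    });
    let t2 = thread::spawn(move || {
        btx.send(2); // likewise buffered
        arx.recv_timeout(Duration::from_secs(1))
    });

    // Both receives time out: a deadlock in all but name.
    assert!(t1.join().unwrap().is_err());
    assert!(t2.join().unwrap().is_err());
}
```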

Not sure what your benchmarking setup is doing, but the fold with the overflowing add and no std::hint::black_box seems suspicious.
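For comparison, the accumulation I would expect to see is something along these lines (hypothetical, since I'm only guessing at the bench code):

```rust
use std::hint::black_box;
use std::sync::mpsc;
use std::thread;

fn main() {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        for i in 0..1_000_000u64 {
            tx.send(i).unwrap();
        }
    });
    // wrapping_add avoids overflow panics in debug builds, and black_box
    // keeps the compiler from optimising the accumulation (or the whole
    // receive loop) away.
    let sum = rx.iter().fold(0u64, |acc, x| acc.wrapping_add(black_box(x)));
    black_box(sum);
}
```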

I don't know, but I have to say this is very suspicious given that std's mpsc is now based on crossbeam, so the two should give very similar results.


Good point, a flush function would be warranted.
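Something along these lines (a sketch building on the local-buffer layout; calling it from Drop would also cover a sender that is dropped mid-batch):

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

struct Sender<T> {
    shared: Arc<Mutex<VecDeque<T>>>,
    local: Vec<T>,
}

impl<T> Sender<T> {
    /// Push any locally buffered items into the shared queue, taking the
    /// blocking lock so nothing is left sitting in the local buffer.
    fn flush(&mut self) {
        if !self.local.is_empty() {
            self.shared.lock().unwrap().extend(self.local.drain(..));
        }
    }
}
```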

That's a good idea to add.


In general, since this implementation is tailored for throughput, it seems like a valid approach for intensive compute tasks.

So adding black_box made quite a big difference and levelled the playing field a fair bit.

  • Firstly, bufchan consistently performs the best, even when not buffered.
  • crossbeam and std now have almost identical results.
  • It seems that crossbeam/std must perform some sort of optimisation for small types (probably those that fit within a word, I guess).

Overall, I think for 100 lines of implementation and no unsafe, it still seems too good to be true...

bench                 fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ int                              │               │               │               │         │
│  ├─ bufchan         36.21 ms      │ 137.7 ms      │ 36.86 ms      │ 38.39 ms      │ 100     │ 100
│  ├─ bufchan_buf0    36.52 ms      │ 139.8 ms      │ 39.61 ms      │ 43.39 ms      │ 100     │ 100
│  ├─ crossbeam       37.43 ms      │ 47.99 ms      │ 39.08 ms      │ 39.57 ms      │ 100     │ 100
│  ├─ flume           122.4 ms      │ 137.6 ms      │ 125.8 ms      │ 128 ms        │ 100     │ 100
│  ├─ kanal           39.28 ms      │ 45 ms         │ 41.34 ms      │ 41.52 ms      │ 100     │ 100
│  ╰─ std_mpsc        37.15 ms      │ 46.83 ms      │ 39.38 ms      │ 39.92 ms      │ 100     │ 100
├─ non_copy_24bytes                 │               │               │               │         │
│  ├─ bufchan         37.5 ms       │ 47.86 ms      │ 38.3 ms       │ 38.77 ms      │ 100     │ 100
│  ├─ bufchan_buf0    37.07 ms      │ 141.6 ms      │ 38.78 ms      │ 40.44 ms      │ 100     │ 100
│  ├─ crossbeam       126.7 ms      │ 138.8 ms      │ 130 ms        │ 132 ms        │ 100     │ 100
│  ├─ flume           127.4 ms      │ 145.4 ms      │ 129.7 ms      │ 132 ms        │ 100     │ 100
│  ├─ kanal           40.92 ms      │ 47 ms         │ 43.13 ms      │ 43.27 ms      │ 100     │ 100
│  ╰─ std_mpsc        127.3 ms      │ 142.6 ms      │ 129.2 ms      │ 131.5 ms      │ 100     │ 100
╰─ non_copy_256bytes                │               │               │               │         │
   ├─ bufchan         38.28 ms      │ 49.06 ms      │ 39.82 ms      │ 40.31 ms      │ 100     │ 100
   ├─ bufchan_buf0    38.14 ms      │ 143.3 ms      │ 41.46 ms      │ 44.6 ms       │ 100     │ 100
   ├─ crossbeam       127.6 ms      │ 142.1 ms      │ 130.7 ms      │ 133.3 ms      │ 100     │ 100
   ├─ flume           128.2 ms      │ 142.2 ms      │ 130.9 ms      │ 132.7 ms      │ 100     │ 100
   ├─ kanal           45.96 ms      │ 60.91 ms      │ 48.29 ms      │ 48.68 ms      │ 100     │ 100
   ╰─ std_mpsc        127.7 ms      │ 142.2 ms      │ 130.9 ms      │ 132.8 ms      │ 100     │ 100

As was mentioned previously, it's most likely due to the batching.

Note that you're still buffering even with bufchan_buf0 since you're only using try_lock on the Mutex.

In other words, your lim is actually one less than the minimum batch size. It doesn't actually limit the buffer, which can grow indefinitely.

Good point. Maybe a fair comparison would be against just a simple Mutex<VecDeque>, to highlight the effect of the buffering.
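i.e. a baseline like this (a sketch), where both sides take the lock for every single item:

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

// Unbatched baseline: one lock acquisition per item on both sides.
struct PlainChan<T>(Arc<Mutex<VecDeque<T>>>);

impl<T> PlainChan<T> {
    fn new() -> Self {
        PlainChan(Arc::new(Mutex::new(VecDeque::new())))
    }
    fn send(&self, item: T) {
        self.0.lock().unwrap().push_back(item);
    }
    fn try_recv(&self) -> Option<T> {
        self.0.lock().unwrap().pop_front()
    }
}

impl<T> Clone for PlainChan<T> {
    fn clone(&self) -> Self {
        PlainChan(Arc::clone(&self.0))
    }
}
```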


Thanks everyone for your input!

If the internal structure of a queue can benefit from batch submission, then it should have a dedicated multi-item submit API, so that the sender can choose between immediate single-item delivery and submitting a whole batch.
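Something in this shape (a hypothetical API sketch, not taken from any of the libraries above):

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

struct Sender<T> {
    shared: Arc<Mutex<VecDeque<T>>>,
}

impl<T> Sender<T> {
    /// Immediate single-item delivery: one lock acquisition per item.
    fn send(&self, item: T) {
        self.shared.lock().unwrap().push_back(item);
    }

    /// Submit a whole batch under a single lock acquisition; the caller
    /// decides when trading latency for throughput is acceptable.
    fn send_batch<I: IntoIterator<Item = T>>(&self, items: I) {
        self.shared.lock().unwrap().extend(items);
    }
}
```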

