High-performance Kafka consumer service

barkanido · December 23, 2020, 10:25am

Hey folks, I'm planning to write a service that consumes from Kafka and writes a few keys to a key-value data store for each message consumed. I plan to use rust-rdkafka. The topic currently carries around 250K messages per second, so performance counts.
The thing is, my database client is blocking. Currently I see two options:

1. Use the non-blocking mode of the Kafka consumer and do everything on Tokio thread pools (tokio::spawn + tokio::task::spawn_blocking).
2. Consume in a serial, blocking way (with while let Some(message) = message_stream.next().await) or somehow feed this into rayon's par_iter.

My biggest concern is being able to apply backpressure to message consumption when the database starts to fail or respond slowly.
What do you guys think would work best?

alice · December 23, 2020, 11:29am

My recommendation is to spawn a dedicated thread for your Kafka connection. You can read why this is preferable to spawn_blocking in my blog post on blocking, specifically the section on things that run forever.

To communicate with the Kafka thread, I recommend the use of a bounded tokio::sync::mpsc channel. Because a bounded channel makes the sender wait once it is full, it gives your application backpressure.
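As a rough illustration of the dedicated-thread-plus-bounded-channel idea, here is a minimal sketch. It uses std::sync::mpsc::sync_channel as a stand-in for tokio::sync::mpsc so that it needs no external crates, and the "Kafka thread" just produces fake offsets. The function name run_with_backpressure and its parameters are made up for the example.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Spawns a dedicated producer thread that pushes `total` fake messages into
/// a bounded channel holding at most `capacity` items, and returns how many
/// messages the receiving side handled.
fn run_with_backpressure(total: u64, capacity: usize) -> u64 {
    // sync_channel is bounded: send() blocks once `capacity` items are queued.
    let (tx, rx) = sync_channel::<u64>(capacity);

    // Stand-in for the dedicated Kafka thread (hypothetical: a real one would
    // poll an rdkafka consumer instead of counting offsets).
    let consumer = thread::spawn(move || {
        for offset in 0..total {
            // When the downstream side is slow, this call blocks.
            // That blocking is the backpressure.
            if tx.send(offset).is_err() {
                break; // receiver is gone, stop consuming
            }
        }
        // `tx` is dropped here, which closes the channel.
    });

    let mut handled = 0;
    for _msg in rx {
        // A (hypothetical) blocking database write would go here.
        handled += 1;
    }

    consumer.join().expect("consumer thread panicked");
    handled
}
```

Calling run_with_backpressure(1000, 8) never has more than 8 messages queued between the two sides, no matter how slow the receiving loop is.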

If you use Tokio 0.3, the mpsc channel provides blocking versions of the send and recv methods (blocking_send and blocking_recv), which you can use in the dedicated thread. Alternatively, on 0.2, you can use futures::executor::block_on on the async send/recv methods.

Regarding the 250K msg/sec thing, you might want to try batching the messages you send into the mpsc channel for better performance. An alternative optimization you can try is to wrap the entire blocking Kafka loop inside block_on so you can just send with an .await. Blocking the thread is not a problem here, because you can arrange it so that no other tasks run inside that block_on call.
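The batching suggestion could look roughly like the sketch below. Again, std::sync::mpsc::sync_channel stands in for the Tokio channel, the messages are plain integers, and send_in_batches, batch_size and capacity are names invented for the example. The point is only that one channel operation now carries many messages.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Sends `total` fake messages through a bounded channel in batches of
/// `batch_size` (assumed to be at least 1) and returns
/// (number of batches received, number of messages received).
fn send_in_batches(total: u32, batch_size: usize, capacity: usize) -> (usize, usize) {
    // The channel carries whole batches, so `capacity` counts batches.
    let (tx, rx) = sync_channel::<Vec<u32>>(capacity);

    let producer = thread::spawn(move || {
        let mut batch = Vec::with_capacity(batch_size);
        for msg in 0..total {
            batch.push(msg);
            if batch.len() == batch_size {
                // Hand the full batch over and start a fresh, empty one.
                let full = std::mem::replace(&mut batch, Vec::with_capacity(batch_size));
                if tx.send(full).is_err() {
                    return; // receiver is gone
                }
            }
        }
        // Flush the last, partially filled batch.
        if !batch.is_empty() {
            let _ = tx.send(batch);
        }
    });

    let mut batches = 0;
    let mut messages = 0;
    for batch in rx {
        batches += 1;
        messages += batch.len();
    }

    producer.join().expect("producer thread panicked");
    (batches, messages)
}
```

A real consumer would probably also flush on a timer so that a quiet topic does not leave a partial batch waiting; that is left out here to keep the sketch short.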

barkanido · December 23, 2020, 1:02pm

Thanks,

> To communicate with the Kafka thread, I recommend the use of a bounded tokio::sync::mpsc

This is what we currently do in Clojure with core.async, and it works well. Good to know that this is the best practice here as well.
The basic algorithm for a message mixes CPU-bound and IO-bound processing, so I am wondering whether (and how) to tell Tokio which part to run on which thread pool, and how exactly I should chain the steps:

1. consume a message
2. parse JSON
3. validation + possible filtering
4. possibly producing some data into another Kafka topic
5. write a few records into Aerospike
6. GOTO 1
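One way to picture the steps above is to keep the CPU-only parts (parse, validate/filter) as plain functions and pass the IO part in as a closure, so the caller decides which thread runs it. This is only a sketch under several assumptions: the "key:value" text format stands in for JSON to avoid external crates, and Event, Outcome, parse, keep and process are all hypothetical names.

```rust
#[derive(Debug, PartialEq)]
struct Event {
    key: String,
    value: i64,
}

#[derive(Debug, PartialEq)]
enum Outcome {
    Written(Event),
    Filtered,
    Invalid(String),
}

/// Step 2 stand-in: parses "key:value" text instead of real JSON.
fn parse(raw: &str) -> Result<Event, String> {
    let (key, value) = raw
        .split_once(':')
        .ok_or_else(|| format!("malformed message: {}", raw))?;
    let value = value.trim().parse::<i64>().map_err(|e| e.to_string())?;
    Ok(Event { key: key.trim().to_string(), value })
}

/// Step 3: validation and filtering (the rule here is arbitrary).
fn keep(event: &Event) -> bool {
    !event.key.is_empty() && event.value >= 0
}

/// Steps 2 to 5 for one message. `write` is the IO step (for example the
/// blocking database write); producing to another topic would be a second
/// closure of the same shape and is omitted here.
fn process<W: FnMut(&Event)>(raw: &str, mut write: W) -> Outcome {
    match parse(raw) {
        Err(reason) => Outcome::Invalid(reason),
        Ok(event) if !keep(&event) => Outcome::Filtered,
        Ok(event) => {
            write(&event);
            Outcome::Written(event)
        }
    }
}
```

With this split, process("user:7", |e| store.push(e.key.clone())) runs the parsing and filtering inline and only calls the closure for messages that survive, so a slow store costs nothing for filtered or invalid input.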

barkanido · December 27, 2020, 7:46am

Doesn't tokio::sync::mpsc limit the processing part to a single task handling every message?
The rust-rdkafka async example shows tasks multiplexed across num_workers. How do I choose an optimal number of workers?

alice · December 27, 2020, 8:57am

If you want multiple receivers, you can use the async-channel or flume crates which provide mpmc channels.
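To show the shape of a multi-receiver setup without pulling in async-channel or flume, the sketch below shares a single std receiver between worker threads through Arc<Mutex<...>>. That is a swapped-in technique, not how those crates work internally, and fan_out, num_workers and capacity are names chosen for the example. With a real MPMC channel each worker would simply hold its own clone of the receiver.

```rust
use std::sync::mpsc::sync_channel;
use std::sync::{Arc, Mutex};
use std::thread;

/// Sends the numbers 1..=total through one bounded channel to `num_workers`
/// worker threads (assumed to be at least 1) and returns the sum of
/// everything the workers received.
fn fan_out(total: u64, num_workers: usize, capacity: usize) -> u64 {
    let (tx, rx) = sync_channel::<u64>(capacity);
    // std's Receiver cannot be cloned, so the workers share it behind a mutex.
    let rx = Arc::new(Mutex::new(rx));

    let workers: Vec<_> = (0..num_workers)
        .map(|_| {
            let rx = Arc::clone(&rx);
            thread::spawn(move || {
                let mut sum = 0u64;
                loop {
                    // The lock guard is a temporary, so it is released at the
                    // end of this statement, before the message is processed.
                    let msg = rx.lock().expect("receiver lock poisoned").recv();
                    match msg {
                        Ok(n) => sum += n, // stand-in for real processing
                        Err(_) => break,   // channel closed and drained
                    }
                }
                sum
            })
        })
        .collect();

    for n in 1..=total {
        tx.send(n).expect("all workers exited early");
    }
    drop(tx); // close the channel so the workers can finish

    workers
        .into_iter()
        .map(|w| w.join().expect("worker thread panicked"))
        .sum()
}
```

Because num_workers is just a parameter, timing a call like fan_out(1_000_000, n, 1024) for a few values of n is a simple starting point for picking a worker count.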

As for finding an optimal number of workers, I recommend benchmarking it.

