One consumer to multiple tables or many consumers per table

Question

I have a Kafka topic with millions of sale events. I have a consumer which, on every message, inserts the data into 4 tables:

1 for the raw sales
1 for the sales sum by date by product category (date, product_category, sale_sum)
1 for the sales sum by date for customer (date, customer_id, sale_sum)
1 for the sales sum by date for location (date, location_id, sale_sum)

I use a SQL database for storing my data, so the operations above are insert or update operations.

I am wondering, would it be better to have (i) 1 consumer insert into these 4 tables or (ii) 4 consumers, each responsible for inserting into each table?

What is best practice here?

Thanks

aran · Accepted Answer

From my point of view, you have three different alternatives. I'd personally choose the third one.

1 - One [consumer-producer] thread

In this scenario, you just have one thread that is responsible for:

1-Reading from Kafka
2-Process/Store in I
3-Process/Store in II
4-Process/Store in III
5-Process/Store in IV

All that, in sequential order, as you just have one thread that both consumes and processes the messages.

  kafka-->(read)-->(process 1)-->(process 2)-->(process 3)-->(process 4)

In this case, if any of steps 2 to 5 runs into trouble and its processing speed drops, your entire process slows down. The Kafka topic's lag then keeps growing for as long as the thread fails to finish the 5th step before the next message arrives at Kafka.

For me, this is a no-no regarding performance and fault tolerance.
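A minimal sketch of this sequential flow, assuming SQLite as a stand-in for the SQL database and a plain list as a stand-in for the Kafka topic. The table names, the `amount` field and the helper names (`create_tables`, `add_to_sum`, `handle`) are hypothetical; only the aggregate columns come from the question.

```python
import sqlite3

# Hypothetical mapping: aggregate table name -> the event field it groups by.
SUM_TABLES = {
    "sales_by_category": "product_category",
    "sales_by_customer": "customer_id",
    "sales_by_location": "location_id",
}

def create_tables(conn):
    conn.execute(
        "CREATE TABLE raw_sales (date TEXT, product_category TEXT, "
        "customer_id INTEGER, location_id INTEGER, amount REAL)"
    )
    for table, key in SUM_TABLES.items():
        conn.execute(
            f"CREATE TABLE {table} (date TEXT, {key}, sale_sum REAL, "
            f"PRIMARY KEY (date, {key}))"
        )

def add_to_sum(conn, table, key, event):
    # Insert-or-update: try the UPDATE first, INSERT only if no row matched.
    cur = conn.execute(
        f"UPDATE {table} SET sale_sum = sale_sum + ? WHERE date = ? AND {key} = ?",
        (event["amount"], event["date"], event[key]),
    )
    if cur.rowcount == 0:
        conn.execute(
            f"INSERT INTO {table} (date, {key}, sale_sum) VALUES (?, ?, ?)",
            (event["date"], event[key], event["amount"]),
        )

def handle(conn, event):
    # Steps 2 to 5 run strictly one after another on the reading thread,
    # so a slow write to any table delays the next read.
    conn.execute(
        "INSERT INTO raw_sales VALUES (?, ?, ?, ?, ?)",
        (event["date"], event["product_category"], event["customer_id"],
         event["location_id"], event["amount"]),
    )
    for table, key in SUM_TABLES.items():
        add_to_sum(conn, table, key, event)
    conn.commit()

# Stand-in for the messages read from Kafka (made-up sample data).
events = [
    {"date": "2021-01-01", "product_category": "books",
     "customer_id": 7, "location_id": 3, "amount": 10.0},
    {"date": "2021-01-01", "product_category": "books",
     "customer_id": 8, "location_id": 3, "amount": 5.0},
]

conn = sqlite3.connect(":memory:")
create_tables(conn)
for event in events:
    handle(conn, event)
```

Because `handle` only returns after all four writes, the read loop can never go faster than the slowest table.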

2 - Four [consumer-producer]s

This uses the same paradigm as the first scenario: the thread that reads is also responsible for the processing.

But, thanks to consumer groups, you can parallelise the whole process. Create 4 different consumer groups and assign one consumer to each. Since every group receives every message of the topic, each consumer sees all the sale events.

For simplicity, let's just create one thread per consumer group.

In this scenario, you have something like:

CONSUMER CG1
1-Reading from Kafka
2-Process/Store in I

CONSUMER CG2
1-Reading from Kafka
2-Process/Store in II

CONSUMER CG3
1-Reading from Kafka
2-Process/Store in III

CONSUMER CG4
1-Reading from Kafka
2-Process/Store in IV

       |-->consumer 1-->(process1)-->T1
  kafka|-->consumer 2-->(process2)-->T2
       |-->consumer 3-->(process3)-->T3
       |-->consumer 4-->(process4)-->T4

Advantages: each thread is responsible for a limited number of tasks. This will help with the lag of each consumer group.

Furthermore, if one of the storing tasks fails or slows down, that won't affect the other three threads: they will continue reading and processing from Kafka on their own.
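The independence of the four groups can be illustrated with a toy model. This is not a Kafka client: `SimulatedTopic` is a hypothetical in-memory stand-in that keeps one offset per group, which is roughly what Kafka does per group and partition. With a real client, each consumer would simply be configured with its own `group.id`.

```python
class SimulatedTopic:
    """In-memory stand-in for a topic; tracks one read offset per consumer group."""

    def __init__(self, messages):
        self._log = list(messages)
        self._offsets = {}

    def poll(self, group_id, max_records=1):
        # Each group reads from its own offset, so groups never steal
        # messages from one another.
        start = self._offsets.get(group_id, 0)
        batch = self._log[start:start + max_records]
        self._offsets[group_id] = start + len(batch)
        return batch

    def lag(self, group_id):
        return len(self._log) - self._offsets.get(group_id, 0)


topic = SimulatedTopic(range(10))
tables = {"cg1": [], "cg2": [], "cg3": [], "cg4": []}  # stand-ins for T1..T4

for tick in range(10):
    for group_id, table in tables.items():
        # Assumption for the demo: cg3 writes to a slow table and only
        # manages to poll on every other tick.
        if group_id == "cg3" and tick % 2:
            continue
        table.extend(topic.poll(group_id))

for group_id in tables:
    print(group_id, "lag:", topic.lag(group_id))
```

Only the slow group falls behind; the other three keep their lag at zero, which is the fault isolation described above.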

3 - Decouple consuming and processing

This is by far, in my opinion, the best possible solution.

You divide the tasks of reading and the tasks of processing. This way, you can for example launch:

One consumer thread

This just reads the messages from Kafka and stores them in in-memory queues, or similar structures that are accessible from the worker threads, and that's all. It just keeps reading and putting the messages in the queues.

X worker threads (in this case, 4)

These threads are responsible for getting the messages that the consumer put in the queue (or queues, depending on how you want to code it), and processing/storing the messages in each table.

Something like:

                            |--> queue1 -----> worker 1 --> T1
  kafka--->consumer--(msg)--|--> queue2 -----> worker 2 --> T2
                            |--> queue3 -----> worker 3 --> T3
                            |--> queue4 -----> worker 4 --> T4

What you get here is parallelization and decoupling of processing and consuming. Here Kafka's lag will be 0 about 99% of the time.

In this approach, the queues act as buffers if some of the workers get stuck. The rest of the system (mainly Kafka) will not be affected by the processing logic.
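A minimal sketch of this layout using only the Python standard library. The Kafka read loop is replaced by iteration over an in-memory sequence, and the four tables by plain lists; `consume`, `work` and `run_pipeline` are hypothetical names chosen for illustration.

```python
import queue
import threading

SENTINEL = None  # tells a worker that no more messages will arrive

def consume(messages, queues):
    """Consumer thread: read each message and fan it out to every worker queue."""
    for msg in messages:
        for q in queues:
            q.put(msg)
    for q in queues:
        q.put(SENTINEL)

def work(q, store):
    """Worker thread: drain one queue and store each message in one table."""
    while True:
        msg = q.get()
        if msg is SENTINEL:
            break
        store(msg)

def run_pipeline(messages, stores):
    queues = [queue.Queue() for _ in stores]
    workers = [threading.Thread(target=work, args=(q, store))
               for q, store in zip(queues, stores)]
    consumer = threading.Thread(target=consume, args=(messages, queues))
    for t in workers + [consumer]:
        t.start()
    for t in [consumer] + workers:
        t.join()

# Four lists stand in for T1..T4; list.append stands in for the SQL write.
tables = [[] for _ in range(4)]
run_pipeline(range(100), [t.append for t in tables])
```

The consumer never waits for a table write, only for `q.put`, which returns immediately on an unbounded queue. Each worker sees the messages in the order they were read because it owns a single FIFO queue.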

Note that even though Kafka won't start lagging and possibly losing messages due to retention, the internal queues must be monitored, or configured properly to send the lagged messages inside the queue to a dead-letter queue, in order to avoid the consumer getting stuck.
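One possible way to do that, sketched under the assumption of a single consumer thread per queue: bound the queue and, when it is full, move the oldest (most lagged) message to a dead-letter list before enqueueing the new one. The `offer` helper and the eviction policy are illustrative choices, not something prescribed by Kafka.

```python
import queue

def offer(q, msg, dead_letters):
    """Enqueue msg without ever blocking the consumer thread.

    If the bounded queue is full, the oldest message is moved to
    dead_letters to make room. Assumes a single producer per queue.
    """
    try:
        q.put_nowait(msg)
    except queue.Full:
        try:
            dead_letters.append(q.get_nowait())  # evict the most lagged message
        except queue.Empty:
            pass  # the worker drained the queue in the meantime
        q.put_nowait(msg)  # a slot is free now, since nobody else produces

# Demo with a worker that is stuck (nothing reads from the queue).
q = queue.Queue(maxsize=3)
dead_letters = []
for msg in range(5):
    offer(q, msg, dead_letters)

remaining = []
while not q.empty():
    remaining.append(q.get_nowait())
```

In this demo the two oldest messages end up in `dead_letters` and the three newest stay queued. Blocking on `put` with a timeout would be an alternative policy if dropping messages from the live path is not acceptable.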

The KafkaConsumer Javadoc explains the pros and cons of each paradigm in more detail.

To sum up the advantages of the third scenario:

The consumer thread just consumes. This avoids Kafka lagging, delays in the data that must be processed (remember, this should be near real-time) and loss of messages because of retention kicking in.

The other x workers are responsible for the actual processing logic. If something fails in one of them, no other consumer or worker thread gets affected.

Tags:

apache-kafka

stream-processing

friartuck
