How to deduplicate messages when streaming from Kafka with Spark Streaming?

I have a case where Kafka producers send data twice a day. These producers read all the data from the database/files and send it to Kafka, so the messages sent each day are duplicates of what was sent before. I need to deduplicate the messages and write them to persistent storage using Spark Streaming. What would be the best way to remove the duplicate messages in this case?

Each duplicate message is a JSON string in which only the timestamp field has been updated.
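
For illustration only (these field names are hypothetical), a pair of such duplicates might look like:

    {"id": 42, "name": "foo", "timestamp": "2025-10-29T06:00:00Z"}
    {"id": 42, "name": "foo", "timestamp": "2025-10-29T18:00:00Z"}

Everything except the timestamp is identical, so a stable id field (or the payload minus the timestamp) can serve as the deduplication key.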

Note: I can't change the Kafka producer to send only the new data/messages; it's already installed on the client machine and was written by someone else.

asked Oct 29 '25 by koiralo


2 Answers

For deduplication, you need to store information somewhere about what has already been processed (for example, the unique IDs of the messages).

To store this information, you can use:

  1. Spark checkpoints. Pros: available out of the box. Cons: if you update the source code of the app, you need to clear the checkpoints, and as a result you lose that information. This solution can work if the deduplication requirements are not strict. (See the Structured Streaming sketch after this list.)

  2. Any database. For example, if you are running in a Hadoop environment, you can use HBase. For every message, do a 'get' (to check that it wasn't processed before), and mark it in the DB as sent only when it has actually been sent. (See the HBase sketch below.)
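
Here is a minimal sketch of option 1 with Spark Structured Streaming, whose dropDuplicates operator keeps its "already seen" state in the checkpoint location. The broker address, topic name, paths, and the "id" field are assumptions for illustration:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.get_json_object

    object DedupWithCheckpoints {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("kafka-dedup").getOrCreate()
        import spark.implicits._

        val raw = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker
          .option("subscribe", "events")                       // assumed topic
          .load()

        // Extract a stable business key ("id" is an assumed field name)
        // from the JSON value of each Kafka record.
        val parsed = raw
          .selectExpr("CAST(value AS STRING) AS json")
          .withColumn("id", get_json_object($"json", "$.id"))

        // The set of seen ids is stored in the checkpoint location, so a
        // record whose id appeared before is dropped. Clearing the
        // checkpoint (e.g. after a code change) loses this state.
        val deduped = parsed.dropDuplicates("id")

        deduped.writeStream
          .format("parquet")
          .option("path", "/data/deduped")            // assumed output path
          .option("checkpointLocation", "/chk/dedup") // assumed checkpoint dir
          .start()
          .awaitTermination()
      }
    }

Without a watermark, dropDuplicates keeps state for every key it has ever seen, which is exactly the trade-off described in point 1.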

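And a minimal sketch of the HBase check from option 2 (the table name "processed_ids" and column family "f" are assumptions); in a real job you would open one connection per partition, not per message:

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory, Get, Put}
    import org.apache.hadoop.hbase.util.Bytes

    // Returns true if the message id was not seen before, and marks it as
    // processed. Assumes an HBase table "processed_ids" with family "f".
    def markIfNew(connection: Connection, messageId: String): Boolean = {
      val table = connection.getTable(TableName.valueOf("processed_ids"))
      try {
        val row = Bytes.toBytes(messageId)
        if (table.exists(new Get(row))) {
          false // duplicate: already processed in an earlier run
        } else {
          // In production, write this marker only after the message has
          // been durably persisted downstream; done eagerly here for brevity.
          table.put(new Put(row).addColumn(
            Bytes.toBytes("f"), Bytes.toBytes("seen"), Bytes.toBytes(true)))
          true
        }
      } finally {
        table.close()
      }
    }

    // Typical usage inside a Spark job, one connection per partition
    // ('write' stands in for your own sink logic):
    // rdd.foreachPartition { records =>
    //   val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
    //   try records.filter(r => markIfNew(conn, r.id)).foreach(write)
    //   finally conn.close()
    // }
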
answered Oct 31 '25 by Natalia


You can change the topic configuration to compacted mode. With compaction, a record with the same key will be overwritten/updated in the Kafka log, so you get only the latest value for a key from Kafka.

You can read more about compaction in the Kafka documentation.
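
A minimal sketch of switching an existing topic to compaction with Kafka's AdminClient (the broker address and the topic name "events" are assumptions; the same change can be made with the kafka-configs.sh CLI):

    import java.util.{Collections, Properties}
    import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, AlterConfigOp, ConfigEntry}
    import org.apache.kafka.common.config.ConfigResource

    object CompactTopic {
      def main(args: Array[String]): Unit = {
        // Admin client pointing at the (assumed) broker address.
        val props = new Properties()
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
        val admin = AdminClient.create(props)

        // Set cleanup.policy=compact on the (assumed) topic "events".
        val topic = new ConfigResource(ConfigResource.Type.TOPIC, "events")
        val op = new AlterConfigOp(
          new ConfigEntry("cleanup.policy", "compact"), AlterConfigOp.OpType.SET)
        val ops: java.util.Collection[AlterConfigOp] = Collections.singletonList(op)

        admin.incrementalAlterConfigs(Collections.singletonMap(topic, ops)).all().get()
        admin.close()
      }
    }

Keep in mind that compaction only works when the producer sets a non-null key on each record, and that the log cleaner removes old values asynchronously, so consumers can still see some duplicates until it runs.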

answered Oct 31 '25 by Kamal Chandraprakash


