I have a case where Kafka producers send data twice a day. These producers read all the data from the database/files and send it to Kafka, so the same messages are sent every day, which means they are duplicated. I need to deduplicate the messages and write them to persistent storage using Spark Streaming. What would be the best way to remove the duplicate messages in this case?
Each duplicate message is a JSON string in which only the timestamp field is updated.
Note: I can't change the Kafka producer to send only the new data/messages; it's already installed on the client machine and was written by someone else.
For deduplication, you need to store information somewhere about what has already been processed (for example, the unique IDs of messages).
To store this information, you can use:
Spark checkpoints. Pros: works out of the box. Cons: if you update the source code of the app, you need to clear the checkpoints, and as a result you lose that information. This solution can work if the deduplication requirements are not strict (see the dropDuplicates sketch below).
Any database. For example, if you are running in a Hadoop environment, you can use HBase. For every message you do a 'get' to check that it wasn't sent before, and you mark it as sent in the DB only when it has actually been sent (see the HBase sketch below).
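For the checkpoint-based option, here is a minimal sketch using Structured Streaming's built-in dropDuplicates, which keeps the set of already-seen IDs in the checkpoint state. The broker address, topic name, output path, and the assumption that the JSON payload carries a stable "id" field are all placeholders, not part of your setup.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.get_json_object

val spark = SparkSession.builder().appName("kafka-dedup").getOrCreate()
import spark.implicits._

// Read the raw Kafka stream; broker address and topic name are placeholders.
val raw = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "my-topic")
  .load()

// Extract a stable business id from the JSON payload (assumes an "id" field)
// and drop records whose id was already seen. The seen-id set lives in the
// checkpoint state, so clearing the checkpoint also clears the dedup history.
val deduped = raw
  .selectExpr("CAST(value AS STRING) AS json")
  .withColumn("id", get_json_object($"json", "$.id"))
  .dropDuplicates("id")

deduped.writeStream
  .format("parquet")
  .option("path", "/data/deduped")
  .option("checkpointLocation", "/data/checkpoints/dedup")
  .start()
  .awaitTermination()
```

Note that without a watermark the state of seen IDs grows indefinitely, which is usually acceptable when the ID space is bounded (the same records re-sent twice a day), but worth keeping in mind.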
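For the HBase option, the check-then-mark pattern could look roughly like this with the classic DStream API. This is a sketch: it assumes an existing DStream[ConsumerRecord[String, String]] named stream, an HBase table processed_messages with a column family d, and the helpers extractId and writeToStorage, which are placeholders for pulling a unique ID out of the JSON and writing to your persistent storage.

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

stream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // One HBase connection per partition, not per record
    val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = conn.getTable(TableName.valueOf("processed_messages"))

    records.foreach { record =>
      val json = record.value()
      val id   = extractId(json)   // placeholder: pull a stable business id out of the JSON payload

      // 'get' first: skip anything that was already marked as sent
      if (!table.exists(new Get(Bytes.toBytes(id)))) {
        writeToStorage(json)       // placeholder: write to your persistent storage
        // Mark the message as sent only after the write has succeeded
        val put = new Put(Bytes.toBytes(id))
        put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("sent_at"),
                      Bytes.toBytes(System.currentTimeMillis()))
        table.put(put)
      }
    }

    table.close()
    conn.close()
  }
}
```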
You can change the topic configuration to compact mode. With compaction, a record with the same key is overwritten/updated in the Kafka log, so you only get the latest value for a key from Kafka.
You can read more about compaction here.
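If you administer the cluster, switching an existing topic to compaction could be done via the Kafka AdminClient, sketched below. The broker address and topic name are placeholders. Keep in mind that compaction only deduplicates records that share a key (so the producer must set one) and it runs in the background, so consumers can still see duplicates that have not been compacted yet.

```scala
import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, AlterConfigOp, ConfigEntry}
import org.apache.kafka.common.config.ConfigResource

val props = new Properties()
props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092")  // placeholder broker
val admin = AdminClient.create(props)

// Set cleanup.policy=compact on the topic (topic name is a placeholder)
val topic   = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic")
val compact = new AlterConfigOp(new ConfigEntry("cleanup.policy", "compact"),
                                AlterConfigOp.OpType.SET)
val ops: java.util.Collection[AlterConfigOp] = Collections.singletonList(compact)

admin.incrementalAlterConfigs(Collections.singletonMap(topic, ops)).all().get()
admin.close()
```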