Why are Kafka offsets stored at the consumer end of the application?

Tags:

apache-kafka

We have a Kafka cluster and a Spark Streaming consumer. Currently, the offsets are stored on the consumer side in a data store. Given that the latest Kafka (which is what we use) provides the ability to store consumer offsets at the broker level (in the __consumer_offsets topic in Kafka), what is the reasoning for storing them on the consumer side?

One argument would be that if the Kafka cluster goes down, we still have the offset information. But if the Kafka cluster goes down, the messages are lost as well, and no message can be replayed for a given offset.

I am probably missing something obvious, but I can't figure it out. Thanks.

asked Nov 18 '25 by brain storm

2 Answers

As I understand it, the core question you want answered is:

One argument would be that if the Kafka cluster goes down, we still have the offset information. But if the Kafka cluster goes down, the messages are lost as well, and no message can be replayed for a given offset.

Storing offset ranges externally allows Spark Streaming applications to restart and replay messages from any point in time, as long as those messages are still available in Kafka. So the decision to store offsets externally is likely not based on recovery scenarios alone, but is rather a general one.
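For illustration, here is a minimal sketch of such a replay using the spark-streaming-kafka-0-10 direct stream API. The broker address, topic, group id, and starting offsets are hypothetical placeholders, and ssc is assumed to be an existing StreamingContext:

    import org.apache.kafka.common.TopicPartition
    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Assign
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker:9092",              // hypothetical broker
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "replay-example",                    // hypothetical group id
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    // Offsets read back from the external data store; the values here are
    // placeholders for whatever your store returned.
    val fromOffsets = Map(
      new TopicPartition("events", 0) -> 42L,
      new TopicPartition("events", 1) -> 17L
    )

    // Resume consumption exactly where the external store says we stopped,
    // regardless of what (if anything) is committed in Kafka itself.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Assign[String, String](fromOffsets.keys.toList, kafkaParams, fromOffsets)
    )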

This article from Cloudera is very good.

answered Nov 21 '25 by senseiwu


As mentioned in the Spark Streaming + Kafka Integration Guide, the way you store and commit offsets depends on how strict your reliability requirements are.

You have a couple of options depending on which streaming API you use.

  1. DStream

The first and simplest option is to configure an external checkpoint location for stashing your data and consumer offsets. It allows you to easily recover your Spark job after failures and to produce idempotent output (handy when you write data to files). When you use DStreams, you should disable enable.auto.commit.
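A minimal sketch of that setup, assuming a hypothetical checkpoint directory and batch interval; on restart, getOrCreate rebuilds the context (including stored offsets) from the checkpoint if one exists:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///tmp/app-checkpoint"   // hypothetical path

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("kafka-dstream-app")
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir)
      // ... create the Kafka direct stream and the processing graph here,
      // with "enable.auto.commit" -> (false: java.lang.Boolean) in kafkaParams ...
      ssc
    }

    // Recover from the checkpoint if present, otherwise build a fresh context.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()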

Alternatively, you may commit offsets manually, either to Kafka or to your own storage (see the examples in the link above). In this case you are responsible for making your outputs idempotent.
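A sketch of the manual-commit pattern described in the guide, assuming stream is the direct stream from the earlier snippet; offsets are committed back to Kafka's __consumer_offsets topic only after the batch's output has been written:

    import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

    stream.foreachRDD { rdd =>
      // The exact offset ranges that make up this batch.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      // ... write the batch's output here; it must be idempotent, because a
      // failure after the write but before the commit causes a replay ...

      // Commit back to Kafka only once the output is safely stored.
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }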

  2. Structured Streaming

Here you have no option other than storing offsets in a checkpoint directory (e.g. on HDFS). See the Structured Streaming + Kafka Integration Guide (the same for Spark 2.2.x and 2.3.0).
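A minimal sketch with hypothetical broker, topic, and paths; the checkpointLocation option is where Structured Streaming keeps the offsets and query metadata:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("kafka-structured").getOrCreate()

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")   // hypothetical broker
      .option("subscribe", "events")                      // hypothetical topic
      .load()

    // Spark tracks its own offsets in the checkpoint directory; committed
    // Kafka offsets are not used for recovery.
    val query = df.writeStream
      .format("parquet")
      .option("path", "hdfs:///data/events")              // hypothetical output path
      .option("checkpointLocation", "hdfs:///chk/events")
      .start()

    query.awaitTermination()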

answered Nov 21 '25 by zavyrylin