Why are Kafka offsets stored at the consumer end of the application?

Tags:

apache-kafka

We have a Kafka cluster and a Spark Streaming consumer. Currently, the offsets are stored on the consumer side in a data store. Given that the latest Kafka (which is what we use) provides the ability to store consumer offsets at the broker level (in the __consumer_offsets topic in Kafka), what is the reasoning for storing them on the consumer side?

One argument would be that if the Kafka cluster goes down, we still have the offset information. But if the Kafka cluster goes down, the messages are lost as well, and no message can be replayed for a given offset.

I am probably missing something obvious, but I can't figure it out. Thanks.

asked Nov 18 '25 by brain storm

2 Answers

As I understand it, the core question you want answered is:

One argument would be that if the Kafka cluster goes down, we still have the offset information. But if the Kafka cluster goes down, the messages are lost as well, and no message can be replayed for a given offset.

Storing offset ranges externally allows Spark Streaming applications to restart and replay messages from any point in time, as long as those messages are still available in Kafka. So the decision to store offsets externally is likely not based on recovery scenarios alone, but is rather a general one.
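For illustration, here is a minimal sketch of such a replay using the spark-streaming-kafka-0-10 direct stream API. The broker address, topic, group id, and starting offsets are hypothetical placeholders, and ssc is assumed to be an existing StreamingContext:

    import org.apache.kafka.common.TopicPartition
    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Assign
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker:9092",              // hypothetical broker
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "replay-example",                    // hypothetical group id
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    // Offsets read back from the external data store; the values here are
    // placeholders for whatever your store returned.
    val fromOffsets = Map(
      new TopicPartition("events", 0) -> 42L,
      new TopicPartition("events", 1) -> 17L
    )

    // Resume consumption exactly where the external store says we stopped,
    // regardless of what (if anything) is committed in Kafka itself.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Assign[String, String](fromOffsets.keys.toList, kafkaParams, fromOffsets)
    )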

This article from Cloudera is very good.

answered Nov 21 '25 by senseiwu


As mentioned in the Spark Streaming + Kafka Integration Guide, the way you store and commit offsets depends on how strict your reliability requirements are.

You have a couple of options depending on which streaming API you use.

  1. DStream

The first and simplest option is to configure an external checkpoint location for stashing your data and consumer offsets. It allows you to easily recover your Spark job after failures and to produce idempotent output (handy when you write data to files). When you use DStreams, you should disable enable.auto.commit.
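A minimal sketch of that setup, assuming a hypothetical checkpoint directory and batch interval; on restart, getOrCreate rebuilds the context (including stored offsets) from the checkpoint if one exists:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///tmp/app-checkpoint"   // hypothetical path

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("kafka-dstream-app")
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir)
      // ... create the Kafka direct stream and the processing graph here,
      // with "enable.auto.commit" -> (false: java.lang.Boolean) in kafkaParams ...
      ssc
    }

    // Recover from the checkpoint if present, otherwise build a fresh context.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()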

Alternatively, you may commit offsets manually, either to Kafka or to your own storage (see the examples in the link above). In this case you are responsible for making your outputs idempotent.
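A sketch of the manual-commit pattern described in the guide, assuming stream is the direct stream from the earlier snippet; offsets are committed back to Kafka's __consumer_offsets topic only after the batch's output has been written:

    import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

    stream.foreachRDD { rdd =>
      // The exact offset ranges that make up this batch.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      // ... write the batch's output here; it must be idempotent, because a
      // failure after the write but before the commit causes a replay ...

      // Commit back to Kafka only once the output is safely stored.
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }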

  2. Structured Streaming

Here you have no option other than storing offsets in a checkpoint directory (e.g. on HDFS). See the Structured Streaming + Kafka Integration Guide (the same for Spark 2.2.x and 2.3.0).
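A minimal sketch with hypothetical broker, topic, and paths; the checkpointLocation option is where Structured Streaming keeps the offsets and query metadata:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("kafka-structured").getOrCreate()

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")   // hypothetical broker
      .option("subscribe", "events")                      // hypothetical topic
      .load()

    // Spark tracks its own offsets in the checkpoint directory; committed
    // Kafka offsets are not used for recovery.
    val query = df.writeStream
      .format("parquet")
      .option("path", "hdfs:///data/events")              // hypothetical output path
      .option("checkpointLocation", "hdfs:///chk/events")
      .start()

    query.awaitTermination()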

answered Nov 21 '25 by zavyrylin