We have a Kafka cluster and a Spark Streaming consumer. Currently, the consumer stores its offsets on its own side, in an external data store. Given that recent Kafka versions (which is what we use) can store consumer offsets at the broker level (in the __consumer_offsets topic), what is the reasoning for storing them on the consumer side?
One argument would be that if the Kafka cluster goes down, we still have the offset information. But if the Kafka cluster goes down, the messages themselves are lost too, so no message can be replayed for a given offset anyway.
I am probably missing something obvious, but I can't figure it out. Thanks.
As I understand it, the core question you want answered is:
One argument would be that if the Kafka cluster goes down, we still have the offset information. But if the Kafka cluster goes down, the messages themselves are lost too, so no message can be replayed for a given offset anyway.
Storing offset ranges externally allows a Spark Streaming application to restart and replay messages from any point in time, as long as those messages are still retained in Kafka. So the decision to store offsets externally is likely not just about recovery scenarios, but a more general one.
This Cloudera blog post on Kafka offset management is a very good reference.
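To make the "replay from any point" idea concrete, here is a minimal sketch (not your actual code) of restarting a direct stream from externally stored offsets with the spark-streaming-kafka-0-10 API; the loadOffsetsFromStore() helper, broker address, topic, and group id are hypothetical placeholders:

```scala
// Sketch: restart a direct stream from offsets saved in an external store.
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Assign

val conf = new SparkConf().setAppName("replay-from-stored-offsets")
val ssc  = new StreamingContext(conf, Seconds(10))

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker1:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "my-consumer-group",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

// Hypothetical helper: replace with a read from your own data store (HBase, an RDBMS, ZooKeeper, ...).
def loadOffsetsFromStore(): Map[TopicPartition, Long] =
  Map(new TopicPartition("my-topic", 0) -> 0L)

val fromOffsets: Map[TopicPartition, Long] = loadOffsetsFromStore()

// Assign starts each partition exactly where the external store says processing stopped.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Assign[String, String](fromOffsets.keys.toList, kafkaParams, fromOffsets)
)
```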
As mentioned in the Spark Streaming + Kafka Integration Guide, how you store and commit offsets depends on how strict your reliability requirements are.
You have a couple of options, depending on which of the streaming APIs you use.
With the DStream API, the first and simplest option is to configure an external checkpoint location for stashing your data and consumer offsets. It allows you to easily recover your Spark application after failures and works well when your outputs are idempotent (handy when you write data to files). When you use DStreams, you should set enable.auto.commit to false, as in the sketch below.
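A minimal sketch of that checkpoint-based setup, assuming an HDFS checkpoint directory and placeholder broker, topic, and output paths:

```scala
// Sketch: DStream checkpoint-based recovery with auto-commit disabled.
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val checkpointDir = "hdfs:///checkpoints/my-app"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("checkpointed-consumer")
  val ssc  = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)   // offsets and generated RDD metadata are stored here

  val kafkaParams = Map[String, Object](
    "bootstrap.servers"  -> "broker1:9092",
    "key.deserializer"   -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "group.id"           -> "my-consumer-group",
    "auto.offset.reset"  -> "latest",
    "enable.auto.commit" -> (false: java.lang.Boolean)  // let the checkpoint, not Kafka, track progress
  )

  val stream = KafkaUtils.createDirectStream[String, String](
    ssc, PreferConsistent, Subscribe[String, String](Seq("my-topic"), kafkaParams))

  stream.map(record => record.value).saveAsTextFiles("hdfs:///output/events")
  ssc
}

// On restart, getOrCreate rebuilds the context (including offsets) from the checkpoint if it exists.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```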
Alternatively, you can commit offsets manually, either back to Kafka or to your own storage (see the examples in the guide linked above, and the sketch below). In this case you are responsible for making your outputs idempotent.
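The pattern documented in the integration guide for committing back to Kafka looks roughly like this (reusing the stream variable from the sketches above; the output path is a placeholder):

```scala
// Sketch: commit offsets to Kafka manually, only after the batch output has been written.
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  // Capture the offset ranges before any transformation changes the partitioning.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // Write the batch somewhere idempotent (files, upserts keyed by offset, etc.).
  rdd.map(_.value).saveAsTextFile(s"hdfs:///output/batch-${rdd.id}")

  // Only after the output succeeds, commit the offsets back to Kafka (asynchronously).
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
```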
With Structured Streaming, you have no option other than storing offsets in a checkpoint directory (e.g. on HDFS). See the Structured Streaming + Kafka Integration Guide (the same applies to Spark 2.2.x and 2.3.0).
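A minimal Structured Streaming sketch, with placeholder broker, topic, and paths, showing where the checkpointLocation is configured:

```scala
// Sketch: Structured Streaming reads from Kafka; progress (including offsets) lives in the checkpoint.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("structured-kafka").getOrCreate()

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "my-topic")
  .load()

val query = df.selectExpr("CAST(value AS STRING)")
  .writeStream
  .format("parquet")
  .option("path", "hdfs:///output/events")
  // Structured Streaming tracks its Kafka offsets here, not in Kafka itself.
  .option("checkpointLocation", "hdfs:///checkpoints/structured-app")
  .start()

query.awaitTermination()
```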