I'm new to Kafka, and our team is investigating patterns for inter-service communication.
The goal
We have two services, P (Producer) and C (Consumer). P is the source of truth for a set of data that C needs. When C starts up it needs to load all of the current data from P into its cache, and then subscribe to change notifications. (In other words, we want to synchronize data between the services.)
The total amount of data is relatively low, and changes are infrequent. A brief delay in synchronization is acceptable (eventual consistency).
We want to decouple the services so that P and C do not need to know about each other.
The proposal
When P starts up, it publishes all of its data to a Kafka topic that has log compaction enabled. Each message is an aggregate with a key of its ID.
When C starts up, it reads all of the messages from the beginning of the topic and populates its cache. It then keeps reading from its offset to be notified of updates.
When P updates its data, it publishes a message for the aggregate that changed. (This message has the same schema as the original messages.)
When C receives a new message, it updates the corresponding data in its cache.

Constraints
We are using the Confluent REST Proxy to communicate with Kafka.
The issue
When C starts up, how does it know when it's read all of the messages from the topic so that it can safely start processing?
It's acceptable if C does not immediately notice a message that P sent a second ago. It's not acceptable if C starts processing before consuming a message that P sent an hour ago. Note that we don't know when updates to P's data will occur.
We do not want C to have to wait for the REST Proxy's poll interval after consuming each message.
The Kafka REST Proxy provides a RESTful interface to a Kafka cluster. It makes it easy to produce and consume messages, view the state of the cluster, and perform administrative actions without using the native Kafka protocol or clients.
The Kafka REST Proxy is a RESTful web API that allows your application to send and receive messages using HTTP rather than TCP. It can be used to produce data to and consume data from Kafka or for executing queries on cluster configuration.
Of course it's open source and Apache 2.0 licensed. Show activity on this post. You can use the Confluent REST Proxy with no software/licensing costs.
If you would like to find the end partitions of a consumer group, in order to know when you've gotten all data at a point in time, you can use
POST /consumers/(string: group_name)/instances/(string: instance)/positions/end
Note that you must do a poll (GET /consumers/.../records) before that seek, but you don't need to commit.
If you don't want to affect the offsets of your existing consumer group, you would have to post a separate one.
You can then query offsets with
GET /consumers/(string: group_name)/instances/(string: instance)/offsets
Note that there might be data being written to the topic between calculating the end offsets and actually reaching the end, so you might want to have some additional settings to do a few more consumptions once you finally do reach the end.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With