I have a question about Kafka's auto-commit mechanism. I'm using Spring-Kafka with auto-commit enabled. As an experiment, I disconnected my consumer's connection to Kafka for 30 seconds while the system was idle (no new messages in the topic, no messages being processed). After reconnecting I got a few messages like this:
Asynchronous auto-commit of offsets {cs-1915-2553221872080030-0=OffsetAndMetadata{offset=19, leaderEpoch=0, metadata=''}} failed: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
First, I don't understand what is there to commit? The system was idle (all previous messages were already committed). Second, the disconnection time was 30 seconds, much less than the 5 minutes (300000 ms) of max.poll.interval.ms. Third, in an uncontrolled failure of Kafka I got at least 30K messages of this type, which was only resolved by restarting the process. Why is this happening?
Here is my consumer configuration:
allow.auto.create.topics = true
auto.commit.interval.ms = 100
auto.offset.reset = latest
bootstrap.servers = [kafka1-eu.dev.com:9094, kafka2-eu.dev.com:9094, kafka3-eu.dev.com:9094]
check.crcs = true
client.dns.lookup = default
client.id =
client.rack =
connections.max.idle.ms = 540000
default.api.timeout.ms = 60000
enable.auto.commit = true
exclude.internal.topics = true
fetch.max.bytes = 52428800
fetch.max.wait.ms = 500
fetch.min.bytes = 1
group.id = feature-cs-1915-2553221872080030
group.instance.id = null
heartbeat.interval.ms = 3000
interceptor.classes = []
internal.leave.group.on.close = true
isolation.level = read_uncommitted
key.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
max.partition.fetch.bytes = 1048576
max.poll.interval.ms = 300000
max.poll.records = 500
metadata.max.age.ms = 300000
metric.reporters = []
metrics.num.samples = 2
metrics.recording.level = INFO
metrics.sample.window.ms = 30000
partition.assignment.strategy = [class org.apache.kafka.clients.consumer.RangeAssignor]
receive.buffer.bytes = 65536
reconnect.backoff.max.ms = 1000
reconnect.backoff.ms = 50
request.timeout.ms = 30000
retry.backoff.ms = 100
sasl.client.callback.handler.class = null
sasl.jaas.config = null
sasl.kerberos.kinit.cmd = /usr/bin/kinit
sasl.kerberos.min.time.before.relogin = 60000
sasl.kerberos.service.name = null
sasl.kerberos.ticket.renew.jitter = 0.05
sasl.kerberos.ticket.renew.window.factor = 0.8
sasl.login.callback.handler.class = null
sasl.login.class = null
sasl.login.refresh.buffer.seconds = 300
sasl.login.refresh.min.period.seconds = 60
sasl.login.refresh.window.factor = 0.8
sasl.login.refresh.window.jitter = 0.05
sasl.mechanism = GSSAPI
security.protocol = SSL
send.buffer.bytes = 131072
session.timeout.ms = 15000
ssl.cipher.suites = null
ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
ssl.endpoint.identification.algorithm = https
ssl.key.password = [hidden]
ssl.keymanager.algorithm = SunX509
ssl.keystore.location = /home/me/feature-2553221872080030.keystore
ssl.keystore.password = [hidden]
ssl.keystore.type = JKS
ssl.protocol = TLS
ssl.provider = null
ssl.secure.random.implementation = null
ssl.trustmanager.algorithm = PKIX
ssl.truststore.location = /home/me/feature-2553221872080030.truststore
ssl.truststore.password = [hidden]
ssl.truststore.type = JKS
value.deserializer = class org.springframework.kafka.support.serializer.ErrorHandlingDeserializer2
First, I don't understand what is there to commit?
You are right, there is nothing new to commit if no new data is flowing. However, with auto-commit enabled and your consumer still running (even without being able to connect to the broker), the poll method is still responsible for committing offsets.
With your interval of 100 ms (see auto.commit.interval.ms), the consumer keeps trying to asynchronously commit its (unchanged) offset position.
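To get a feeling for how often that can happen, here is a rough back-of-the-envelope sketch. It assumes at most one auto-commit attempt per auto.commit.interval.ms, which is a simplification of what the client really does; the class and method names are made up for this illustration.

```java
// Rough estimate only: assumes at most one auto-commit attempt per interval.
// AutoCommitAttemptEstimate and maxAttempts are hypothetical names, not Kafka API.
final class AutoCommitAttemptEstimate {

    // Upper bound on auto-commit attempts within a time window.
    static long maxAttempts(long windowMs, long autoCommitIntervalMs) {
        if (autoCommitIntervalMs <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        return windowMs / autoCommitIntervalMs;
    }

    public static void main(String[] args) {
        // 30 s outage with the 100 ms interval from the question
        System.out.println(maxAttempts(30_000, 100));   // prints 300
        // same outage with Kafka's default interval of 5000 ms
        System.out.println(maxAttempts(30_000, 5_000)); // prints 6
    }
}
```

So with a 100 ms interval, even a short outage leaves room for hundreds of commit attempts, and a longer outage for many more.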
Second, the disconnection time was 30 seconds, much less than the 5 minutes (300000 ms) max.poll.interval.ms
It is not max.poll.interval.ms that is causing the rebalance but rather the combination of your heartbeat.interval.ms and session.timeout.ms settings. Your consumer sends heartbeats from a background thread at the configured interval, in your case every 3 seconds. If the broker receives no heartbeat before the session timeout expires (in your case 15 seconds), it removes this client from the group and initiates a rebalance. Your 30-second disconnection is twice as long as that timeout, so the consumer was removed from the group even though max.poll.interval.ms was never exceeded.
More detailed descriptions of the configurations I mentioned are given in the Kafka documentation on Consumer Configs.
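As a small illustration of that comparison, the sketch below plugs the values from the question into plain java.util.Properties. It only compares the outage duration with the two timeouts and does not talk to a broker; the class and helper names are hypothetical, and the real group coordinator logic is more involved than this.

```java
import java.util.Properties;

// Simplified model: compares an outage duration against the configured timeouts.
// SessionTimeoutSketch and its methods are hypothetical names, not Kafka API.
final class SessionTimeoutSketch {

    private static long millis(Properties props, String key) {
        return Long.parseLong(props.getProperty(key));
    }

    // True if no heartbeat can arrive for longer than session.timeout.ms.
    static boolean exceedsSessionTimeout(Properties props, long disconnectMs) {
        return disconnectMs > millis(props, "session.timeout.ms");
    }

    // True if the outage alone is longer than max.poll.interval.ms.
    static boolean exceedsMaxPollInterval(Properties props, long disconnectMs) {
        return disconnectMs > millis(props, "max.poll.interval.ms");
    }

    // Values taken from the consumer configuration in the question.
    static Properties questionConfig() {
        Properties props = new Properties();
        props.setProperty("heartbeat.interval.ms", "3000");
        props.setProperty("session.timeout.ms", "15000");
        props.setProperty("max.poll.interval.ms", "300000");
        return props;
    }

    public static void main(String[] args) {
        Properties props = questionConfig();
        long disconnectMs = 30_000; // the 30-second experiment
        System.out.println("session timeout exceeded: "
                + exceedsSessionTimeout(props, disconnectMs));
        System.out.println("max poll interval exceeded: "
                + exceedsMaxPollInterval(props, disconnectMs));
    }
}
```

With the values from the question, only the session timeout check comes out true, which matches the explanation above.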
Third, in an uncontrolled failure of Kafka I got at least 30K messages of this type, which was resolved by restarting the process. Why is this happening?
That seems to be a combination of the first two answers: heartbeats cannot be sent, yet the consumer still tries to commit through the continuously called poll method.
As @GaryRussell mentioned in his comment, I would be careful about using enable.auto.commit and would rather take control of offset management yourself.
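A minimal sketch of the configuration side of that change, assuming you start from consumer properties like the ones in the question. The helper name is made up for this example. With auto-commit switched off, offsets are committed either by your own code (commitSync or commitAsync on the consumer) or, with Spring-Kafka, by the listener container according to its configured AckMode.

```java
import java.util.Properties;

// Derives a copy of the consumer properties with auto-commit disabled.
// ManualCommitConfigSketch and withManualCommits are hypothetical names.
final class ManualCommitConfigSketch {

    static Properties withManualCommits(Properties original) {
        Properties copy = new Properties();
        copy.putAll(original);
        // Stop the client from committing offsets on its own.
        copy.setProperty("enable.auto.commit", "false");
        // The interval has no effect once auto-commit is disabled.
        copy.remove("auto.commit.interval.ms");
        return copy;
    }

    public static void main(String[] args) {
        Properties fromQuestion = new Properties();
        fromQuestion.setProperty("enable.auto.commit", "true");
        fromQuestion.setProperty("auto.commit.interval.ms", "100");
        fromQuestion.setProperty("session.timeout.ms", "15000");

        Properties manual = withManualCommits(fromQuestion);
        System.out.println(manual.getProperty("enable.auto.commit"));
    }
}
```

The original properties object is left untouched, so the two variants can be compared side by side while experimenting.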