Tuning S3 file sizes for Kafka

I am trying to understand the flush.size and rotate.interval.ms configuration for the S3 connector in depth. I deployed the S3 connector and I am seeing file sizes ranging from 6 KB all the way to 30 MB. Can anyone suggest how to get files of roughly equal size?

Here are my settings: flush.size=200000 and rotate.interval.ms set to 10 minutes (600000 ms).
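
For reference, a minimal standalone S3 sink config with those two settings might look like the sketch below; the connector name, topic, bucket, and region are placeholders rather than our real values.

    # Hypothetical S3 sink connector config (name, topic, bucket, region are placeholders)
    name=s3-sink-example
    connector.class=io.confluent.connect.s3.S3SinkConnector
    topics=my-topic
    s3.bucket.name=my-bucket
    s3.region=us-east-1
    storage.class=io.confluent.connect.s3.storage.S3Storage
    format.class=io.confluent.connect.s3.format.json.JsonFormat
    # the two settings in question
    flush.size=200000
    rotate.interval.ms=600000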

We also tried rolling our own connector based on the example at https://github.com/canelmas/kafka-connect-field-and-time-partitioner, but we still can't get the files to come out around the same size.

1 Answer

The S3 Sink Connector writes data per Kafka partition, into the output partition path defined by partitioner.class.

Basically, the S3 Connector flushes its buffers when any of the following conditions is met (a config sketch follows the list):

  1. rotate.schedule.interval.ms: this much wall-clock time has passed since the file was opened.
  2. rotate.interval.ms: this much time has passed, as measured by the timestamp.extractor (for example, the record timestamps).

Note: this is helpful for clearing backlog data. Assume we rely on rotate.interval.ms and have 6 hours of delayed data: every time the extracted timestamps advance by 10 minutes, a flush is triggered, so the whole backlog drains within seconds. Conversely, if no data is flowing, the connector waits until the next record arrives whose timestamp crosses the rotate.interval.ms boundary.

  3. flush.size: if the data rate is high and the number of buffered records reaches flush.size before conditions 1 or 2 are met, the flush is triggered by size. If the data rate is low, flushes are triggered by conditions 1 or 2 instead.
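
As a rough illustration of these three triggers, a sketch of the relevant properties might look like this; the values are illustrative, not recommendations:

    # Sketch of the three flush triggers (values are illustrative)
    # 1. wall-clock rotation: flushes even when no new data arrives;
    #    requires the timezone property to be set
    rotate.schedule.interval.ms=600000
    timezone=UTC
    # 2. data-time rotation, measured by timestamp.extractor
    #    (Wallclock, Record, or RecordField)
    rotate.interval.ms=600000
    timestamp.extractor=Record
    # 3. size-based flush: triggers once this many records are buffered
    flush.size=200000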

In the case of the TimeBasedPartitioner:

  1. partition.duration.ms: defines the maximum time span of data flushed to S3 within a single encoded partition directory (see the sketch below).
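
Putting it together, a TimeBasedPartitioner sketch might look like the following; the hourly duration and path format are just one common choice, not the only one:

    # Sketch: hourly time-based partitioning (values are illustrative)
    partitioner.class=io.confluent.connect.storage.partitioner.TimeBasedPartitioner
    # each encoded partition directory covers at most one hour of data
    partition.duration.ms=3600000
    path.format='year'=YYYY/'month'=MM/'day'=dd/'hour'=HH
    locale=en-US
    timezone=UTC
    timestamp.extractor=Record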


