
How to process files using Spark Structured Streaming chunk by chunk?

I am processing a large number of files, and I want to process them chunk by chunk: let's say that during each batch, I want to process 50 files at a time.

How can I do that using Spark Structured Streaming?

I have seen that Jacek Laskowski (https://stackoverflow.com/users/1305344/jacek-laskowski) said in a similar question (Spark to process rdd chunk by chunk from json files and post to Kafka topic) that it is possible using Spark Structured Streaming, but I can't find any examples of it.

Thanks a lot,

asked by mahmoud mehdi

1 Answer

If you are using the File Source:

maxFilesPerTrigger: maximum number of new files to be considered in every trigger (default: no max)

spark
  .readStream
  .format("json")
  // note: file-based streaming sources also require an explicit schema via .schema(...)
  .option("maxFilesPerTrigger", 50)
  .load("/path/to/files")
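
To make this end-to-end, here is a minimal sketch of handling each 50-file micro-batch separately with foreachBatch. The schema fields, paths, and checkpoint location below are placeholders, not part of the original answer:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("chunked-file-processing").getOrCreate()

// File-based streaming sources need an explicit schema
// (unless spark.sql.streaming.schemaInference is enabled).
val jsonSchema = new StructType()
  .add("id", LongType)
  .add("payload", StringType)

val stream = spark.readStream
  .format("json")
  .schema(jsonSchema)
  .option("maxFilesPerTrigger", 50)  // read at most 50 new files per micro-batch
  .load("/path/to/files")

stream.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    // Each call sees only the rows read from that micro-batch's files,
    // so every chunk of (at most) 50 files is processed separately here.
    batch.write.mode("append").parquet(s"/path/to/output/batch=$batchId")
  }
  .option("checkpointLocation", "/path/to/checkpoint")
  .start()
  .awaitTermination()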

If using a Kafka Source, it would be similar, but with the maxOffsetsPerTrigger option.
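
For reference, a minimal sketch of the Kafka variant; the bootstrap servers and topic name are placeholders:

val kafkaStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092,host2:9092")
  .option("subscribe", "my-topic")
  .option("maxOffsetsPerTrigger", 5000)  // cap the number of records read per micro-batch
  .load()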

answered by bp2010


