I am processing a large number of files, and I want to process them chunk by chunk: in each batch, I would like to process 50 files at a time.
How can I do this with Spark Structured Streaming?
I have seen that Jacek Laskowski (https://stackoverflow.com/users/1305344/jacek-laskowski) said in a similar question (Spark to process rdd chunk by chunk from json files and post to Kafka topic) that it is possible with Spark Structured Streaming, but I can't find any examples of it.
Thanks a lot.
If you are using the File Source:
maxFilesPerTrigger: maximum number of new files to be considered in every trigger (default: no max)
spark
  .readStream
  .format("json")
  // a schema must be provided via .schema(...) unless spark.sql.streaming.schemaInference is enabled
  .option("maxFilesPerTrigger", 50)  // pick up at most 50 new files per micro-batch
  .load("/path/to/files")
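To treat each chunk of up to 50 files separately, you can combine this with foreachBatch, which hands you every micro-batch as a plain DataFrame. Below is a minimal sketch under a few assumptions: the schema (jsonSchema), the paths /path/to/files and /path/to/checkpoint, and the per-chunk logic in processChunk are placeholders to adapt to your case.

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types._

val spark = SparkSession.builder()
  .appName("ChunkedFileProcessing")
  .getOrCreate()

// Placeholder schema: replace with the actual structure of your JSON files.
// Streaming file sources need an explicit schema (or schema inference enabled).
val jsonSchema = new StructType()
  .add("id", LongType)
  .add("payload", StringType)

val stream = spark.readStream
  .format("json")
  .schema(jsonSchema)
  .option("maxFilesPerTrigger", 50)   // each micro-batch reads at most 50 new files
  .load("/path/to/files")

// Each micro-batch (i.e. one chunk of up to 50 files) arrives as an ordinary DataFrame,
// so you can apply normal batch logic to it.
val processChunk: (DataFrame, Long) => Unit = (batchDf, batchId) => {
  // Placeholder processing: replace with whatever treating a chunk means for you.
  println(s"Batch $batchId contains ${batchDf.count()} rows")
}

val query = stream.writeStream
  .foreachBatch(processChunk)
  .option("checkpointLocation", "/path/to/checkpoint")  // tracks which files were already processed
  .start()

query.awaitTermination()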
If using a Kafka Source it would be similar but with the maxOffsetsPerTrigger option.
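For reference, a rough sketch of the Kafka variant; the broker address and topic name below are placeholders, and note that maxOffsetsPerTrigger caps the number of records read per micro-batch rather than files:

val kafkaStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")  // placeholder broker list
  .option("subscribe", "my-topic")                  // placeholder topic name
  .option("maxOffsetsPerTrigger", 10000)            // at most 10000 records per micro-batch
  .load()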