I am able to develop a pipeline which reads from Kafka, does some transformations, and writes the output to a Kafka sink as well as a Parquet sink. I would like to add effective logging to log the intermediate results of the transformation, like in a regular streaming application.
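For context, here is a stripped-down sketch of the kind of pipeline I mean (the topic names, paths, and the trivial transformation are just placeholders for my actual logic):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("kafka-pipeline").getOrCreate()

// Source: read the raw stream from Kafka
val source = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "input-topic")
  .load()

// Transformation: for illustration, just cast the message payload to a string
val transformed = source.selectExpr("CAST(value AS STRING) AS value")

// Sink 1: write back to Kafka (the Kafka sink expects a "value" column)
val kafkaQuery = transformed.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "output-topic")
  .option("checkpointLocation", "/tmp/checkpoints/kafka")
  .start()

// Sink 2: write the same data out as Parquet files
val parquetQuery = transformed.writeStream
  .format("parquet")
  .option("path", "/tmp/output/parquet")
  .option("checkpointLocation", "/tmp/checkpoints/parquet")
  .start()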
One option I see is to log the query execution plan via
df.queryExecution.analyzed.numberedTreeString
or
logger.info("Query progress"+ query.lastProgress)
logger.info("Query status"+ query.status)
But these don't seem to provide a way to see the business-specific messages the stream is running on.
Is there a way I can add more logging info, such as the actual data being processed?
I found some options to track this. Basically, we can name a streaming query using df.writeStream.format("parquet").queryName("table1").
The query name table1 will then be printed in the Spark Jobs tab against the Completed Jobs list in the Spark UI, from which you can track the status of each streaming query.
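A minimal sketch of what that looks like (the output path and checkpoint location are placeholders, and logger is assumed to be whatever logging framework you already use):

// The queryName shows up in the Spark UI and in the progress objects below
val query = df.writeStream
  .format("parquet")
  .queryName("table1")
  .option("path", "/tmp/output/table1")
  .option("checkpointLocation", "/tmp/checkpoints/table1")
  .start()

// Per-trigger progress and current status for this named query
logger.info("Query progress: " + query.lastProgress)
logger.info("Query status: " + query.status)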
You can also use the ProgressReporter API in Structured Streaming to collect more stats.
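One way to consume those stats from your own code is a StreamingQueryListener registered on the session. A sketch, again assuming logger is your existing logging setup:

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryStartedEvent, QueryProgressEvent, QueryTerminatedEvent}

// Logs the lifecycle and per-trigger progress of every streaming query,
// including the name set via queryName(...)
class LoggingQueryListener extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit =
    logger.info(s"Query started: ${event.name} (id=${event.id})")

  override def onQueryProgress(event: QueryProgressEvent): Unit =
    // event.progress carries rows/sec, batch durations, source offsets, sink info, etc.
    logger.info("Query progress: " + event.progress.json)

  override def onQueryTerminated(event: QueryTerminatedEvent): Unit =
    logger.info(s"Query terminated: id=${event.id}, exception=${event.exception}")
}

// Register the listener before starting the queries
spark.streams.addListener(new LoggingQueryListener())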