Logging in Spark Structured Streaming

I have developed a pipeline that reads from Kafka, performs some transformations, and writes the output to a Kafka sink as well as a Parquet sink. I would like to add effective logging to record the intermediate results of the transformations, as I would in a regular streaming application.
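Roughly, the pipeline looks like this (the broker address, topic name, and paths are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("kafka-pipeline").getOrCreate()

// Read from a Kafka topic
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "input")
  .load()

// Example transformation: decode the message payload
val transformed = df.selectExpr("CAST(value AS STRING) AS value")

// Parquet sink (a Kafka sink would use .format("kafka") instead)
val query = transformed.writeStream
  .format("parquet")
  .option("path", "/tmp/output")
  .option("checkpointLocation", "/tmp/checkpoint")
  .start()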

One option I see is to log the query execution plan via

df.queryExecution.analyzed.numberedTreeString 

or

logger.info("Query progress"+ query.lastProgress)
logger.info("Query status"+ query.status)

But neither of these appears to expose the business-specific messages that the stream is actually processing.

Is there a way to add more logging information, such as the actual data being processed?

asked Oct 28 '25 by Ajith Kannan


1 Answer

I found some options for tracking this. Basically, we can name the streaming query using df.writeStream.format("parquet").queryName("table1").

The query name table1 will then appear in the Completed Jobs list on the Jobs tab of the Spark UI, from which you can track the status of each streaming query.
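For example (the output path and checkpoint location are placeholders):

df.writeStream
  .format("parquet")
  .queryName("table1")
  .option("path", "/tmp/table1")
  .option("checkpointLocation", "/tmp/table1-checkpoint")
  .start()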

Another option is to use the ProgressReporter API in Structured Streaming to collect more stats, as sketched below.
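The public entry point for this is the StreamingQueryListener API, which receives the progress data that the ProgressReporter produces. A minimal sketch, assuming the spark session and logger from the question:

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit =
    logger.info("Query started: " + event.id)
  override def onQueryProgress(event: QueryProgressEvent): Unit =
    // event.progress carries per-batch stats such as numInputRows
    logger.info("Query progress: " + event.progress.json)
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit =
    logger.info("Query terminated: " + event.id)
})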
answered Oct 31 '25 by Ajith Kannan


