How to monitor Spark job with Airflow

I set up a few DAGs, each of which eventually ends with a spark-submit command to a Spark cluster. I'm using cluster mode, if that makes a difference. My code works, but I realized that if the Spark job were to fail, I wouldn't necessarily know from within the Airflow UI. Because the job is submitted in cluster mode, spark-submit hands the driver off to a worker on the cluster and returns immediately, so Airflow has no knowledge of the Spark job's outcome.

How can I address this issue?

asked Oct 27 '25 by sdot257

2 Answers

Airflow (from version 1.8) has

SparkSqlOperator - https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/operators/spark_sql_operator.py
SparkSqlHook - https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/hooks/spark_sql_hook.py
SparkSubmitOperator - https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/operators/spark_submit_operator.py
SparkSubmitHook - https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/hooks/spark_submit_hook.py

If you use these, the Airflow task will fail when the Spark job fails. If you are on Spark 1.x, you may also have to change the logging section of spark_submit_hook.py to get real-time logs, because spark-submit writes even errors to stdout in some 1.x versions (I had to make changes for 1.6.1).
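For example, a minimal DAG using SparkSubmitOperator could look like the sketch below. The application path, connection id and schedule are placeholders you'd adapt to your setup; the underlying hook runs spark-submit as a subprocess and should fail the task on a non-zero exit code.

```python
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

dag = DAG(
    dag_id='spark_pipeline',
    start_date=datetime(2017, 1, 1),
    schedule_interval='@daily',
)

submit_job = SparkSubmitOperator(
    task_id='submit_spark_job',
    application='/path/to/my_spark_job.py',  # hypothetical application path
    conn_id='spark_default',                 # Spark connection defined in Airflow
    name='my_spark_job',
    verbose=True,                            # stream spark-submit output into the task log
    dag=dag,
)
```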

Also note that there have been many improvements to the SparkSubmitOperator since the last stable release.

answered Oct 30 '25 by Him

You can consider using client mode instead: in client mode the spark-submit process does not exit until the Spark job completes, so the Airflow executor can pick up the exit code.
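As a sketch, assuming spark-submit is on the worker's PATH (the master URL and application path below are made up), a plain BashOperator already gives you this: the task fails whenever the command exits non-zero.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id='spark_client_mode',
    start_date=datetime(2017, 1, 1),
    schedule_interval='@daily',
)

# In client mode the driver runs inside the spark-submit process, so the
# command blocks until the job finishes and its exit code reflects the result.
spark_submit = BashOperator(
    task_id='spark_submit_client_mode',
    bash_command=(
        'spark-submit '
        '--master spark://spark-master:7077 '  # hypothetical master URL
        '--deploy-mode client '
        '/path/to/my_spark_job.py'             # hypothetical application path
    ),
    dag=dag,
)
```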

Otherwise you may need to use a job server. Check out https://github.com/spark-jobserver/spark-jobserver
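With a job server you submit the job over HTTP and poll its status until it reaches a terminal state. The sketch below (which could run inside a PythonOperator) assumes a spark-jobserver instance at a made-up host; the app name and class path are placeholders, and the exact response shape varies between jobserver versions, so treat it as an outline rather than a drop-in script.

```python
import time

import requests

JOBSERVER = 'http://jobserver-host:8090'  # hypothetical host and port

# Submit a job; appName/classPath must match a jar previously uploaded
# to the jobserver (via POST /jars/<appName>).
resp = requests.post(
    '{}/jobs'.format(JOBSERVER),
    params={'appName': 'my-app', 'classPath': 'com.example.MyJob'},
)
body = resp.json()
# Depending on the jobserver version the job id is at the top level or
# nested under 'result'.
job_id = body.get('jobId') or body['result']['jobId']

# Poll until the job leaves its running state, then raise so the calling
# Airflow task fails on anything other than a clean finish.
while True:
    status = requests.get('{}/jobs/{}'.format(JOBSERVER, job_id)).json()['status']
    if status not in ('STARTED', 'RUNNING'):
        break
    time.sleep(10)

if status != 'FINISHED':
    raise RuntimeError('Spark job {} ended with status {}'.format(job_id, status))
```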

answered Oct 30 '25 by Derek Chan