See progress while "iterating" over Dataframe

I wonder if there is a better way to see whether PySpark is making progress while writing to a PL/SQL DB. Currently the only output I see while my code is running is:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/09/17 16:33:17 WARN JdbcUtils: Requested isolation level 1 is not supported; falling back to default isolation level 2
[Stage 3:=============================> (1 + 1) / 2]

This will stay the same for anywhere from one minute to one hour, depending on the dataframe size. Normally I would use progressbar2 or write a counter myself, but Spark works differently and does not "iterate" in the classic way, so I cannot wrap the UDF with the progressbar2 library.

The problem is that it is difficult to tell whether my program is just working through a large dataframe or someone has forgotten to commit to the SQL DB, because when PySpark is waiting for a commit it looks exactly the same. As you may have guessed, I have wasted plenty of time on this.

df_c = df_a.withColumn("new_col", my_udf(df_b["some_col"]))

It would be nice to see some sort of progress from PySpark while this step runs.

Tobias J. asked Feb 18 '26

1 Answer

You can check in the Spark UI what your Spark cluster is currently doing. There you can see whether Spark tasks are being completed or whether everything hangs. The default URL of the Spark UI is http://<driver-node>:4040.
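If you are not sure which address the UI is listening on, a running SparkContext exposes it via its uiWebUrl attribute, and the same context offers a StatusTracker that can print stage progress from the driver. Below is a minimal sketch, assuming an existing SparkSession called spark; you would poll it from a separate thread while the blocking JDBC write is running:

sc = spark.sparkContext
print("Spark UI:", sc.uiWebUrl)  # e.g. http://<driver-node>:4040

# Print progress of all currently active stages
tracker = sc.statusTracker()
for stage_id in tracker.getActiveStageIds():
    info = tracker.getStageInfo(stage_id)
    if info is not None:
        print(f"Stage {stage_id}: {info.numCompletedTasks}/{info.numTasks} tasks done, "
              f"{info.numActiveTasks} active, {info.numFailedTasks} failed")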

If you need the data in a more structured way, for example for automated processing, you can use the Spark UI's REST interface.
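For example, the /api/v1 endpoints can be polled with any HTTP client. A rough sketch using the requests library (the driver host and the exact JSON field names are assumptions; check the Spark monitoring docs for your version):

import requests

base = "http://localhost:4040/api/v1"  # replace localhost with your driver node

# Look up the running application, then list its stages
app_id = requests.get(f"{base}/applications").json()[0]["id"]
for stage in requests.get(f"{base}/applications/{app_id}/stages").json():
    if stage.get("status") == "ACTIVE":
        print(f"Stage {stage.get('stageId')}: "
              f"{stage.get('numCompleteTasks')}/{stage.get('numTasks')} tasks complete")

If the completed-task count stops increasing for a long time even though the job has not finished, that is a hint that the tasks are blocked on something external, such as an uncommitted transaction on the database side.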

werner answered Feb 21 '26

