I'm working with two PySpark DataFrames and doing a left anti join on them to track day-to-day changes and then send an email.
The first time I tried:
diff = Table_a.join(
    Table_b,
    [Table_a.col1 == Table_b.col1, Table_a.col2 == Table_b.col2],
    how='left_anti'
)
The expected output is a PySpark DataFrame with some rows or none.
This diff DataFrame gets its schema from Table_a. The first time I ran it, it showed no data, as expected, along with the schema. From the next run onwards it just throws a SparkException:
Exception thrown in Future.get
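For reference, here is a self-contained sketch of this daily-diff pattern (the my_schema.table_a / my_schema.table_b names and the email step are placeholders, not the real job):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

Table_a = spark.table("my_schema.table_a")  # today's snapshot (assumed name)
Table_b = spark.table("my_schema.table_b")  # yesterday's snapshot (assumed name)

# Rows of Table_a that have no match in Table_b on (col1, col2)
diff = Table_a.join(
    Table_b,
    [Table_a.col1 == Table_b.col1, Table_a.col2 == Table_b.col2],
    how='left_anti'
)

if len(diff.take(1)) > 0:      # cheap "is there anything new?" check
    diff.show(truncate=False)  # build and send the notification email from diff here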
I use Scala, but, in my experience, this happens when one of the underlying tables has changed somehow. My advice would be to simply run
display(Table_a)
and display(Table_b)
and see if either of those commands fails. That should give you a hint about where the problem is.
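If you are not on Databricks (so no display()), a plain PySpark smoke test does the same job; whichever table fails first is the suspect (just a sketch, not specific to your setup):

# Force a tiny read of each underlying table; the one that raises is the culprit
for name, df in [("Table_a", Table_a), ("Table_b", Table_b)]:
    try:
        df.limit(1).count()
        print(f"{name}: OK")
    except Exception as err:  # broad catch is fine here, this is only a diagnosis
        print(f"{name}: FAILED -> {err}")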
In any case, to actually solve the issue, my advice would be to clear the cache by running
%sql
REFRESH TABLE my_schema.table_a
REFRESH TABLE my_schema.table_b
and then redefining those variables, as in
Table_a = spark.table("my_schema.table_a")
Table_b = spark.table("my_schema.table_b")
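If you would rather stay in a Python cell than use %sql, the same refresh can be done from PySpark (a sketch; refreshTable only invalidates the cached data and metadata for that table, it does not eagerly reload anything):

# Same effect as the %sql cell above, issued from Python
spark.sql("REFRESH TABLE my_schema.table_a")
spark.sql("REFRESH TABLE my_schema.table_b")

# Or via the catalog API
spark.catalog.refreshTable("my_schema.table_a")
spark.catalog.refreshTable("my_schema.table_b")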
This worked for me - hope it helps you too.
Thank you @Lucas Lima. Every time I create a new table, I cache it with the following command in PySpark:
table_a.cache()
Hope the information helps.
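One small caveat: cache() only marks the DataFrame to be kept in memory after the next action; it does not clear anything. If the goal is to avoid stale data after the underlying table changes, explicitly dropping or refreshing the cache is what helps (a sketch, assuming the same table names as above):

table_a.unpersist()                              # drop this DataFrame's cached blocks, if any
spark.catalog.clearCache()                       # or drop everything cached in this session
spark.catalog.refreshTable("my_schema.table_a")  # and invalidate the table's cached metadata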