I have a Hive table and I am trying to insert data into it.
The data comes from SQL, but I don't want to insert rows whose id already exists in the Hive table. I am trying to apply the same kind of condition as where not exists. I am using PySpark on Airflow.
The DataFrame API has no exists operator, but two join types can replace it: left_anti and left_semi.
For example, to insert a dataframe df into a Hive table target, you can do:
    new_df = df.join(
        spark.table("target"),
        how='left_anti',
        on='id'
    )
Then you write new_df to your table.
left_anti keeps only the rows of df that have no match in target on the join condition (the equivalent of not exists). The equivalent of exists is left_semi, which keeps only the rows that do have a match.
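To make the semantics concrete, here is a plain-Python sketch (no Spark involved) of what the two join types keep. The helper functions and the sample rows are made up for illustration; they only mimic the behaviour of an equality join on id, including the fact that a null key never matches.

```python
# Plain-Python illustration of left_semi / left_anti semantics.
# These helpers are hypothetical and are NOT part of PySpark; the rows
# below are made-up sample data.

def left_semi(left, right, key):
    """Keep left rows whose key has a match in right (like EXISTS)."""
    right_keys = {row[key] for row in right if row[key] is not None}
    # A None key never matches, mirroring SQL equality on NULL.
    return [row for row in left if row[key] is not None and row[key] in right_keys]

def left_anti(left, right, key):
    """Keep left rows whose key has no match in right (like NOT EXISTS)."""
    right_keys = {row[key] for row in right if row[key] is not None}
    return [row for row in left if row[key] is None or row[key] not in right_keys]

df = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 3, "name": "c"}]
target = [{"id": 1, "name": "a"}, {"id": 2, "name": "old"}]

new_rows = left_anti(df, target, "id")       # rows not yet in target
existing_rows = left_semi(df, target, "id")  # rows already in target

print(new_rows)
print(existing_rows)
```

Only the columns of the left side are returned in both cases, and a left row appears at most once even if several rows on the right share its id.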