I have two pyspark dataframes like below -
df1
id city country region continent
1 chicago USA NA NA
2 houston USA NA NA
3 Sydney Australia AU AU
4 London UK EU EU
df2
id city country region continent
1 chicago USA NA NA
2 houston USA NA NA
3 Paris France EU EU
5 London UK EU EU
I want to find out the rows which exists in df2 but not in df1 based on all column values. So df2 - df1 should result in df_result like below
df_result
id city country region continent
3 Paris France EU EU
5 London UK EU EU
How can I achieve it in pyspark. Thanks in advance
You can use a left_anti join:
df2.join(df1, on = ["id", "city", "country"], how = "left_anti").show()
+---+------+-------+------+---------+
| id| city|country|region|continent|
+---+------+-------+------+---------+
| 3| Paris| France| EU| EU|
| 5|London| UK| EU| EU|
+---+------+-------+------+---------+
If all columns have non-null values:
df2.join(df1, on = df2.schema.names, how = "left_anti").show()
One more easy solution would be using exceptAll() function. Doc says-
Return a new SparkDataFrame containing rows in this SparkDataFrame but not in another SparkDataFrame while preserving the duplicates. This is equivalent to EXCEPT ALL in SQL. Also as standard in SQL, this function resolves columns by position (not by name)
df_a = spark.createDataFrame([(1,"chicago","USA","NA","NA"),(2,"houston","USA","NA","NA"),(3,"Sydney","Australia","AU","AU"),(4,"London","UK","EU","EU")],[ "id","city","country","region","continent"])
df_a.show(truncate=False)
df_b = spark.createDataFrame([(1,"chicago","USA","NA","NA"),(2,"houston","USA","NA","NA"),(3,"Paris","France","EU","EU"),(5,"London","UK","EU","EU")],[ "id","city","country","region","continent"])
df_b.show(truncate=False)
df_a
+---+-------+---------+------+---------+
|id |city |country |region|continent|
+---+-------+---------+------+---------+
|1 |chicago|USA |NA |NA |
|2 |houston|USA |NA |NA |
|3 |Sydney |Australia|AU |AU |
|4 |London |UK |EU |EU |
+---+-------+---------+------+---------+
df_b
+---+-------+-------+------+---------+
|id |city |country|region|continent|
+---+-------+-------+------+---------+
|1 |chicago|USA |NA |NA |
|2 |houston|USA |NA |NA |
|3 |Paris |France |EU |EU |
|5 |London |UK |EU |EU |
+---+-------+-------+------+---------+
df_final = df_b.exceptAll(df_a)
df_final.show()
+---+------+-------+------+---------+
| id| city|country|region|continent|
+---+------+-------+------+---------+
| 3| Paris| France| EU| EU|
| 5|London| UK| EU| EU|
+---+------+-------+------+---------+
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With