PySpark: subtracting/difference of DataFrames based on all columns

I have two PySpark DataFrames, shown below:

df1

id    city       country      region    continent
1     chicago    USA          NA        NA
2     houston    USA          NA        NA
3     Sydney     Australia    AU        AU
4     London     UK           EU        EU

df2

id    city       country      region    continent
1     chicago    USA          NA        NA
2     houston    USA          NA        NA
3     Paris      France       EU        EU
5     London     UK           EU        EU

I want to find the rows that exist in df2 but not in df1, comparing on all column values. So df2 - df1 should produce df_result as below:

df_result

id    city      country    region    continent
3     Paris     France     EU        EU
5     London    UK         EU        EU

How can I achieve this in PySpark? Thanks in advance.

Asked Nov 30 '25 by Codegator

2 Answers

You can use a left_anti join:

df2.join(df1, on=["id", "city", "country"], how="left_anti").show()

+---+------+-------+------+---------+
| id|  city|country|region|continent|
+---+------+-------+------+---------+
|  3| Paris| France|    EU|       EU|
|  5|London|     UK|    EU|       EU|
+---+------+-------+------+---------+

If all columns are guaranteed non-null, you can join on every column (with a plain equality join, null = null does not match, so rows containing nulls would wrongly survive the anti-join):

df2.join(df1, on=df2.schema.names, how="left_anti").show()
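If nulls are possible, one workaround (a sketch, not part of the original answer) is a null-safe join condition built with Column.eqNullSafe, which treats null == null as a match:

from functools import reduce

# Build one condition that null-safely compares every column pair.
cond = reduce(
    lambda a, b: a & b,
    [df2[c].eqNullSafe(df1[c]) for c in df2.schema.names],
)
df2.join(df1, on=cond, how="left_anti").show()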
Answered Dec 02 '25 by Cena



Another easy solution is the exceptAll() function. The documentation says:

Return a new DataFrame containing rows in this DataFrame but not in another DataFrame while preserving duplicates. This is equivalent to EXCEPT ALL in SQL. As standard in SQL, this function resolves columns by position (not by name).
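Because columns are resolved by position, both DataFrames should have the same column order. A minimal guard (hypothetical, assuming both sides share the same column names) is to reorder one side first:

# Align df_a's column order to df_b's before taking the difference.
df_a = df_a.select(*df_b.columns)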

Create the DataFrames:

df_a = spark.createDataFrame([(1, "chicago", "USA", "NA", "NA"), (2, "houston", "USA", "NA", "NA"), (3, "Sydney", "Australia", "AU", "AU"), (4, "London", "UK", "EU", "EU")], ["id", "city", "country", "region", "continent"])
df_a.show(truncate=False)
df_b = spark.createDataFrame([(1, "chicago", "USA", "NA", "NA"), (2, "houston", "USA", "NA", "NA"), (3, "Paris", "France", "EU", "EU"), (5, "London", "UK", "EU", "EU")], ["id", "city", "country", "region", "continent"])
df_b.show(truncate=False)

df_a

+---+-------+---------+------+---------+
|id |city   |country  |region|continent|
+---+-------+---------+------+---------+
|1  |chicago|USA      |NA    |NA       |
|2  |houston|USA      |NA    |NA       |
|3  |Sydney |Australia|AU    |AU       |
|4  |London |UK       |EU    |EU       |
+---+-------+---------+------+---------+

df_b

+---+-------+-------+------+---------+
|id |city   |country|region|continent|
+---+-------+-------+------+---------+
|1  |chicago|USA    |NA    |NA       |
|2  |houston|USA    |NA    |NA       |
|3  |Paris  |France |EU    |EU       |
|5  |London |UK     |EU    |EU       |
+---+-------+-------+------+---------+

Final output

df_final = df_b.exceptAll(df_a)
df_final.show()
+---+------+-------+------+---------+
| id|  city|country|region|continent|
+---+------+-------+------+---------+
|  3| Paris| France|    EU|       EU|
|  5|London|     UK|    EU|       EU|
+---+------+-------+------+---------+
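If duplicates don't need preserving, a related option (not part of the original answer) is subtract(), the EXCEPT DISTINCT counterpart:

# subtract() is equivalent to EXCEPT DISTINCT in SQL: duplicate rows
# are collapsed before the difference is taken.
df_b.subtract(df_a).show()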
Answered Dec 02 '25 by dsk


