PySpark: subtracting/difference of DataFrames based on all columns

I have two PySpark DataFrames, shown below:

df1

id    city       country      region    continent
1     chicago    USA          NA        NA
2     houston    USA          NA        NA
3     Sydney     Australia    AU        AU
4     London     UK           EU        EU

df2

id    city       country      region    continent
1     chicago    USA          NA        NA
2     houston    USA          NA        NA
3     Paris      France       EU        EU
5     London     UK           EU        EU

I want to find the rows that exist in df2 but not in df1, comparing on all column values. So df2 - df1 should produce df_result as below:

df_result

id    city      country    region    continent
3     Paris     France     EU        EU
5     London    UK         EU        EU

How can I achieve this in PySpark? Thanks in advance.

Asked Nov 30 '25 by Codegator

2 Answers

You can use a left_anti join:

df2.join(df1, on=["id", "city", "country"], how="left_anti").show()

+---+------+-------+------+---------+
| id|  city|country|region|continent|
+---+------+-------+------+---------+
|  3| Paris| France|    EU|       EU|
|  5|London|     UK|    EU|       EU|
+---+------+-------+------+---------+

If all columns are guaranteed non-null, you can join on every column (with a plain equality join, null = null does not match, so rows containing nulls would wrongly survive the anti-join):

df2.join(df1, on=df2.schema.names, how="left_anti").show()
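If nulls are possible, one workaround (a sketch, not part of the original answer) is a null-safe join condition built with Column.eqNullSafe, which treats null == null as a match:

from functools import reduce

# Build one condition that null-safely compares every column pair.
cond = reduce(
    lambda a, b: a & b,
    [df2[c].eqNullSafe(df1[c]) for c in df2.schema.names],
)
df2.join(df1, on=cond, how="left_anti").show()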
Answered Dec 02 '25 by Cena



Another easy solution is the exceptAll() function. The documentation says:

Return a new DataFrame containing rows in this DataFrame but not in another DataFrame while preserving duplicates. This is equivalent to EXCEPT ALL in SQL. As standard in SQL, this function resolves columns by position (not by name).
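Because columns are resolved by position, both DataFrames should have the same column order. A minimal guard (hypothetical, assuming both sides share the same column names) is to reorder one side first:

# Align df_a's column order to df_b's before taking the difference.
df_a = df_a.select(*df_b.columns)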

Create the DataFrames:

df_a = spark.createDataFrame([(1, "chicago", "USA", "NA", "NA"), (2, "houston", "USA", "NA", "NA"), (3, "Sydney", "Australia", "AU", "AU"), (4, "London", "UK", "EU", "EU")], ["id", "city", "country", "region", "continent"])
df_a.show(truncate=False)
df_b = spark.createDataFrame([(1, "chicago", "USA", "NA", "NA"), (2, "houston", "USA", "NA", "NA"), (3, "Paris", "France", "EU", "EU"), (5, "London", "UK", "EU", "EU")], ["id", "city", "country", "region", "continent"])
df_b.show(truncate=False)

df_a

+---+-------+---------+------+---------+
|id |city   |country  |region|continent|
+---+-------+---------+------+---------+
|1  |chicago|USA      |NA    |NA       |
|2  |houston|USA      |NA    |NA       |
|3  |Sydney |Australia|AU    |AU       |
|4  |London |UK       |EU    |EU       |
+---+-------+---------+------+---------+

df_b

+---+-------+-------+------+---------+
|id |city   |country|region|continent|
+---+-------+-------+------+---------+
|1  |chicago|USA    |NA    |NA       |
|2  |houston|USA    |NA    |NA       |
|3  |Paris  |France |EU    |EU       |
|5  |London |UK     |EU    |EU       |
+---+-------+-------+------+---------+

Final output

df_final = df_b.exceptAll(df_a)
df_final.show()
+---+------+-------+------+---------+
| id|  city|country|region|continent|
+---+------+-------+------+---------+
|  3| Paris| France|    EU|       EU|
|  5|London|     UK|    EU|       EU|
+---+------+-------+------+---------+
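If duplicates don't need preserving, a related option (not part of the original answer) is subtract(), the EXCEPT DISTINCT counterpart:

# subtract() is equivalent to EXCEPT DISTINCT in SQL: duplicate rows
# are collapsed before the difference is taken.
df_b.subtract(df_a).show()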
Answered Dec 02 '25 by dsk


