I have two dataframes, df1 and df2:
df1.show()
+---+--------+-----+----+--------+
|cA | cB | cC | cD | cE |
+---+--------+-----+----+--------+
| A| abc | 0.1 | 0.0| 0 |
| B| def | 0.15| 0.5| 0 |
| C| ghi | 0.2 | 0.2| 1 |
| D| jkl | 1.1 | 0.1| 0 |
| E| mno | 0.1 | 0.1| 0 |
+---+--------+-----+----+--------+
df2.show()
+---+--------+-----+----+--------+
|cA | cB | cH | cI | cJ |
+---+--------+-----+----+--------+
| A| abc | a | b | ? |
| C| ghi | a | c | ? |
+---+--------+-----+----+--------+
I would like to update the cE column in df1, setting it to 1 if the row is referenced in df2. Each record is identified by the cA and cB columns.
Below is the desired output; note that the cE value of the first record was updated to 1:
+---+--------+-----+----+--------+
|cA | cB | cC | cD | cE |
+---+--------+-----+----+--------+
| A| abc | 0.1 | 0.0| 1 |
| B| def | 0.15| 0.5| 0 |
| C| ghi | 0.2 | 0.2| 1 |
| D| jkl | 1.1 | 0.1| 0 |
| E| mno | 0.1 | 0.1| 0 |
+---+--------+-----+----+--------+
When you need to update a column value based on other columns, the when clause comes in handy. Refer to the documentation for when and otherwise.
import pyspark.sql.functions as F

# Left join keeps exactly df1's rows; a full join would also pull in
# df2-only keys as rows with null df1 columns. A matched row has a
# non-null df2.cA, so cE becomes 1; unmatched rows fall through to 0.
df3 = (df1.join(df2, (df1.cA == df2.cA) & (df1.cB == df2.cB), "left")
          .withColumn('cE', F.when(df2.cA.isNotNull(), 1).otherwise(0))
          .select(df1.cA, df1.cB, df1.cC, df1.cD, 'cE'))
df3.show()
+---+---+----+---+---+
| cA| cB| cC| cD| cE|
+---+---+----+---+---+
| E|mno| 0.1|0.1| 0|
| B|def|0.15|0.5| 0|
| C|ghi| 0.2|0.2| 1|
| A|abc| 0.1|0.0| 1|
| D|jkl| 1.1|0.1| 0|
+---+---+----+---+---+
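The underlying logic is just a key-membership test: flag a df1 row when its (cA, cB) pair appears in df2. A minimal plain-Python sketch of the same semantics (rows and keys hard-coded from the example above, no Spark required):

```python
# df1 rows and the (cA, cB) key pairs present in df2, taken from the example.
df1_rows = [
    {"cA": "A", "cB": "abc", "cC": 0.1,  "cD": 0.0, "cE": 0},
    {"cA": "B", "cB": "def", "cC": 0.15, "cD": 0.5, "cE": 0},
    {"cA": "C", "cB": "ghi", "cC": 0.2,  "cD": 0.2, "cE": 1},
    {"cA": "D", "cB": "jkl", "cC": 1.1,  "cD": 0.1, "cE": 0},
    {"cA": "E", "cB": "mno", "cC": 0.1,  "cD": 0.1, "cE": 0},
]
df2_keys = {("A", "abc"), ("C", "ghi")}

# Recompute cE: 1 when the row's key is referenced in df2, else 0.
updated = [
    {**row, "cE": 1 if (row["cA"], row["cB"]) in df2_keys else 0}
    for row in df1_rows
]
print([r["cE"] for r in updated])  # [1, 0, 1, 0, 0]
```

This mirrors what the join + when/otherwise does in a distributed way: the join locates matching keys, and the conditional column rewrites cE based on whether a match was found.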