I have two dataframes, df1 and df2:
df1.show()
+---+--------+-----+----+--------+
|cA | cB | cC | cD | cE |
+---+--------+-----+----+--------+
| A| abc | 0.1 | 0.0| 0 |
| B| def | 0.15| 0.5| 0 |
| C| ghi | 0.2 | 0.2| 1 |
| D| jkl | 1.1 | 0.1| 0 |
| E| mno | 0.1 | 0.1| 0 |
+---+--------+-----+----+--------+
df2.show()
+---+--------+-----+----+--------+
|cA | cB | cH | cI | cJ |
+---+--------+-----+----+--------+
| A| abc | a | b | ? |
| C| ghi | a | c | ? |
+---+--------+-----+----+--------+
I would like to update the cE column in df1, setting it to 1 if the row is referenced in df2. Each record is identified by the cA and cB columns.
Below is the desired output; note that the cE value of the first record was updated to 1:
+---+--------+-----+----+--------+
|cA | cB | cC | cD | cE |
+---+--------+-----+----+--------+
| A| abc | 0.1 | 0.0| 1 |
| B| def | 0.15| 0.5| 0 |
| C| ghi | 0.2 | 0.2| 1 |
| D| jkl | 1.1 | 0.1| 0 |
| E| mno | 0.1 | 0.1| 0 |
+---+--------+-----+----+--------+
When you need to update a column value based on other columns, the when clause comes in handy. Refer to the documentation for when and otherwise.
import pyspark.sql.functions as F

# Left join keeps exactly df1's rows; a full join would also pull in
# df2-only keys as rows with null df1 columns. A matched row has a
# non-null df2.cA, so cE becomes 1; unmatched rows fall through to 0.
df3 = (df1.join(df2, (df1.cA == df2.cA) & (df1.cB == df2.cB), "left")
          .withColumn('cE', F.when(df2.cA.isNotNull(), 1).otherwise(0))
          .select(df1.cA, df1.cB, df1.cC, df1.cD, 'cE'))
df3.show()
+---+---+----+---+---+
| cA| cB| cC| cD| cE|
+---+---+----+---+---+
| E|mno| 0.1|0.1| 0|
| B|def|0.15|0.5| 0|
| C|ghi| 0.2|0.2| 1|
| A|abc| 0.1|0.0| 1|
| D|jkl| 1.1|0.1| 0|
+---+---+----+---+---+
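The underlying logic is just a key-membership test: flag a df1 row when its (cA, cB) pair appears in df2. A minimal plain-Python sketch of the same semantics (rows and keys hard-coded from the example above, no Spark required):

```python
# df1 rows and the (cA, cB) key pairs present in df2, taken from the example.
df1_rows = [
    {"cA": "A", "cB": "abc", "cC": 0.1,  "cD": 0.0, "cE": 0},
    {"cA": "B", "cB": "def", "cC": 0.15, "cD": 0.5, "cE": 0},
    {"cA": "C", "cB": "ghi", "cC": 0.2,  "cD": 0.2, "cE": 1},
    {"cA": "D", "cB": "jkl", "cC": 1.1,  "cD": 0.1, "cE": 0},
    {"cA": "E", "cB": "mno", "cC": 0.1,  "cD": 0.1, "cE": 0},
]
df2_keys = {("A", "abc"), ("C", "ghi")}

# Recompute cE: 1 when the row's key is referenced in df2, else 0.
updated = [
    {**row, "cE": 1 if (row["cA"], row["cB"]) in df2_keys else 0}
    for row in df1_rows
]
print([r["cE"] for r in updated])  # [1, 0, 1, 0, 0]
```

This mirrors what the join + when/otherwise does in a distributed way: the join locates matching keys, and the conditional column rewrites cE based on whether a match was found.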