I'm trying to convert values from rows into separate columns, with each new column filled from another column. For example -
The input dataframe looks like this -
+-----------+
| X | Y | Z |
+-----------+
| 1 | A | a |
| 2 | A | b |
| 3 | A | c |
| 1 | B | d |
| 3 | B | e |
| 2 | C | f |
+-----------+
And the output dataframe should look like this -
+---+------+------+------+
| Y | 1    | 2    | 3    |
+---+------+------+------+
| A | a    | b    | c    |
| B | d    | null | e    |
| C | null | f    | null |
+---+------+------+------+
I've tried to groupBy on Y and collect_list on X and Z, and then zipped X and Z together to get key-value pairs. But some X values may be missing for some values of Y, so to fill them with nulls I cross-joined all possible values of X with all possible values of Y and then joined that back to my original dataframe. This approach is highly inefficient.
Is there a more efficient way to approach this problem? Thanks in advance.
You can simply use groupBy with pivot and first as the aggregate function:
import org.apache.spark.sql.functions._
df.groupBy("Y").pivot("X").agg(first("z"))
Output:
+---+----+----+----+
|Y |1 |2 |3 |
+---+----+----+----+
|B |d |null|e |
|C |null|f |null|
|A |a |b |c |
+---+----+----+----+
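For completeness, here is a self-contained sketch that recreates the sample data and runs the same pivot, assuming a local SparkSession. Passing the distinct X values explicitly to pivot is optional, but it saves Spark the extra job it would otherwise run to discover them.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("pivot-example").master("local[*]").getOrCreate()
import spark.implicits._

// Recreate the sample data from the question
val df = Seq(
  (1, "A", "a"), (2, "A", "b"), (3, "A", "c"),
  (1, "B", "d"), (3, "B", "e"), (2, "C", "f")
).toDF("X", "Y", "Z")

// Listing the pivot values up front avoids the extra pass over the data
// that Spark needs to compute the distinct values of X
df.groupBy("Y")
  .pivot("X", Seq(1, 2, 3))
  .agg(first("Z"))
  .show(false)

Note that the result rows come back in no particular order, as in the output above; add .orderBy("Y") before show if you need them sorted.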