 

Pivot on two columns with both numeric and categorical value in pySpark

I have a data set in PySpark like this:

    from collections import namedtuple

    user_row = namedtuple('user_row', 'id time category value'.split())
    data = [
        user_row(1,1,'speed','50'),
        user_row(1,1,'speed','60'),
        user_row(1,2,'door', 'open'),
        user_row(1,2,'door','open'),
        user_row(1,2,'door','close'),
        user_row(1,2,'speed','75'),
        user_row(2,10,'speed','30'), 
        user_row(2,11,'door', 'open'),
        user_row(2,12,'door','open'),
        user_row(2,13,'speed','50'),
        user_row(2,13,'speed','40')
    ]
    
    user_df = spark.createDataFrame(data)
    user_df.show()
+---+----+--------+-----+
| id|time|category|value|
+---+----+--------+-----+
|  1|   1|   speed|   50|
|  1|   1|   speed|   60|
|  1|   2|    door| open|
|  1|   2|    door| open|
|  1|   2|    door|close|
|  1|   2|   speed|   75|
|  2|  10|   speed|   30|
|  2|  11|    door| open|
|  2|  12|    door| open|
|  2|  13|   speed|   50|
|  2|  13|   speed|   40|
+---+----+--------+-----+

What I want to get is something like below: group by id and time, pivot on category, and return the average if the value is numeric or the mode if it is categorical.

+---+----+--------+-----+
| id|time|    door|speed|
+---+----+--------+-----+
|  1|   1|    null|   55|
|  1|   2|    open|   75|
|  2|  10|    null|   30|
|  2|  11|    open| null|
|  2|  12|    open| null|
|  2|  13|    null|   45|
+---+----+--------+-----+

I tried this, but for categorical values it returns null (I am not worried about the nulls in the speed column):

    from pyspark.sql.functions import avg

    df = user_df\
        .groupBy('id','time')\
        .pivot('category')\
        .agg(avg('value'))\
        .orderBy(['id', 'time'])

    df.show()

+---+----+----+-----+
| id|time|door|speed|
+---+----+----+-----+
|  1|   1|null| 55.0|
|  1|   2|null| 75.0|
|  2|  10|null| 30.0|
|  2|  11|null| null|
|  2|  12|null| null|
|  2|  13|null| 45.0|
+---+----+----+-----+
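
I suspect this is because avg implicitly casts value to a numeric type, so non-numeric strings like 'open' become null. For example:

    from pyspark.sql.functions import col

    user_df.select('value', col('value').cast('double').alias('as_double')).show()
    # the 'speed' rows survive the cast; 'open'/'close' come back as null,
    # which is why avg('value') over the door groups is null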
Maddie asked Dec 20 '25


1 Answer

You can add a second aggregation to the pivot and coalesce the resulting columns. Try this:

import pyspark.sql.functions as F
from collections import namedtuple

user_row = namedtuple('user_row', 'id time category value'.split())
data = [
    user_row(1,1,'speed','50'),
    user_row(1,1,'speed','60'),
    user_row(1,2,'door', 'open'),
    user_row(1,2,'door','open'),
    user_row(1,2,'door','close'),
    user_row(1,2,'speed','75'),
    user_row(2,10,'speed','30'), 
    user_row(2,11,'door', 'open'),
    user_row(2,12,'door','open'),
    user_row(2,13,'speed','50'),
    user_row(2,13,'speed','40')
]

user_df = spark.createDataFrame(data)
df = user_df.groupBy('id','time')\
            .pivot('category')\
            .agg(F.avg('value').alias('avg'), F.max('value').alias('max'))

# pair each pivoted avg column with its max counterpart per category
expr1 = [x for x in df.columns if '_avg' in x]
expr2 = [x for x in df.columns if '_max' in x]
expr = zip(expr1, expr2)

# keep the avg where it exists (numeric categories), else fall back to the max
sel_expr = [F.coalesce(x[0], x[1]).alias(x[0].split('_')[0]) for x in expr]

df_final = df.select('id','time',*sel_expr).orderBy('id','time')

df_final.show()
+---+----+----+-----+
| id|time|door|speed|
+---+----+----+-----+
|  1|   1|null| 55.0|
|  1|   2|open| 75.0|
|  2|  10|null| 30.0|
|  2|  11|open| null|
|  2|  12|open| null|
|  2|  13|null| 45.0|
+---+----+----+-----+
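
One caveat: F.max on a string column returns the lexicographic maximum, which happens to coincide with the mode for this data ('open' sorts after 'close'), but it is not a true mode in general. If you need the actual most frequent value, here is a rough sketch using a window over per-group value counts (the counts/modes/w names are just placeholders; ties break arbitrarily):

from pyspark.sql import Window

# count occurrences of each value within each id/time/category group
counts = user_df.groupBy('id', 'time', 'category', 'value').count()

# rank values by frequency within each group and keep the most frequent one
w = Window.partitionBy('id', 'time', 'category').orderBy(F.desc('count'))
modes = counts.withColumn('rn', F.row_number().over(w)) \
              .filter('rn = 1') \
              .drop('count', 'rn')

You could then pivot modes with F.first('value') for the categorical columns and coalesce with the numeric averages as above.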
Raghu answered Dec 21 '25


