
Using pyspark groupBy with a custom function in agg

I would like to groupBy my Spark df with a custom agg function:

def gini(list_of_values):
    # some processing happens here
    return number_output  # a single number


I would like to get something like this:

df.groupby('activity')['mean_event_duration_in_hours'].agg(gini)

Could you please help me resolve this?

asked Sep 14 '25 by Sebastian Kowalczykiewicz

1 Answer

You can create a udf like so:

import pyspark.sql.functions as F
from pyspark.sql.types import FloatType

def gini(list_of_values):
    # some processing happens here
    return number_output

udf_gini = F.udf(gini, FloatType())

df.groupby('activity')\
    .agg(F.collect_list("mean_event_duration_in_hours").alias("event_duration_list"))\
    .withColumn("gini", udf_gini(F.col("event_duration_list")))

Or define gini as a UDF with the @udf decorator, like this:

from pyspark.sql.functions import udf

@udf(returnType=FloatType())
def gini(list_of_values):
    # some processing happens here
    return number_output

df.groupby('activity')\
    .agg(F.collect_list("mean_event_duration_in_hours").alias("event_duration_list"))\
    .withColumn("gini", gini(F.col("event_duration_list")))
answered Sep 17 '25 by Jan Jaap Meijerink