
Renaming columns for PySpark DataFrame aggregates

I am analysing some data with PySpark DataFrames. Suppose I have a DataFrame df that I am aggregating:

(df.groupBy("group")    .agg({"money":"sum"})    .show(100) ) 

This will give me:

group                SUM(money#2L)
A                    137461285853
B                    172185566943
C                    271179590646

The aggregation works just fine but I dislike the new column name SUM(money#2L). Is there a way to rename this column into something human readable from the .agg method? Maybe something more similar to what one would do in dplyr:

df %>% group_by(group) %>% summarise(sum_money = sum(money)) 
asked May 01 '15 by cantdutchthis



1 Answer

Although I still prefer dplyr syntax, this code snippet will do:

import pyspark.sql.functions as sf

(df.groupBy("group")
   .agg(sf.sum('money').alias('money'))
   .show(100))

It gets verbose.
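In case it helps, here is a minimal sketch showing that the same .alias() trick scales to several aggregates at once, and that you can also rename after the fact with toDF or withColumnRenamed. The names sum_money and avg_money are just illustrative, and the auto-generated column name you would pass to withColumnRenamed (sum(money) in recent releases, SUM(money#2L) in the older output shown above) depends on your Spark version:

import pyspark.sql.functions as sf

# Several aggregates at once, each renamed inline with .alias()
summary = (df.groupBy("group")
             .agg(sf.sum("money").alias("sum_money"),
                  sf.avg("money").alias("avg_money")))

# Alternative: rename after aggregating. toDF replaces all column
# names positionally, so no need to know the auto-generated name.
summary2 = (df.groupBy("group")
              .agg({"money": "sum"})
              .toDF("group", "sum_money"))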

answered Sep 17 '22 by cantdutchthis


