I am analysing some data with PySpark DataFrames. Suppose I have a DataFrame df
that I am aggregating:
(df.groupBy("group")
   .agg({"money": "sum"})
   .show(100))
This will give me:
group    SUM(money#2L)
A        137461285853
B        172185566943
C        271179590646
The aggregation works just fine, but I dislike the new column name SUM(money#2L). Is there a way to rename this column into something human-readable from the .agg method? Maybe something more similar to what one would do in dplyr:
df %>% group_by(group) %>% summarise(sum_money = sum(money))
One way to change the column name after a groupBy is to chain withColumnRenamed onto the aggregated DataFrame, renaming the generated column (e.g. sum(money)) to whatever you like.
Although I still prefer dplyr syntax, this code snippet will do:
import pyspark.sql.functions as sf

(df.groupBy("group")
   .agg(sf.sum('money').alias('money'))
   .show(100))
It gets verbose.