I need the resulting DataFrame from the line below to have the alias name "maxDiff" for the max('diff') column after groupBy. However, the line below does not make any change, nor does it throw an error.
    grpdf = joined_df.groupBy(temp1.datestamp).max('diff').alias("maxDiff")

Use alias(): use an SQL aggregate function such as sum() to perform the summary aggregation. It returns a Column, and calling alias() on that Column renames the resulting DataFrame column.
When we perform groupBy() on a PySpark DataFrame, it returns a GroupedData object, which provides the aggregate functions below.
count() – Use groupBy() count() to return the number of rows for each group.
mean() – Returns the mean of values for each group.
max() – Returns the maximum of values for each group.
You can use agg instead of calling the max method:

    # Note: this import shadows Python's built-in max in the current module.
    from pyspark.sql.functions import max

    joined_df.groupBy(temp1.datestamp).agg(max("diff").alias("maxDiff"))

Similarly in Scala:

    import org.apache.spark.sql.functions.max

    joined_df.groupBy($"datestamp").agg(max("diff").alias("maxDiff"))

or

    joined_df.groupBy($"datestamp").agg(max("diff").as("maxDiff"))

The original line changes nothing because you are aliasing the whole DataFrame object, not the Column. Here's an example of how to alias the Column only:
    import pyspark.sql.functions as func

    grpdf = joined_df \
        .groupBy(temp1.datestamp) \
        .max('diff') \
        .select(func.col("max(diff)").alias("maxDiff"))

Note that this select keeps only the maxDiff column; include the grouping column in the select if you also need it.