 

Trying to create a column with the maximum timestamp in PySpark DataFrame

I am really new to PySpark. All I want to do is to find the maximum of the "date" column and add a new column in the dataframe that has this maximum date for all the rows (repeating) such that:

A      B                                        C
a  timestamp1                              timestamp3
b  timestamp2    -------------------->     timestamp3
c  timestamp3                              timestamp3

I use the following line of code:

df.withColumn('dummy_column',f.lit((f.max('date'))).cast('timestamp')).show(9)

But I get the error:

> AnalysisException: grouping expressions sequence is empty, and
> '`part`' is not an aggregate function. Wrap '(CAST(max(`date`) AS
> TIMESTAMP) AS `new_column`)' in windowing function(s) or wrap '`part`'
> in first() (or first_value) if you don't care which value you get.;;

Can anyone help me understand why I get this error and how I can work around it?

asked Sep 02 '25 by rocketsfallonrocketfalls


2 Answers

You're probably looking for:

import pyspark.sql.functions as f
from pyspark.sql.window import Window

w = Window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df.withColumn('dummy_column',f.max('date').over(w).cast('timestamp')).show(9)

Aggregate functions like max must be used with either a window or a grouping operation. They cannot work on their own because Spark has no way of knowing the range of rows over which the aggregate function should operate.

answered Sep 05 '25 by mck

One more alternative for getting the max timestamp: aggregate the existing dataframe with max(), collect the result into a variable, and use it as per your requirement.

Create the DataFrame:

import pyspark.sql.functions as F

df = spark.createDataFrame([(1,"2020-10-13"),(2,"2020-10-14"),(3,"2020-10-15")],["id","ts"])
df.show()

# Aggregate without grouping to get the global maximum timestamp
df_max = df.agg(F.max("ts").alias("max_ts"))
# Taking it into a variable for future use
df_max_var = df_max.collect()[0]['max_ts']
df = df.withColumn("dummy_col", F.lit(df_max_var))
df.show()

Input

+---+----------+
| id|        ts|
+---+----------+
|  1|2020-10-13|
|  2|2020-10-14|
|  3|2020-10-15|
+---+----------+

Output

+---+----------+----------+
| id|        ts| dummy_col|
+---+----------+----------+
|  1|2020-10-13|2020-10-15|
|  2|2020-10-14|2020-10-15|
|  3|2020-10-15|2020-10-15|
+---+----------+----------+
answered Sep 05 '25 by dsk