 

Rounding hours of datetime in PySpark

I'm trying to round hours using PySpark and a UDF.

The function works properly in plain Python, but not when used through PySpark.

The input is:

from pandas import Timestamp
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

date = Timestamp('2016-11-18 01:45:55')  # type is pandas._libs.tslibs.timestamps.Timestamp

def time_feature_creation_spark(date):
    return date.round("H").hour

time_feature_creation_udf = udf(lambda x: time_feature_creation_spark(x), IntegerType())


Then I use it on the DataFrame column:

data = data.withColumn("hour", time_feature_creation_udf(data["date"]))

And the error is:

TypeError: 'Column' object is not callable

The expected output is the hour closest to the time in the datetime (e.g. 20:45 is closest to 21:00, so it returns 21).
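
For reference, here is a minimal sketch of a UDF that does this rounding with plain datetime arithmetic. It assumes the column is a Spark TimestampType, which PySpark passes into a UDF as a datetime.datetime rather than a pandas Timestamp, so .round() is not available inside the UDF. The names nearest_hour and nearest_hour_udf are placeholders:

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def nearest_hour(dt):
    # dt arrives as datetime.datetime, which has no .round()
    if dt is None:
        return None
    # round half upward: minute >= 30 bumps to the next hour
    hour = dt.hour + (1 if dt.minute >= 30 else 0)
    return hour % 24  # wrap 23:45 -> 0

nearest_hour_udf = udf(nearest_hour, IntegerType())
# data = data.withColumn("hour", nearest_hour_udf(data["date"]))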

asked by LaSul

1 Answer

A nicer approach than dividing and multiplying by 3600 is the built-in function date_trunc:

import pyspark.sql.functions as F

df = df.withColumn("hourly_timestamp", F.date_trunc("hour", df.timestamp))

Other formats besides 'hour' are:

'year', 'yyyy', 'yy', 'month', 'mon', 'mm', 'day', 'dd', 'hour', 'minute', 'second', 'week', 'quarter'
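
Note that date_trunc truncates downward, while the question asks for the nearest hour. One way to get nearest-hour rounding without a UDF, sketched here under the assumption of a timestamp column named timestamp, is to shift the value by 30 minutes before truncating:

import pyspark.sql.functions as F

# 20:45 + 30 min = 21:15, truncated to 21:00 -- i.e. rounded to the nearest hour
df = df.withColumn(
    "rounded",
    F.date_trunc("hour", F.col("timestamp") + F.expr("INTERVAL 30 MINUTES")),
)

F.hour("rounded") then extracts the hour as an integer, matching the question's expected output.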

answered by LN_P


