Is there a good way to use datediff with months? To clarify: the datediff function takes two columns and returns the number of days that have passed between the two dates. I'd like to have that in months. I want a parameter in my function that I can use to check data from, say, the last 20, 36, or however many months. If I just call datediff and divide the result by 30 (or 31), then the result is not quite accurate. I could use 30.4166667 (= 365 days / 12 months), but that is not quite accurate either for shorter periods. So, any tips on how to get months out of datediff? SQL has it like SELECT DATEDIFF(month, '2005-12-31 23:59:59.9999999', '2006-01-01 00:00:00.0000000');, and I'm looking for something like this in Spark.
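To make the inaccuracy concrete, here is a minimal plain-Python sketch (no Spark involved) comparing the day-count approximations with a calendar-based count. The helper calendar_months is hypothetical, written only for this illustration, and it assumes start is not later than end.

```python
from datetime import date


def calendar_months(start: date, end: date) -> int:
    """Whole calendar months from start to end (assumes start <= end)."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    if end.day < start.day:
        months -= 1  # the last month has not fully elapsed yet
    return months


start, end = date(2021, 2, 1), date(2021, 3, 1)

days = (end - start).days            # the kind of value datediff returns
approx_30 = days / 30                # fixed 30-day months
approx_avg = days / 30.4166667       # average month length
exact = calendar_months(start, end)  # calendar-aware count

print(days, approx_30, approx_avg, exact)
```

For February 2021 both divisions stay below 1, so truncating them gives 0 months, while the calendar-based count gives 1.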
Using the PySpark SQL functions datediff() and months_between(), you can calculate the difference between two dates in days, months, and years. Let's see this with a DataFrame example. You can also use these functions to calculate age.
Spark SQL provides the DataFrame function add_months() to add or subtract months from a date column, and date_add() and date_sub() to add and subtract days.
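For the "last N months" parameter the question asks about, subtracting months from today's date gives a cutoff to compare against. The following is a plain-Python sketch of that idea, not Spark code: months_ago is a hypothetical helper, and clamping the day to the length of the target month is an assumption of this sketch.

```python
import calendar
from datetime import date


def months_ago(today: date, n_months: int) -> date:
    """Return the date n_months before today.

    Assumption: if the target month is shorter, the day is clamped
    to that month's last day (e.g. 31 March minus 1 month -> 29 Feb).
    """
    total = today.year * 12 + (today.month - 1) - n_months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(today.day, last_day))


cutoff_20 = months_ago(date(2024, 1, 15), 20)
cutoff_clamped = months_ago(date(2024, 3, 31), 1)
print(cutoff_20, cutoff_clamped)
```

In Spark itself, a comparable cutoff could probably be expressed as add_months(current_date(), -n) and used in a filter on the date column, so that n becomes the function parameter.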
pyspark.sql.functions.datediff(end, start): Returns the number of days from start to end.
A timestamp difference in PySpark can be calculated by 1) using unix_timestamp() to get each time in seconds and subtracting one from the other to get the difference in seconds, or 2) casting the TimestampType columns to LongType and subtracting the two long values to get the difference in seconds, dividing it by 60 to get the difference in minutes, and finally ...
You can try months_between:
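import static org.apache.spark.sql.functions.*;
// The result is positive when col1 is later than col2. On Spark 2.x and later the Java type is Dataset<Row>.
DataFrame newDF = df.withColumn("monthDiff", months_between(col("col1"), col("col2")));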
This worked for me:
from pyspark.sql.functions import months_between
data = sqlContext.sql('''
SELECT DISTINCT mystartdate, myenddate,
 -- months_between(later, earlier) is positive, so the end date goes first
 CAST(months_between(myenddate, mystartdate) AS int) AS months_tenure
FROM mydatabase
''')
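Since the query above casts the result to int, it helps to see what kind of fractional value gets truncated. Below is a simplified plain-Python model of how months_between is documented to behave for date-only inputs: a whole number when both dates fall on the same day of the month or are both month ends, otherwise a fraction based on 31-day months, rounded to 8 digits. The name months_between_sketch is hypothetical, and this is an approximation of the documented rule, not Spark's implementation.

```python
import calendar
from datetime import date


def _is_month_end(d: date) -> bool:
    return d.day == calendar.monthrange(d.year, d.month)[1]


def months_between_sketch(end: date, start: date) -> float:
    """Approximate model of months_between for dates without a time part."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    # Same day of month, or both month ends: whole number of months.
    if end.day == start.day or (_is_month_end(end) and _is_month_end(start)):
        return float(months)
    # Otherwise the leftover days count as a fraction of a 31-day month.
    return round(months + (end.day - start.day) / 31, 8)


same_day = months_between_sketch(date(2021, 3, 15), date(2021, 1, 15))
month_ends = months_between_sketch(date(2021, 2, 28), date(2021, 1, 31))
partial = months_between_sketch(date(2021, 3, 1), date(2021, 1, 31))
print(same_day, month_ends, partial, int(partial))
```

Under this model the integer cast simply drops the partial month, so a span from 31 January to 1 March counts as 1 month.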