Change month numbers to month name in a dataframe (PySpark)

Question

I have a column of month numbers in a DataFrame and want to convert it to month names. I tried the following, which raises a TypeError:

df['monthName'] = df['monthNumber'].apply(lambda x: calendar.month_name[x])

TypeError: 'Column' object is not callable

How can I get the month name?

I'm using Spark 2.1.1 and Python 2.7.6.

This is my code for the airline data analysis:

df_withDelay = df_mappedCarrierNames.filter(df_mappedCarrierNames.ArrDelay > 0)
sqlContext.registerDataFrameAsTable(df_withDelay,"SFO_ArrDelayAnalysisTable")
df_SFOArrDelay = sqlContext.sql \
                      ("select sfo.Month, sum(sfo.ArrDelay) as TotalArrivalDelay \
                      from SFO_ArrDelayAnalysisTable sfo \
                      where (sfo.Dest = 'SFO') \
                      group by sfo.Month")

I am trying to plot Month against ArrDelay. The code above returns Month as a number, so I tried the following:

udf = UserDefinedFunction(lambda x: calendar.month_abbr[int(x)], StringType())
new_df_mappedCarrierNames = df_mappedCarrierNames.select(*[udf(column).alias(name) if column == name else column for column in df_mappedCarrierNames.columns])

It works, but the months in my graph are no longer in sorted order, whereas with month numbers they are. How can I map month numbers to month names and keep them sorted from Jan to Dec?

Krzysztof Przysowa · Accepted Answer

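I would avoid UDFs where possible: a Python UDF ships every row between the JVM and a Python worker, so it scales poorly. Try combining to_date() and date_format() and casting the result to an integer. The example below converts a month name to its number: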

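from pyspark.sql.functions import col, date_format, to_date

# Parse the full month name (pattern 'MMMM', e.g. 'January') into a date, then format it as 'MM' and cast to int.
# Note: to_date() only accepts a format argument from Spark 2.2 onwards.
df = df.withColumn('monthNumber', date_format(to_date(col('monthName'), 'MMMM'), 'MM').cast('int'))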
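For the direction the question asks about (number to name), one UDF-free option is a literal map column built with create_map(). This is only a minimal sketch: it assumes the numeric column is called Month, as in the question's SQL, and the helper names month_abbr_by_number and with_month_name and the output column MonthName are made up for illustration.

```python
import calendar
from itertools import chain


def month_abbr_by_number():
    # {1: 'Jan', 2: 'Feb', ..., 12: 'Dec'} in the default locale.
    # calendar.month_abbr[0] is an empty string, so start at 1.
    return {number: calendar.month_abbr[number] for number in range(1, 13)}


def with_month_name(df, number_col="Month", name_col="MonthName"):
    # Hypothetical helper; pyspark is imported here so the lookup above
    # can be used and checked without a Spark installation.
    from pyspark.sql.functions import col, create_map, lit

    # create_map() takes alternating key/value columns:
    # lit(1), lit('Jan'), lit(2), lit('Feb'), ...
    pairs = chain.from_iterable(month_abbr_by_number().items())
    lookup = create_map(*[lit(value) for value in pairs])
    # Index the map column by the (integer) month number to get the name.
    return df.withColumn(name_col, lookup[col(number_col).cast("int")])
```

Keeping the numeric Month column next to MonthName means the result can still be ordered with orderBy('Month') before plotting.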

Details of date formatting codes: http://tutorials.jenkov.com/java-internationalization/simpledateformat.html
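On the sort order: month names sort alphabetically (Apr, Aug, Dec, ...), so a plot built from the names alone loses calendar order. One way around this is to sort on the number and use the names only as labels. The sketch below assumes the aggregated result is small enough to collect to the driver (at most 12 rows here); the sample values and the helper name are invented for illustration.

```python
import calendar


def month_labels_in_calendar_order(rows):
    """rows: iterable of (month_number, total_delay) pairs,
    e.g. the result of df_SFOArrDelay.collect()."""
    # Sort on the month number, not on the name.
    ordered = sorted(rows, key=lambda row: int(row[0]))
    labels = [calendar.month_abbr[int(month)] for month, _ in ordered]
    totals = [total for _, total in ordered]
    return labels, totals


# Hypothetical values, only to show the shape of the input.
sample = [(12, 300.0), (1, 120.0), (4, 80.0)]
labels, totals = month_labels_in_calendar_order(sample)
```

The two lists can then be passed to the plotting library as tick labels and values, so the x-axis follows calendar order.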

Tags: date, dataframe, apache-spark, apache-spark-sql, pyspark

anaga
