I have a column of month numbers in a DataFrame and want to convert it to month names. I tried the following, which raised a TypeError:
df['monthName'] = df['monthNumber'].apply(lambda x: calendar.month_name[x])
TypeError: 'Column' object is not callable
How can I get the month name?
I'm using Spark 2.1.1 and Python 2.7.6.
This is my code for Airline data Analysis:
df_withDelay = df_mappedCarrierNames.filter(df_mappedCarrierNames.ArrDelay > 0)
sqlContext.registerDataFrameAsTable(df_withDelay,"SFO_ArrDelayAnalysisTable")
df_SFOArrDelay = sqlContext.sql \
("select sfo.Month, sum(sfo.ArrDelay) as TotalArrivalDelay \
from SFO_ArrDelayAnalysisTable sfo \
where (sfo.Dest = 'SFO') \
group by sfo.Month")
I am trying to plot a graph of Month vs ArrDelay. The code above gives me Month as a number, so I tried the following:
udf = UserDefinedFunction(lambda x: calendar.month_abbr[int(x)], StringType())
name = 'Month'  # column to convert (assumed from the query above)
new_df_mappedCarrierNames = df_mappedCarrierNames.select(*[udf(column).alias(name) if column == name else column for column in df_mappedCarrierNames.columns])
It works, but in my graph the months are not in sorted order, whereas if I use the month numbers they are. My issue is how to map month numbers to month names while keeping them sorted from Jan to Dec.
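One way to keep the Jan-to-Dec order is to sort on the numeric month and only translate to names when building the plot labels. Below is a minimal plain-Python sketch of that idea. The `rows` values are made up for illustration and stand in for whatever the grouped query returns once collected.

```python
import calendar

# Hypothetical (month number, total delay) pairs, e.g. the collected
# result of the grouped query; the numbers are invented for illustration.
rows = [(3, 120.0), (1, 300.0), (12, 80.0), (2, 150.0)]

# Sort on the numeric month first, then translate to labels for the plot.
rows_sorted = sorted(rows, key=lambda r: r[0])
labels = [calendar.month_abbr[month] for month, _ in rows_sorted]
delays = [delay for _, delay in rows_sorted]

print(labels)
print(delays)
```

Sorting the name strings directly would give alphabetical order (Apr, Aug, Dec, ...), which is probably why the graph looks unsorted.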
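I would avoid UDFs where possible, since Python UDFs tend to be slower than built-in column functions. Try a combination of to_date(), date_format() and a cast to integer. The snippet below goes from month name to month number, and it relies on the two-argument form of to_date(), which was added in Spark 2.2: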
from pyspark.sql.functions import col, date_format, to_date
df = df.withColumn('monthNumber', date_format(to_date(col('monthName'), 'MMMM'), 'MM').cast('int'))
Details of date formatting codes: http://tutorials.jenkov.com/java-internationalization/simpledateformat.html
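For the number-to-name direction without a UDF, one possible alternative (not part of the answer above) is a plain SQL CASE expression, which should also work on older Spark versions. The sketch below only builds the expression string in Python. The helper name `month_name_case_expr` and the output column `MonthName` are hypothetical, and the PySpark usage is shown as a comment rather than executed.

```python
import calendar


def month_name_case_expr(column="Month"):
    """Build a SQL CASE expression mapping 1..12 to abbreviated month names."""
    branches = " ".join(
        "WHEN {0} THEN '{1}'".format(n, calendar.month_abbr[n])
        for n in range(1, 13)
    )
    # Cast to INT so the comparison also works if the column holds strings.
    return "CASE CAST({0} AS INT) {1} END".format(column, branches)


case_sql = month_name_case_expr("Month")
print(case_sql)

# With PySpark available, the string could be used roughly like this
# (assumed usage, not run here):
#   from pyspark.sql.functions import expr
#   df = df.withColumn("MonthName", expr(case_sql)).orderBy("Month")
```

Keeping the numeric Month column next to the new name column means the ordering can still be done on the number.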