I have a DataFrame named 'new_emp_final_1'. When I try to derive a column 'difficulty' from cookTime and prepTime by calling the function difficulty through a UDF, I get an error.
The output of new_emp_final_1.dtypes is below:
[('name', 'string'), ('ingredients', 'string'), ('url', 'string'), ('image', 'string'), ('cookTime', 'string'), ('recipeYield', 'string'), ('datePublished', 'string'), ('prepTime', 'string'), ('description', 'string')]
The result of new_emp_final_1.schema is:
StructType(List(StructField(name,StringType,true),StructField(ingredients,StringType,true),StructField(url,StringType,true),StructField(image,StringType,true),StructField(cookTime,StringType,true),StructField(recipeYield,StringType,true),StructField(datePublished,StringType,true),StructField(prepTime,StringType,true),StructField(description,StringType,true)))
Code:
def difficulty(cookTime, prepTime):
    if not cookTime or not prepTime:
        return "Unknown"
    total_duration = cookTime + prepTime
    if total_duration > 3600:
        return "Hard"
    elif total_duration > 1800 and total_duration < 3600:
        return "Medium"
    elif total_duration < 1800:
        return "Easy"
    else:
        return "Unknown"

func_udf = udf(difficulty, IntegerType())
new_emp_final_1 = new_emp_final_1.withColumn("difficulty", func_udf(new_emp_final_1.cookTime, new_emp_final_1.prepTime))
new_emp_final_1.show(20, False)
The error is:
File "/home/raghavcomp32915/mypycode.py", line 56, in <module>
func_udf = udf(difficulty, IntegerType())
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/udf.py", line 186, in wrapper
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/udf.py", line 166, in __call__
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/column.py", line 66, in _to_seq
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/column.py", line 54, in _to_java_column
TypeError: Invalid argument, not a string or column: <function difficulty at 0x7f707e9750c8> of type <type 'function'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
I am expecting a column named difficulty in the existing DataFrame new_emp_final_1 with the values Hard, Medium, Easy, or Unknown.
I ran into this issue with Python's sum because it conflicted with Spark's SQL sum, a real-life illustration of why
from pyspark.sql.functions import *
is a bad idea. The solution was either to restrict the import to the functions actually needed, or to import pyspark.sql.functions as a module and prefix the needed functions with it. The same kind of name clash is likely what is happening with udf here.
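For illustration, here is a minimal sketch of the second approach applied to this UDF. It assumes cookTime and prepTime have already been converted to numeric seconds (the posted schema shows both as strings, so a parsing step would be needed first), and it declares the return type as StringType rather than IntegerType, since difficulty returns strings:

import pyspark.sql.functions as F
from pyspark.sql.types import StringType

def difficulty(cookTime, prepTime):
    # Treat missing values as unknown difficulty.
    if not cookTime or not prepTime:
        return "Unknown"
    total_duration = cookTime + prepTime  # assumed to be seconds
    if total_duration > 3600:
        return "Hard"
    elif total_duration > 1800:
        return "Medium"
    else:
        return "Easy"

# Referencing udf through the module prefix (F.udf) avoids any name clash
# introduced by a wildcard import.
func_udf = F.udf(difficulty, StringType())

new_emp_final_1 = new_emp_final_1.withColumn(
    "difficulty", func_udf(new_emp_final_1.cookTime, new_emp_final_1.prepTime)
)
new_emp_final_1.show(20, False)

Note that, unlike the original function, the boundary cases of exactly 3600 and 1800 seconds fall into Medium and Easy here instead of dropping through to Unknown.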