Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

PySpark program is throwing error "TypeError: Invalid argument, not a string or column"

I have a dataframe named 'new_emp_final_1'. When I try to derive a column 'difficulty' from cookTime and prepTime, by calling the function difficulty from a udf, it is giving me error.

new_emp_final_1.dtypes is below -

[('name', 'string'), ('ingredients', 'string'), ('url', 'string'), ('image', 'string'), ('cookTime', 'string'), ('recipeYield', 'string'), ('datePublished', 'strin
g'), ('prepTime', 'string'), ('description', 'string')]

Result of new_emp_final_1.schema is -

StructType(List(StructField(name,StringType,true),StructField(ingredients,StringType,true),StructField(url,StringType,true),StructField(image,StringType,true),StructField(cookTime,StringType,true),StructField(recipeYield,StringType,true),StructField(datePublished,StringType,true),StructField(prepTime,StringType,true),StructField(description,StringType,true)))

Code:

def difficulty(cookTime, prepTime):   
    if not cookTime or not prepTime:
        return "Unkown"

    total_duration = cookTime + prepTime
    if total_duration > 3600:
        return "Hard"
    elif total_duration > 1800 and total_duration < 3600:
        return "Medium"
    elif total_duration < 1800:
        return "Easy" 
    else: 
        return "Unkown"

func_udf = udf(difficulty, IntegerType())
new_emp_final_1 = new_emp_final_1.withColumn("difficulty", func_udf(new_emp_final_1.cookTime, new_emp_final_1.prepTime))
new_emp_final_1.show(20,False)

Error is -

File "/home/raghavcomp32915/mypycode.py", line 56, in <module> func_udf = udf(difficulty, IntegerType()) File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/udf.py", line 186, in wrapper File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/udf.py", line 166, in __call__ File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/column.py", line 66, in _to_seq File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/column.py", line 54, in _to_java_column TypeError: Invalid argument, not a string or column: <function difficulty at 0x7f707e9750c8> of type <type 'function'>. For column literals, use 'lit', 'array', 's truct' or 'create_map' function.

I am expecting a column named difficulty in existing dataframe new_emp_final_1 with values as Hard, Medium, Easy or Unknown.

like image 235
Raghavendra Gupta Avatar asked Sep 02 '25 16:09

Raghavendra Gupta


1 Answers

I ran into this issue with Python’s sum because there was a conflict with Spark’s SQL sum — a real-life illustration of why this :

from pyspark.sql.functions import *

is bad.

It goes without saying that the solution was to either restrict the import to the needed functions or to import pyspark.sql.functions and prefix the needed functions with it.

like image 75
Skippy le Grand Gourou Avatar answered Sep 04 '25 04:09

Skippy le Grand Gourou