I have a DataFrame named 'new_emp_final_1'. When I try to derive a column 'difficulty' from cookTime and prepTime by calling the function difficulty through a UDF, I get an error.
The output of new_emp_final_1.dtypes is below:
[('name', 'string'), ('ingredients', 'string'), ('url', 'string'), ('image', 'string'), ('cookTime', 'string'), ('recipeYield', 'string'), ('datePublished', 'string'), ('prepTime', 'string'), ('description', 'string')]
The result of new_emp_final_1.schema is:
StructType(List(StructField(name,StringType,true),StructField(ingredients,StringType,true),StructField(url,StringType,true),StructField(image,StringType,true),StructField(cookTime,StringType,true),StructField(recipeYield,StringType,true),StructField(datePublished,StringType,true),StructField(prepTime,StringType,true),StructField(description,StringType,true)))
Code:
def difficulty(cookTime, prepTime):
    if not cookTime or not prepTime:
        return "Unknown"
    total_duration = cookTime + prepTime
    if total_duration > 3600:
        return "Hard"
    elif total_duration > 1800 and total_duration < 3600:
        return "Medium"
    elif total_duration < 1800:
        return "Easy"
    else:
        return "Unknown"

func_udf = udf(difficulty, IntegerType())
new_emp_final_1 = new_emp_final_1.withColumn("difficulty", func_udf(new_emp_final_1.cookTime, new_emp_final_1.prepTime))
new_emp_final_1.show(20, False)
The error is:
File "/home/raghavcomp32915/mypycode.py", line 56, in <module>
func_udf = udf(difficulty, IntegerType())
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/udf.py", line 186, in wrapper
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/udf.py", line 166, in __call__
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/column.py", line 66, in _to_seq
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/column.py", line 54, in _to_java_column
TypeError: Invalid argument, not a string or column: <function difficulty at 0x7f707e9750c8> of type <type 'function'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
I am expecting a column named difficulty in the existing DataFrame new_emp_final_1 with the values Hard, Medium, Easy, or Unknown.
I ran into this issue with Python's sum because it conflicted with Spark's SQL sum, a real-life illustration of why
from pyspark.sql.functions import *
is a bad idea. The solution was either to restrict the import to the functions actually needed, or to import pyspark.sql.functions as a module and prefix the needed functions with it. The same kind of name clash is likely what is happening with udf here.
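For illustration, here is a minimal sketch of the second approach applied to this UDF. It assumes cookTime and prepTime have already been converted to numeric seconds (the posted schema shows both as strings, so a parsing step would be needed first), and it declares the return type as StringType rather than IntegerType, since difficulty returns strings:

import pyspark.sql.functions as F
from pyspark.sql.types import StringType

def difficulty(cookTime, prepTime):
    # Treat missing values as unknown difficulty.
    if not cookTime or not prepTime:
        return "Unknown"
    total_duration = cookTime + prepTime  # assumed to be seconds
    if total_duration > 3600:
        return "Hard"
    elif total_duration > 1800:
        return "Medium"
    else:
        return "Easy"

# Referencing udf through the module prefix (F.udf) avoids any name clash
# introduced by a wildcard import.
func_udf = F.udf(difficulty, StringType())

new_emp_final_1 = new_emp_final_1.withColumn(
    "difficulty", func_udf(new_emp_final_1.cookTime, new_emp_final_1.prepTime)
)
new_emp_final_1.show(20, False)

Note that, unlike the original function, the boundary cases of exactly 3600 and 1800 seconds fall into Medium and Easy here instead of dropping through to Unknown.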