Dynamically infer Schema of returned object from UDF in pySpark

I want to use a UDF in PySpark that returns not an atomic value but a nested structure. I know that I can register the UDF and manually set the schema of the object it will return, e.g.

from pyspark.sql.types import (ArrayType, StructType, StructField,
                               IntegerType, StringType)

format = ArrayType(
    StructType([
        StructField('id', IntegerType()),
        StructField('text', StringType())
    ])
)
spark.udf.register('functionName', functionObject, format)

and use Python lists inside the UDF to match that format, e.g.

return [[1,'A'],[2,'B']]

but is there any way to avoid explicitly setting the return type when registering the UDF, and instead automatically infer its schema?

If I don't set a return type, it is automatically set to StringType.
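
For context, here is a minimal end-to-end sketch of the approach described above. The function name tag_tokens and the sample data are made up for illustration; the rendering of the nested result in the comment assumes Spark 3.x.

from pyspark.sql import SparkSession
from pyspark.sql.types import (ArrayType, StructType, StructField,
                               IntegerType, StringType)

spark = SparkSession.builder.getOrCreate()

# Hypothetical UDF returning a nested structure: each struct row is a
# plain Python list matching the declared fields (id, text).
def tag_tokens(s):
    return [[i + 1, tok] for i, tok in enumerate(s.split())]

schema = ArrayType(StructType([
    StructField('id', IntegerType()),
    StructField('text', StringType())
]))

spark.udf.register('tag_tokens', tag_tokens, schema)
spark.sql("SELECT tag_tokens('A B') AS tagged").show(truncate=False)
# On Spark 3.x this prints something like: [{1, A}, {2, B}]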

asked Oct 20 '25 by Johnny16


1 Answer

is there any way to avoid explicitly setting the return type when registering the UDF, and instead automatically infer its schema?

There is not. The schema has to be known before the UDF is called: Spark resolves the query plan before any Python code runs, so the return type cannot be inferred at runtime.
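
One way to trim the boilerplate, though the type still has to be spelled out, is to pass the return type as a DDL-formatted string, which udf() and spark.udf.register() both accept (Spark 2.3+). A sketch, reusing the hypothetical tag_tokens function from the question:

from pyspark.sql.functions import udf

# Same schema as the ArrayType(StructType([...])) declaration,
# written as a DDL type string; still explicit, just shorter.
tag_tokens = udf(lambda s: [[i + 1, t] for i, t in enumerate(s.split())],
                 'array<struct<id:int,text:string>>')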

answered Oct 23 '25 by user7718275


