I want to use a UDF in PySpark which doesn't return an atomic value but a nested structure. I know that I can register the UDF and manually set the schema of the object it will return, e.g.
from pyspark.sql.types import ArrayType, StructType, StructField, IntegerType, StringType

# An array of structs, each with an integer id and a string text field
format = ArrayType(
    StructType([
        StructField('id', IntegerType()),
        StructField('text', StringType())
    ])
)
spark.udf.register('functionName', functionObject, format)
and use plain Python lists inside the UDF to match that schema, e.g.
return [[1, 'A'], [2, 'B']]
but is there any way to avoid explicitly setting the return type when registering the UDF, and instead automatically infer its schema?
If I don't set a return type, it is automatically set to StringType.
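For reference, here is a complete, runnable version of what I mean (a minimal sketch: the zero-argument UDF body returning constants and the SELECT query are illustrative placeholders of mine; functionName and functionObject are the names from above):

from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.getOrCreate()

# Explicit schema: an array of (id, text) structs
format = ArrayType(
    StructType([
        StructField('id', IntegerType()),
        StructField('text', StringType())
    ])
)

def functionObject():
    # Spark converts plain Python lists/tuples into the declared struct type
    return [[1, 'A'], [2, 'B']]

spark.udf.register('functionName', functionObject, format)

# The result column gets the nested type array<struct<id:int,text:string>>
spark.sql('SELECT functionName() AS result').printSchema()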
is there any way to avoid explicitly setting the return type when registering the UDF, and instead automatically infer its schema?
There is not. The schema has to be known before the UDF is called, and it cannot be inferred at runtime: Spark needs the return type at query-planning time, so it must be declared when the UDF is registered.
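What you can avoid is building the type objects by hand: since Spark 2.3 the return type can also be passed as a DDL-formatted string, which is more compact (a sketch, reusing the hypothetical functionName and functionObject from the question):

# Equivalent to the ArrayType(StructType([...])) above, written as a DDL type string
spark.udf.register('functionName', functionObject, 'array<struct<id:int,text:string>>')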