I can't figure out how to use withField to update a nested dataframe column; I always get 'TypeError: 'Column' object is not callable'.
I have followed this example: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.Column.withField.html
from pyspark.sql import Row
from pyspark.sql.functions import lit

df = spark.createDataFrame([Row(a=Row(b=1, c=2))])
df.withColumn('a', df['a'].withField('b', lit(3))).select('a.b').show()
Which still results in:
Traceback (most recent call last):
File "C:\Users\benhalicki\Source\SparkTest\spark_nested_df_test.py", line 58, in <module>
df.withColumn('a', df['a'].withField('b', lit(3))).select('a.b').show()
TypeError: 'Column' object is not callable
Spark Version: 3.0.3 (on Windows).
Am I doing something fundamentally wrong?
withField was introduced in Spark 3.1.0, but you're using 3.0.3. The documentation notes the version requirement explicitly:

An expression that adds/replaces a field in StructType by name. New in version 3.1.0.
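If the code has to run on clusters with mixed Spark versions, you can gate on the runtime version before choosing a code path. This is a minimal sketch; the helper name `has_with_field` is my own, not a Spark API:

```python
def has_with_field(spark_version: str) -> bool:
    """Return True if Column.withField exists (added in Spark 3.1.0).

    `spark_version` is the string from spark.version, e.g. "3.0.3".
    """
    major, minor = (int(part) for part in spark_version.split(".")[:2])
    return (major, minor) >= (3, 1)

# Example: on your cluster this returns False, so use the fallback below.
has_with_field("3.0.3")  # False
has_with_field("3.1.0")  # True
```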
For older versions, you need to recreate the struct column a in order to update a field:
from pyspark.sql import functions as F
df.withColumn(
'a',
F.struct(F.lit(3).alias("b"), F.col("a.c").alias("c"))
).select('a.b').show()
#+---+
#| b|
#+---+
#| 3|
#+---+