Pyspark remove field in struct column

I want to remove a part of a value in a struct and save that version of the value as a new column in my dataframe, which looks something like this:

column
{"A": "2022-01-26T14:21:32.214+0000", "B": 69, "C": {"CA": 42, "CB": "Hello"}, "D": "XD"}

I want to remove the field C and its values and save the rest as one new column, without splitting the A, B, D fields into separate columns. What I want should look like this:

column:
{"A": "2022-01-26T14:21:32.214+0000", "B": 69, "C": {"CA": 42, "CB": "Hello"}, "D": "XD"}

newColumn:
{"A": "2022-01-26T14:21:32.214+0000", "B": 69, "D": "XD"}

I have successfully removed C by converting my dataframe to a dict, but now I can't manage to convert it back into ONE column. My attempt at removing C looks like this:

dfTemp = df.select('column').collect()[0][0].asDict(True)
dfDict = {}
for k in dfTemp:
    if k != 'C':
        dfDict[k] = dfTemp[k]

If you have a better way to remove part of a struct like mine while keeping the result in one column (without adding more rows), or if you know how to convert a dict back to a dataframe without splitting the key/value pairs into separate columns, please leave a suggestion.
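As an aside, the key-removal step itself can be done per row in plain Python with the json module. This is a sketch (`drop_key` is a hypothetical helper name, not from the question); it could be wrapped in a UDF, though the built-in Spark functions shown in the answer below avoid collecting data to the driver:

```python
import json

def drop_key(json_str, key):
    """Parse a JSON string, drop one top-level key, and re-serialize."""
    d = json.loads(json_str)
    d.pop(key, None)  # no error if the key is absent
    return json.dumps(d)

row = '{"A": "2022-01-26T14:21:32.214+0000", "B": 69, "C": {"CA": 42, "CB": "Hello"}, "D": "XD"}'
print(drop_key(row, "C"))
# → {"A": "2022-01-26T14:21:32.214+0000", "B": 69, "D": "XD"}
```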

Pontus Johansson asked Sep 06 '25


1 Answer

Assuming your column is of type string and contains JSON, you can first parse it into a StructType using from_json like this:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    ('{"A": "2022-01-26T14:21:32.214+0000", "B": 69, "C": {"CA": 42, "CB": "Hello"}, "D": "XD"}',)
], ["column"])

df = df.withColumn(
    "parsed_column",
    F.from_json("column", "struct<A:string,B:int,C:struct<CA:int,CB:string>,D:string>")
)

Now removing the field C from the struct column:

Spark >=3.1

Use dropFields method:

result = df.withColumn(
    "newColumn",
    F.to_json(F.col("parsed_column").dropFields("C"))
).drop("parsed_column")

result.show(truncate=False)

#+-----------------------------------------------------------------------------------------+----------------------------------------------------+
#|column                                                                                   |newColumn                                           |
#+-----------------------------------------------------------------------------------------+----------------------------------------------------+
#|{"A": "2022-01-26T14:21:32.214+0000", "B": 69, "C": {"CA": 42, "CB": "Hello"}, "D": "XD"}|{"A":"2022-01-26T14:21:32.214+0000","B":69,"D":"XD"}|
#+-----------------------------------------------------------------------------------------+----------------------------------------------------+

Spark <3.1

Recreate the struct column, filtering out the field C:

result = df.withColumn(
    "newColumn",
    F.to_json(
        F.struct(*[
            F.col(f"parsed_column.{c}").alias(c)
            for c in df.selectExpr("parsed_column.*").columns if c != 'C'
        ])
    )
).drop("parsed_column")

Another method: parse the JSON string into MapType, then apply the function map_filter to remove the key C:

result = df.withColumn(
    "newColumn",
    F.to_json(
        F.map_filter(
            F.from_json("column", "map<string,string>"),
            lambda k, v: k != "C"
        )
    )
)

Note that with this approach all map values are read as strings, so non-string values such as B end up quoted in the resulting JSON.
blackbishop answered Sep 07 '25