I want to remove a part of a value in a struct and save that version of the value as a new column in my dataframe, which looks something like this:
column |
---|
{"A": "2022-01-26T14:21:32.214+0000", "B": 69, "C": {"CA": 42, "CB": "Hello"}, "D": "XD"} |
I want to remove the field C
and it's values and save the rest as one new column without dividing A
, B
, D
fields into different columns. What I want should look like this:
column | newColumn |
---|---|
{"A": "2022-01-26T14:21:32.214+0000", "B": 69, "C": {"CA": 42, "CB": "Hello"}, "D": "XD"} | {"A": "2022-01-26T14:21:32.214+0000", "B": 69, "D": "XD"} |
I have successfully removed C
by converting my dataframe to a dict, but now I can't manage to convert it back into ONE column. My attempt at removing C
looks like this:
dfTemp = df.select('column').collect()[0][0].asDict(True)
dfDict = {}
for k in dfTemp:
if k != 'C':
dfDict[k] = dfTemp[k]
If you have a better way to remove a part of struct like mine and keeping the result in one column and not adding more rows or if you know how to convert a dict to a dataframe without dividing the key and value pairs into separate columns please leave a suggestion.
Assuming your column is of type string and contains json, you can first parse it into StructType using from_json
like this:
df = spark.createDataFrame([
('{"A": "2022-01-26T14:21:32.214+0000", "B": 69, "C": {"CA": 42, "CB": "Hello"}, "D": "XD"}',)
], ["column"])
df = df.withColumn(
"parsed_column",
F.from_json("column", "struct<A:string,B:int,C:struct<A:int,CB:string>,D:string>")
)
Now removing the field C
from the struct column:
Use dropFields
method:
result = df.withColumn("newColumn", F.to_json(F.col("parsed_column").dropFields("C"))).drop("parsed_column")
result.show(truncate=False)
#+-----------------------------------------------------------------------------------------+----------------------------------------------------+
#|column |newColumn |
#+-----------------------------------------------------------------------------------------+----------------------------------------------------+
#|{"A": "2022-01-26T14:21:32.214+0000", "B": 69, "C": {"CA": 42, "CB": "Hello"}, "D": "XD"}|{"A":"2022-01-26T14:21:32.214+0000","B":69,"D":"XD"}|
#+-----------------------------------------------------------------------------------------+----------------------------------------------------+
Recreate the struct column and filter the field C
result = df.withColumn(
"newColumn",
F.to_json(
F.struct(*[
F.col(f"parsed_column.{c}").alias(c)
for c in df.selectExpr("parsed_column.*").columns if c != 'C'
])
)
).drop("parsed_column")
Another method by parsing the json string values into MapType then applying function map_filter
to remove key C
:
result = df.withColumn(
"newColumn",
F.to_json(
F.map_filter(
F.from_json("column", "map<string,string>"),
lambda k, v: k != "C"
)
)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With