In PySpark, I want to save a DataFrame as a JSON file, but in the format shown below.
Say this is my DataFrame:
>>> rdd1.show()
+----------+-----+
| f1| f2|
+----------+-----+
|AAAAAAAAAA|99999|
| BBBBBBBBB|99999|
| CCCCCCCCC|99999|
+----------+-----+
If I save the above DataFrame as a JSON file, the output looks like this:
>>> rdd1.coalesce(1).write.json("file:///test_directory/sample4")
{"f1":"AAAAAAAAAA","f2":"99999"}
{"f1":"BBBBBBBBB","f2":"99999"}
{"f1":"CCCCCCCCC","f2":"99999"}
But I want it as a single JSON array, like this:
[{"f1":"AAAAAAAAAA","f2":"99999"},{"f1":"BBBBBBBBB","f2":"99999"},{"f1":"CCCCCCCCC","f2":"99999"}]
I have tried option("multiLine", "true") and lineSep=",", but neither works; these options seem to apply only when reading, not when writing. Please suggest a solution for this problem.
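Use to_json with the collect_list function to build a single JSON array string, then write it with .text(). Note that this gathers every row into one value, so it suits data small enough to fit in a single row.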
Example:
df.show()
#+-----+-----+
#| f1| f2|
#+-----+-----+
#|AAAAA| 9999|
#| BBB|99999|
#| CCCC| 9999|
#+-----+-----+
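from pyspark.sql.functions import col, collect_list, struct, to_json
# Collect all rows into one array of structs, then serialize it as a single JSON string
df.agg(to_json(collect_list(struct(col("f1"),col("f2")))).alias("d")).\
write.\
mode("overwrite").\
text("<path>")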
#output
#[{"f1":"AAAAA","f2":"9999"},{"f1":"BBB","f2":"99999"},{"f1":"CCCC","f2":"9999"}]
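If the data is small and a single local file is wanted (Spark's .text() writes a directory of part files rather than one named file), another option is to stream the rows to the driver and write the array there. The following is only a sketch under assumptions: write_json_array is a hypothetical helper, not part of PySpark, and the output path in the usage comment is made up for illustration. It accepts any iterable of dict-like rows, so it can be tried without a Spark session.

```python
import json
from typing import Any, Iterable, Mapping


def write_json_array(rows: Iterable[Mapping[str, Any]], path: str) -> None:
    """Write rows to `path` as one compact JSON array on a single line.

    Hypothetical helper: rows are written one at a time, so the full
    array is never built in memory on the driver.
    """
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("[")
        for i, row in enumerate(rows):
            if i:
                fh.write(",")  # separator between elements, none before the first
            # Compact separators match the {"f1":"...","f2":"..."} layout above
            fh.write(json.dumps(dict(row), separators=(",", ":")))
        fh.write("]")


# Assumed usage with a Spark DataFrame `df` (writes to the driver's local disk):
# write_json_array((r.asDict() for r in df.toLocalIterator()), "/tmp/sample4.json")
```

This trades parallelism for a single predictable file, so it is only reasonable when the DataFrame is small enough to pass through the driver.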