Spark is not loading all multiline JSON objects in a single file even with the multiline option set to true

My JSON file looks like the one below. It contains two multiline JSON objects in a single file:

{
    "name":"John Doe",
    "id":"123456"
}
{
    "name":"Jane Doe",
    "id":"456789"
}

When I load the file as a multiline JSON DataFrame, I expect it to load both JSON objects, but it loads only the first one. How can I load all the multiline JSON objects in a single file?

scala> val rawData = spark.read.option("multiline", true).option("mode", "PERMISSIVE").format("json").load("/tmp/search/baggage/test/1")
scala> rawData.show
+------+--------+
|    id|    name|
+------+--------+
|123456|John Doe|
+------+--------+

scala> rawData.count
res20: Long = 1

Despicable me asked Dec 18 '25 16:12

1 Answer

Your input JSON is not valid: the file contains multiple objects, but they are not wrapped in array brackets or separated by commas. You can check this with any JSON validator. That is why the multiLine option does not load both objects in this case.
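To illustrate the point about missing brackets, here is a minimal sketch that wraps the two objects from the question in a JSON array and reads the result in multiLine mode. The object name `MultilineArrayExample`, the temporary file, and the local `SparkSession` settings are assumptions made for the example, not part of the original setup.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, SparkSession}

object MultilineArrayExample {
  // The two objects from the question, wrapped in [ ] and separated by a comma,
  // so the whole file is one valid JSON document.
  val arrayJson: String =
    """[
      |  {
      |    "name": "John Doe",
      |    "id": "123456"
      |  },
      |  {
      |    "name": "Jane Doe",
      |    "id": "456789"
      |  }
      |]""".stripMargin

  // Writes the sample to a temporary file (hypothetical location) and reads it
  // back with the multiLine option enabled.
  def load(spark: SparkSession): DataFrame = {
    val path = Files.createTempFile("people", ".json")
    Files.write(path, arrayJson.getBytes(StandardCharsets.UTF_8))
    spark.read.option("multiLine", true).json(path.toUri.toString)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("multiline-array")
      .getOrCreate()
    load(spark).show()
    spark.stop()
  }
}
```

With the array form, each element of the top-level array should come back as its own row, so both John Doe and Jane Doe are expected in the DataFrame.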

That said, I think you want the JSON Lines format, where each line is a complete JSON object.

{"name":"John Doe","id":"123456"}
{"name":"Jane Doe","id":"456789"}

Spark can read this JSON without setting the multiline option:

val df = spark.read.json("file:///your/json/file.json")
df.show()

Output:

+------+--------+
|    id|    name|
+------+--------+
|123456|John Doe|
|456789|Jane Doe|
+------+--------+
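If the file cannot be regenerated at the source, one option is to convert the concatenated objects into JSON Lines before handing them to Spark. The following is a rough sketch in plain Scala, assuming every top-level value is an object or array; the name `ConcatenatedJsonToLines` is made up for the example, and the scanner only tracks nesting and string boundaries rather than validating the JSON.

```scala
object ConcatenatedJsonToLines {
  /** Splits concatenated top-level JSON objects (or arrays) into one compact
    * string per value. Whitespace outside string literals is dropped, which
    * is safe because JSON treats it as insignificant.
    */
  def toJsonLines(input: String): List[String] = {
    val lines = List.newBuilder[String]
    val current = new StringBuilder
    var depth = 0          // current nesting level of { } and [ ]
    var inString = false   // true while inside a "..." literal
    var escaped = false    // true right after a backslash inside a string

    for (c <- input) {
      if (inString) {
        current += c
        if (escaped) escaped = false
        else if (c == '\\') escaped = true
        else if (c == '"') inString = false
      } else if (!c.isWhitespace) {
        current += c
        c match {
          case '"' => inString = true
          case '{' | '[' => depth += 1
          case '}' | ']' =>
            depth -= 1
            // Back at the top level: one complete value has been read.
            if (depth == 0) {
              lines += current.toString
              current.clear()
            }
          case _ =>
        }
      }
    }
    lines.result()
  }
}
```

Applied to the file from the question, `toJsonLines` returns the two single-line objects shown above; writing them out with `mkString("\n")` gives a file that `spark.read.json` can load without the multiline option.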
blackbishop answered Dec 20 '25 07:12
