I am trying to parse a JSON file, selectively read only 50+ data elements (out of 800+) into DataFrame in PySpark. One of the data elements (issues.customfield_666) is a Struct Type (with 3 fields Id/Name/Tag under it). Sometimes data in this Struct field comes as null. When that happens, spark job execution fails with the below error. How to ignore/suppress this error for null values?
Error is happening only when parsing JSON file #1 (where customfield_66 is coming as null).
AnalysisException: Can't extract value from issues.customfield_666: need struct type but got string
JSON File 1 (Where customfield_666 has only null)
{
"startAt": 0,
"total": 1,
"issues": [
{
"id": "1",
"key": "BSE-444",
"issuetype": {
"id": "30",
"name": "Epic1",
},
"customfield_666": null
}
]
}
JSON File 2 (Where customfield_666 has both null and struct values)
{
"startAt": 0,
"total": 2,
"issues": [
{
"id": "1",
"key": "BSE-444",
"issuetype": {
"id": "30",
"name": "Epic1",
},
"customfield_666": null
},
{
"id": "2",
"key": "BSE-555",
"issuetype": {
"id": "40",
"name": "Epic2",
},
"customfield_666":
{
"tag": "Smoke Testing",
"id": "666-01",
},
}
]
}
Below is the PySpark code used to parse above JSON data.
from pyspark.sql.functions import *
rawDF = spark.read.json("abfss://[email protected]/raw/MyData.json", multiLine = "true")
DF = rawDF.select(explode("issues").alias("issues")) \
.select(
col("issues.id").alias("IssueId"),
col("issues.key").alias("IssueKey"),
col("issues.fields").alias("IssueFields"),
col("issues.issuetype.name").alias("IssueTypeName"),
col("issues.customfield_666.tag").alias("IssueCust666Tag")
)
You may check if it is null first
from pyspark.sql import functions as F
DF = rawDF.select(F.explode("issues").alias("issues")) \
.select(
F.col("issues.id").alias("IssueId"),
F.col("issues.key").alias("IssueKey"),
F.col("issues.fields").alias("IssueFields"),
F.col("issues.issuetype.name").alias("IssueTypeName"),
F.when(
F.col("issues.customfield_666").isNull() | (F.trim(F.col("issues.customfield_666").cast("string"))==""), None
).otherwise(
F.col("issues.customfield_666.tag")
).alias("IssueCust666Tag")
)
Let me know if this works for you
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With