
PySpark / Spark SQL DataFrame - Error while parsing Struct Type when data is null

I am trying to parse a JSON file and selectively read only 50+ data elements (out of 800+) into a DataFrame in PySpark. One of the data elements (issues.customfield_666) is a struct type (with three fields, id/name/tag, under it). Sometimes the data in this struct field comes as null. When that happens, the Spark job fails with the error below. How can I ignore/suppress this error for null values?

The error happens only when parsing JSON file #1 (where customfield_666 is always null).

AnalysisException: Can't extract value from issues.customfield_666: need struct type but got string

JSON File 1 (where customfield_666 is always null):
{
    "startAt": 0,
    "total": 1,
    "issues": [
        {
            "id": "1",
            "key": "BSE-444",
            "issuetype": {
                "id": "30",
                "name": "Epic1"
            },
            "customfield_666": null
        }
    ]
}


JSON File 2 (where customfield_666 has both null and struct values):
{
    "startAt": 0,
    "total": 2,
    "issues": [
        {
            "id": "1",
            "key": "BSE-444",
            "issuetype": {
                "id": "30",
                "name": "Epic1"
            },
            "customfield_666": null
        },
        {
            "id": "2",
            "key": "BSE-555",
            "issuetype": {
                "id": "40",
                "name": "Epic2"
            },
            "customfield_666": {
                "tag": "Smoke Testing",
                "id": "666-01"
            }
        }
    ]
}

Below is the PySpark code used to parse the above JSON data.

from pyspark.sql.functions import col, explode

rawDF = spark.read.json("abfss://[email protected]/raw/MyData.json", multiLine=True)

DF = rawDF.select(explode("issues").alias("issues")) \
          .select(
              col("issues.id").alias("IssueId"),
              col("issues.key").alias("IssueKey"),
              col("issues.fields").alias("IssueFields"),
              col("issues.issuetype.name").alias("IssueTypeName"),
              col("issues.customfield_666.tag").alias("IssueCust666Tag")
          )
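Inspecting the inferred schema shows why only file #1 fails: when every value of customfield_666 in a file is null, Spark's JSON schema inference has no struct to base the type on and falls back to a plain string, so the .tag extraction is rejected at analysis time. An illustrative check (output abbreviated):

rawDF.printSchema()
# For file #1, the inferred element type contains roughly:
#     |-- customfield_666: string (nullable = true)
# For file #2, it contains:
#     |-- customfield_666: struct (nullable = true)
#     |    |-- id: string (nullable = true)
#     |    |-- tag: string (nullable = true)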

1 Answer

You may check if it is null first:

from pyspark.sql import functions as F

DF = rawDF.select(F.explode("issues").alias("issues")) \
          .select(
              F.col("issues.id").alias("IssueId"),
              F.col("issues.key").alias("IssueKey"),
              F.col("issues.fields").alias("IssueFields"),
              F.col("issues.issuetype.name").alias("IssueTypeName"),
              # Return null when customfield_666 is null (or casts to an
              # empty string); only otherwise extract the nested tag field.
              F.when(
                  F.col("issues.customfield_666").isNull()
                  | (F.trim(F.col("issues.customfield_666").cast("string")) == ""),
                  None
              ).otherwise(
                  F.col("issues.customfield_666.tag")
              ).alias("IssueCust666Tag")
          )

Let me know if this works for you.
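Note that the when/otherwise guard only helps when Spark has already inferred customfield_666 as a struct (file #2). For a file where every value is null (file #1), inference types the field as a plain string, and the analyzer still rejects the .tag extraction even inside otherwise. A common workaround is to supply an explicit schema at read time so inference never runs; extracting tag from a null struct then simply yields null. A minimal sketch, assuming the files match the layout shown in the question (the abfss path is the question's placeholder):

from pyspark.sql.functions import col, explode
from pyspark.sql.types import ArrayType, LongType, StringType, StructField, StructType

# Declare customfield_666 as a struct up front so Spark never infers it
# as a string when a file contains only nulls for it.
issue_schema = StructType([
    StructField("id", StringType()),
    StructField("key", StringType()),
    StructField("issuetype", StructType([
        StructField("id", StringType()),
        StructField("name", StringType()),
    ])),
    StructField("customfield_666", StructType([
        StructField("id", StringType()),
        StructField("name", StringType()),
        StructField("tag", StringType()),
    ])),
])

schema = StructType([
    StructField("startAt", LongType()),
    StructField("total", LongType()),
    StructField("issues", ArrayType(issue_schema)),
])

rawDF = spark.read.schema(schema).json("abfss://[email protected]/raw/MyData.json", multiLine=True)

# Works for both files; tag is simply null where customfield_666 is null.
DF = rawDF.select(explode("issues").alias("issues")) \
          .select(col("issues.customfield_666.tag").alias("IssueCust666Tag"))

A side benefit: with read.schema(...), only the declared fields are materialized, which also fits the goal of reading just 50+ of the 800+ data elements.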
