
PySpark / Spark SQL DataFrame - Error while parsing Struct Type when data is null

I am trying to parse a JSON file and selectively read only 50+ data elements (out of 800+) into a DataFrame in PySpark. One of the data elements (issues.customfield_666) is a struct type (with three fields, id/name/tag, under it). Sometimes the data in this struct field comes as null. When that happens, the Spark job fails with the error below. How can I ignore/suppress this error for null values?

The error happens only when parsing JSON file #1 (where customfield_666 is always null).

AnalysisException: Can't extract value from issues.customfield_666: need struct type but got string

JSON File 1 (where customfield_666 is always null):
{
    "startAt": 0,
    "total": 1,
    "issues": [
        {
            "id": "1",
            "key": "BSE-444",
            "issuetype": {
                "id": "30",
                "name": "Epic1"
            },
            "customfield_666": null
        }
    ]
}


JSON File 2 (where customfield_666 has both null and struct values):
{
    "startAt": 0,
    "total": 2,
    "issues": [
        {
            "id": "1",
            "key": "BSE-444",
            "issuetype": {
                "id": "30",
                "name": "Epic1"
            },
            "customfield_666": null
        },
        {
            "id": "2",
            "key": "BSE-555",
            "issuetype": {
                "id": "40",
                "name": "Epic2"
            },
            "customfield_666": {
                "tag": "Smoke Testing",
                "id": "666-01"
            }
        }
    ]
}

Below is the PySpark code used to parse the above JSON data.

from pyspark.sql.functions import col, explode

rawDF = spark.read.json("abfss://[email protected]/raw/MyData.json", multiLine=True)

DF = rawDF.select(explode("issues").alias("issues")) \
          .select(
              col("issues.id").alias("IssueId"),
              col("issues.key").alias("IssueKey"),
              col("issues.fields").alias("IssueFields"),
              col("issues.issuetype.name").alias("IssueTypeName"),
              col("issues.customfield_666.tag").alias("IssueCust666Tag")
          )
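Inspecting the inferred schema shows why only file #1 fails: when every value of customfield_666 in a file is null, Spark's JSON schema inference has no struct to base the type on and falls back to a plain string, so the .tag extraction is rejected at analysis time. An illustrative check (output abbreviated):

rawDF.printSchema()
# For file #1, the inferred element type contains roughly:
#     |-- customfield_666: string (nullable = true)
# For file #2, it contains:
#     |-- customfield_666: struct (nullable = true)
#     |    |-- id: string (nullable = true)
#     |    |-- tag: string (nullable = true)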

1 Answer

You may check if it is null first:

from pyspark.sql import functions as F

DF = rawDF.select(F.explode("issues").alias("issues")) \
          .select(
              F.col("issues.id").alias("IssueId"),
              F.col("issues.key").alias("IssueKey"),
              F.col("issues.fields").alias("IssueFields"),
              F.col("issues.issuetype.name").alias("IssueTypeName"),
              # Return null when customfield_666 is null (or casts to an
              # empty string); only otherwise extract the nested tag field.
              F.when(
                  F.col("issues.customfield_666").isNull()
                  | (F.trim(F.col("issues.customfield_666").cast("string")) == ""),
                  None
              ).otherwise(
                  F.col("issues.customfield_666.tag")
              ).alias("IssueCust666Tag")
          )

Let me know if this works for you.
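Note that the when/otherwise guard only helps when Spark has already inferred customfield_666 as a struct (file #2). For a file where every value is null (file #1), inference types the field as a plain string, and the analyzer still rejects the .tag extraction even inside otherwise. A common workaround is to supply an explicit schema at read time so inference never runs; extracting tag from a null struct then simply yields null. A minimal sketch, assuming the files match the layout shown in the question (the abfss path is the question's placeholder):

from pyspark.sql.functions import col, explode
from pyspark.sql.types import ArrayType, LongType, StringType, StructField, StructType

# Declare customfield_666 as a struct up front so Spark never infers it
# as a string when a file contains only nulls for it.
issue_schema = StructType([
    StructField("id", StringType()),
    StructField("key", StringType()),
    StructField("issuetype", StructType([
        StructField("id", StringType()),
        StructField("name", StringType()),
    ])),
    StructField("customfield_666", StructType([
        StructField("id", StringType()),
        StructField("name", StringType()),
        StructField("tag", StringType()),
    ])),
])

schema = StructType([
    StructField("startAt", LongType()),
    StructField("total", LongType()),
    StructField("issues", ArrayType(issue_schema)),
])

rawDF = spark.read.schema(schema).json("abfss://[email protected]/raw/MyData.json", multiLine=True)

# Works for both files; tag is simply null where customfield_666 is null.
DF = rawDF.select(explode("issues").alias("issues")) \
          .select(col("issues.customfield_666.tag").alias("IssueCust666Tag"))

A side benefit: with read.schema(...), only the declared fields are materialized, which also fits the goal of reading just 50+ of the 800+ data elements.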
