The following is a toy example, a subset of my actual data's schema, abbreviated for readability.
I am looking to build a PySpark DataFrame that contains three fields, ID, Type, and TIMESTAMP, which I would then save as a Hive table. I am struggling with the PySpark code to extract the relevant columns.
|-- Records: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- FileID: long (nullable = true)
| | |-- SrcFields: struct (nullable = true)
| | | |-- ID: string (nullable = true)
| | | |-- Type: string (nullable = true)
| | | |-- TIMESTAMP: string (nullable = true)
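For reference, a minimal DataFrame with this shape can be built from a couple of made-up rows (the values below are placeholders, not my real data):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Two placeholder rows shaped like the schema above:
# each row holds a Records array of (FileID, SrcFields) structs.
data = [
    ([(1, ("A123", "typeA", "2021-01-01 00:00:00"))],),
    ([(2, ("B456", "typeB", "2021-01-02 00:00:00"))],),
]

schema = (
    "Records array<struct<FileID: bigint, "
    "SrcFields: struct<ID: string, Type: string, TIMESTAMP: string>>>"
)

df = spark.createDataFrame(data, schema)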
Thus far, I imagine my solution should look something like:
from pyspark.sql.functions import col, explode
df.withColumn("values", explode("values")).select(
"*", col("values")["name"].alias("name"), col("values")["id"].alias("id")
)
However, the solution above doesn't account for the extra nesting in my use case, and I'm unable to figure out the additional syntax required.
In PySpark you can access subfields of a struct using dot notation, so the missing piece is just to explode the Records array first and then reach through SrcFields. Something like this should work:
from pyspark.sql.functions import col, explode

result = (
    df
    # One output row per element of the Records array
    .withColumn("values", explode("Records"))
    # Dot notation reaches through the nested SrcFields struct
    .select(
        col("values.SrcFields.ID").alias("id"),
        col("values.SrcFields.Type").alias("type"),
        col("values.SrcFields.TIMESTAMP").alias("timestamp"),
    )
)
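Since you also want to save the result as a Hive table, here is a minimal sketch of the write step; the name my_db.my_table is a placeholder, not something from your post:

# Persist the extracted columns as a managed table in the Hive metastore.
# "my_db.my_table" is a hypothetical name; "overwrite" replaces any existing table.
result.write.mode("overwrite").saveAsTable("my_db.my_table")

This assumes your SparkSession was created with .enableHiveSupport(), so that saveAsTable registers the table in the Hive metastore.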