
Spark Dataframe - How to get a particular field from a struct type column

I have a data frame with a structure like this:

root
 |-- npaDetails: struct (nullable = true)
 |    |-- additionalInformation: struct (nullable = true)
 |    |-- npaStatus: struct (nullable = true)
 |    |-- npaDetails: struct (nullable = true)
 |-- npaHeaderData: struct (nullable = true)
 |    |-- npaNumber: string (nullable = true)
 |    |-- npaDownloadDate: string (nullable = true)     
 |    |-- npaDownloadTime: string (nullable = true) 

I want to retrieve all the npaNumber values from all the rows in the dataframe.

My approach was to iterate over all rows in the data frame, extracting from each one the value stored in the npaNumber field of the npaHeaderData column. So I coded the following lines:

parquetFileDF.foreach { newRow =>  

  //To retrieve the second column
  val column = newRow.get(1)

  //The following line is not allowed
  //val npaNumber= column.getAs[String]("npaNumber")  

  println(column)

}

The content of column printed in each iteration looks like:

[207400956,27FEB17,09.30.00]

But column is of type Any and I am not able to extract any of its fields. Can anyone tell me what I am doing wrong, or what approach I should follow instead?

Thanks

Ignacio Alorre


1 Answer

If you are looking to extract only npaNumber, you can do:

parquetFileDF.select($"npaHeaderData.npaNumber".as("npaNumber"))

You should get a dataframe with the npaNumber column only.
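
If you also need the values back on the driver, or want to see why the original foreach failed, a minimal sketch along these lines should work (the parquet path is a placeholder, and the getAs[Row] cast is based on the schema shown in the question, since struct columns come back as Row):

import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder().appName("npa-example").getOrCreate()
import spark.implicits._

// Placeholder path; substitute your own parquet source
val parquetFileDF = spark.read.parquet("/path/to/npa.parquet")

// Option 1: project the nested field directly, as shown above
val npaNumbersDF = parquetFileDF.select($"npaHeaderData.npaNumber".as("npaNumber"))

// Option 2: fix the original foreach by asking for a Row instead of Any.
// getAs[Row] keeps the nested struct's fields reachable by name.
parquetFileDF.foreach { newRow =>
  val header = newRow.getAs[Row]("npaHeaderData")
  val npaNumber = header.getAs[String]("npaNumber")
  // Note: foreach runs on the executors, so this prints in the executor logs
  // unless you are running in local mode.
  println(npaNumber)
}

// Option 3: bring the values back to the driver as a plain Scala collection
val npaNumbers: Array[String] = npaNumbersDF.as[String].collect()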


Ramesh Maharjan