Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Create array containing first element of each struct in an array in a Spark dataframe field

How do I go from an array of structs to an array of the first element of each struct, within a PySpark dataframe?

An example will make this clearer. Let's say I have the dataframe defined as follows:

scoresheet = spark.createDataFrame([("Alice", [("Math",100),("English",80)]),("Bob", [("Math", 90)]),("Charlie", [])],["name","scores"])

The schema and the dataframe defined by the above look as follows:

root
 |-- name: string (nullable = true)
 |-- scores: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _1: string (nullable = true)
 |    |    |-- _2: long (nullable = true)

+-------+--------------------------+
|name   |scores                    |
+-------+--------------------------+
|Alice  |[[Math,100], [English,80]]|
|Bob    |[[Math,90]]               |
|Charlie|[]                        |
+-------+--------------------------+

You can see that the subjectwise marks are contained in an ordered struct of type (Subject,Marks) for each student. The number of subjects for each student is not constant and may be zero.

I'd like to go from this to producing a new dataframe which only contains the subjects in an array for each student, without the marks. It should produce an empty array for students with no subjects. In short, it should look like this:

+-------+---------------+
|name   |scores         |
+-------+---------------+
|Alice  |[Math, English]|
|Bob    |[Math]         |
|Charlie|[]             |
+-------+---------------+

Note that the number of rows is the same as before; so I can't use explode for this unless I regroup afterwards, which seems computationally inefficient.

like image 426
xenocyon Avatar asked Sep 05 '25 04:09

xenocyon


1 Answers

Best you can do is udf:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

take_first = udf(lambda rows: [row[0] for row in rows], ArrayType(StringType()))

scoresheet.withColumn("scores", take_first("scores"))
like image 63
user7322720 Avatar answered Sep 08 '25 01:09

user7322720